[Figure: Root Cause]

 

I remember the largest outage of my career. Late on a Friday evening, I received a call from my incident center saying that the entire development side of my VMware environment was down, and that there was the potential for a rolling outage that could, quite possibly, take down my production environment as well.

 

What followed was a weekend of finger-pointing and root cause analysis among my team, the virtual data center group, and the storage group. Our org had hired IBM as the first line of defense on these Sev-1 calls, and IBM brought EMC and VMware into the problem-resolution process as the issue moved up the call chain, yet still the finger-pointing continued. By 7 a.m. on Monday, we’d gotten the environment back up and running for our user community, and we’d been able to isolate the root cause and ensure that this particular issue would never recur. Others certainly would, but not this one.

 

Have you experienced circumstances like this at work? I imagine that most of you have.

 

So, what do you do? What may seem obvious to one person may not be obvious to another. Of course, you can troubleshoot the way I do: Occam’s Razor, or the principle of parsimony, is my course of action. Apply logic, and force yourself to try the easiest and least painful solutions first. Once you’ve exhausted those, move on to the less obvious and less likely possibilities.

 

Early in my career, I was asked what I’d do as my first troubleshooting maneuver for a Windows workstation having difficulty connecting to the network. My response was to save the work that was open on the machine locally, then reboot. If that didn’t solve the connectivity issue, I’d check the cabling at the desk, then the cross-connect, before even looking at driver issues.
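To make that ordering concrete, here is a minimal sketch of a parsimony-first checklist in Python: the cheapest, most likely fixes run before the expensive ones, and the process stops at the first success. Every check here is a hypothetical placeholder that only prints what it would do; real diagnostics would take their place.

# A minimal sketch of "parsimony-first" troubleshooting, modeled on the
# workstation example above. The checks are hypothetical placeholders.

def save_work_and_reboot() -> bool:
    """Cheapest step first: save open work locally, then reboot."""
    print("Saving local work and rebooting...")
    return False  # pretend the reboot did not restore connectivity

def check_desktop_cabling() -> bool:
    print("Checking the patch cable at the desk...")
    return False

def check_cross_connect() -> bool:
    print("Checking the cross-connect in the wiring closet...")
    return True  # pretend this turned out to be the culprit

def check_nic_driver() -> bool:
    print("Only now: reviewing NIC driver and firmware versions...")
    return False

# Ordered from least to most painful, exactly as described above.
CHECKS = [
    save_work_and_reboot,
    check_desktop_cabling,
    check_cross_connect,
    check_nic_driver,
]

def troubleshoot() -> None:
    for check in CHECKS:
        if check():
            print(f"Resolved by: {check.__name__}")
            return
    print("Simple explanations exhausted; time to escalate.")

if __name__ == "__main__":
    troubleshoot()

The point is not the code itself but the discipline it encodes: the order of the list is the troubleshooting philosophy.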

 

Simple parsimony, aka economy in the use of means to an end, is often the ideal approach.

 

Today’s data centers have complex architectures. Often, they’ve grown up over long periods of time, with many hands in the architectural mix, and the logic behind why things were done the way they were has been lost. As a result, troubleshooting application or infrastructure issues can be just as complex.

 

Understanding recent changes, patching, and the like can be an excellent way to focus your efforts. For example, patching Windows servers has been known to break applications, and a firewall rule implementation can certainly break the ways in which the pieces of an application stack interact. Again, these are important things to know when you approach troubleshooting an issue.

 

But what do you do if there is no guidance on these changes? There are a great number of monitoring applications out there that can track key changes in the environment and point the troubleshooter toward potential issues. I am an advocate for integrating change management software with help desk software, and I would add to that a feed into this operations element from some SIEM collection element. The catch is the number of these components already in place at an organization: would the company rather replace those tools in favor of an all-in-one solution, or try to cobble the pieces together? Given the nature of enterprise architectural choices, it is hard to find a single overall component that accommodates all of the choices made throughout an organization’s history.
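To illustrate what that kind of integration might buy you, here is a hedged sketch that correlates an incident window with recent change records for the affected systems. This is the correlation a combined change management, help desk, and SIEM feed would automate; the system names and change records below are invented sample data, not output from any particular product.

# A sketch of correlating recent changes with an incident window.
# In practice the records would come from change management, help desk,
# or SIEM tooling; these are invented examples.

from datetime import datetime, timedelta

SAMPLE_CHANGES = [
    {"when": datetime(2023, 6, 9, 22, 0), "system": "win-app-01",
     "summary": "Monthly Windows security patches applied"},
    {"when": datetime(2023, 6, 9, 23, 30), "system": "fw-core",
     "summary": "New firewall rule restricting east-west traffic"},
    {"when": datetime(2023, 6, 2, 10, 0), "system": "esx-07",
     "summary": "HBA firmware update"},
]

def suspect_changes(incident_time, affected_systems, lookback_hours=48):
    """Return changes to the affected systems within the lookback window."""
    window_start = incident_time - timedelta(hours=lookback_hours)
    return [
        change for change in SAMPLE_CHANGES
        if change["system"] in affected_systems
        and window_start <= change["when"] <= incident_time
    ]

if __name__ == "__main__":
    incident = datetime(2023, 6, 10, 1, 15)
    for change in suspect_changes(incident, {"win-app-01", "fw-core"}):
        print(f"{change['when']}  {change['system']}: {change['summary']}")

Even a simple report like this, produced automatically at the start of a Sev-1 call, shortens the argument over what changed and when.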

 

Again, this is a caveat emptor situation. Do the research and find the solution that best addresses your issues, helps you determine an appropriate course of action, and comes closest to an overall answer to the problem at hand.