So far this month, we've talked about the difficulty of monitoring complex, interconnected systems; the merging of traditional IT skills; and tool sprawl. You've shared some great insights into these common problems. I'd like to finish up my diplomatic tenure with yet another dreaded reality of life in IT: Root Cause Analysis.
Except... I've seen more occurrences of Root Cause Paralysis lately. I'll explain. I've seen a complex system suffer a major outage because of a simple misconfiguration on an unmonitored storage array. And that simple misconfiguration in turn revealed several bad design decisions that were predicated on the misconfiguration. Once the incident had been resolved, management demanded a root cause analysis to determine the exact cause of the outage, and to implement a permanent corrective action. All normal, reasonable stuff.
The Paralysis began when representatives from multiple engineering groups arrived at the RCA meeting. It was the usual suspects: network, application, storage, and virtualization. We began with a discussion of the network, and the network engineers presented a ton of performance and log data from the morning of the outage to indicate that all was well in Cisco-land. (To their credit, the network guys even suggested a few highly unlikely scenarios in which their equipment could have caused the problem.) We moved to the application team, who presented some SCOM reports that showed high latency just before and during the outage. But when we got to the virtualization and storage components, all we had was a hearty, "everything looked good." That was it. No data, no reports, no graphs to quantify "good."
So my final questions for you: