Except... I've seen more occurrences of Root Cause Paralysis lately. I'll explain. I've seen a complex system suffer a major outage because of a simple misconfiguration on an unmonitored storage array. And that simple misconfiguration in turn revealed several bad design decisions that were predicated on the misconfiguration. Once the incident had been resolved, management demanded a root cause analysis to determine the exact cause of the outage, and to implement a permanent corrective action. All normal, reasonable stuff.
The Paralysis began when representatives from multiple engineering groups arrived at the RCA meeting. It was the usual suspects: network, application, storage, and virtualization. We began with a discussion on the network, and the network engineers presented a ton of performance and log data from the morning of the outage to indicate that all was well in Cisco-land. (To their credit, the network guys even suggested a few highly unlikely scenarios in which their equipment could have caused the problem.) We moved to the application team, who presented some SCOM reports that showed high latency just before and during the outage. But when we got to the virtualization and storage components, all we had was a hearty, "everything looked good." That was it. No data, no reports, no graphs to quantify "good."
So my final questions for you:
Has this situation played out in your office before?
What types of information do you bring with you to defend your part of the infrastructure?
Do you prep for these meetings, or do you just show up and hope for the best?