Among those of us who implement and maintain network management systems, the topic of root cause analysis is one of much angst and frustration. Of course we'd all like our monitoring systems to be smart enough to correlate hundreds of seemingly unrelated events into some sort of epiphany of what's "really wrong", but as many of us have found out, that's a pretty lofty goal.
Back when I ran a consulting company I loved root cause analysis/event correlation projects. Who wouldn't? It's a great way to generate tons of revenue and keep your engineers highly engaged without ever really delivering anything useful to the customer. Most times, these projects don't end until the customer either runs out of money or your contact there gets fired for funding a project that never went anywhere. Perfect, right?
I've had the privilege of working in and visiting some of the largest NOCs in the world, and in almost all of them there sits a system that is supposed to correlate network events into meaningful information about where the problems exist - and then there's another product, or in some cases even a home-grown ping tool, that actually monitors the network...
Root cause analysis is a good thing. The concept of correlating events to get a better understanding of the big picture is also a good thing. Where people tend to go wrong is that they don't head down this road with clear, achievable milestones in mind and end up basically driving around forever. Failing to define what is "good enough" is a good way to ensure that you'll never end a project like this.
So, how do you get what you really need in terms of suppressing alerts, defining dependencies, and correlating events without getting lost on the road to the Holy Grail of root cause analysis? Well, next week I'll tell you, but right now I'm heading out to Northern Illinois for an early goose hunt with Captain Bob from Migratory Outfitters. If you've got suggestions or comments, post them here and I'll include them in the list. Until next week...