As a longtime IT professional, I find that responding to problems in systems of one type or another has become old hat. It comes with the job description, and one tends to develop habits and methodologies early on in one's career. That doesn't mean that I, or anyone else, have developed the best habits, however, and often our methods are quite ineffective indeed.
Standard practice in the industry is for a network operations center (NOC) to monitor some portion of the network for immediate or impending troubles. Companies spend millions of dollars on entire rooms filled with beautiful monitors mounted on walls, desks and workstations built to look as futuristic as possible, low lights at just the right hue, and comprehensive monitoring suites to keep track of it all. The trouble is, this often makes people aware of a problem, but offers nothing in the way of a troubleshooting methodology or tool to actually fix the problem.
As often as not, when some sort of event worth responding to grabs the attention of a NOC engineer, they either call someone or start a trouble ticket, or both. The lucky recipient of the aforementioned prodding then digs into the problem or passes it on to the next person in the chain, with each successive person having to start over in their own domain (compute, network, security, etc.) with new tools and limited information.
This entire approach may seem logical and even expedient, though I suspect that's largely due to a little bit of Stockholm Syndrome and the ever-popular "but this is the way we've always done it" argument. I'm not saying that this is a bad approach, or at least I'm not saying that it was always a bad one, given the historical dearth of cross-silo troubleshooting tools available on the market. Most of us instinctively knew that this was inefficient, but didn't have a good sense of what we could do about it.
Various tools and paradigms that attempted to fill the full-stack troubleshooting void were suggested, developed, sold, and subsequently put on shelves. Comprehensive network tools are among the favorites, offering a truly staggering array of dashboards, widgets, alerts, and beautiful graphics in a noble attempt to present the most information possible to the engineers tasked with fixing the relevant problems. Many tools also exist for doing the same thing inside of virtual environments, or on storage arrays, or in the cloud, etc., and many are very good at what they do. But they don't do what we need, which is to collapse the silos between IT disciplines into one unified system. Until now, that is.
SolarWinds NPM, part of the Orion Platform suite of products, has long been the darling of NOCs everywhere, and with good reason. It is a comprehensive and well-thought-out approach to network and systems monitoring. Collapsing the silos in IT, however, requires more than just a great tool for the NOC, or even a great tool for the network and systems teams. It requires a tool that is not only useful for all of these teams, but also preserves the chain of troubleshooting data as it moves between specialties. In other words, if I'm the systems guy, I want to see the data that the network team is seeing, and the steps they've taken to resolve the problem. The thing is, I want to see it in the system, not in a hastily or poorly crafted email, which is the equivalent of tossing a flaming bag of excrement over the wall on our way out.
NPM 12.1 has taken a stab--a good stab--at solving these problems with the inclusion of a tool called PerfStack. I'll be exploring what this tool can do, and where in the troubleshooting process it fits, in a series of blog posts over the coming weeks. I'll likely also toss in some of my own personal horror stories of troubleshooting problems, as I've had more than my fair share in my past, and confession is cathartic. In the meantime, I'd encourage everyone to check out this already fantastic series of posts on the new tool: