As technology professionals, we live in an interruption-driven world; responding to incidents is part of the job. All of our other job duties go out the window when a new issue hits the desk. Having the right information and understanding the part it plays in the organization is key to handling these incidents with speed and accuracy. This is why it's critical to have the ability to compare apples-to-apples when it comes to the all-important troubleshooting process.
What is our job as IT professionals?
Simply put, our job is to deliver services to end-users. It doesn't matter if those end-users are employees, customers, local, remote, or some combination of these. This may encompass things as simple as making sure a network link is running without errors, a server is online and responding, a website is handling requests, or a database is processing transactions. Of course, for most of us, it's not a single thing, it's a combination of them. And considering the fact that 95 percent of organizations report having migrated critical applications and IT infrastructure to the cloud over the past year, according to the SolarWinds IT Trends Report 2017, visibility into our infrastructure is getting increasingly murky.
So, why does this matter? Isn't it the responsibility of each application owner to make sure their portion of the environment is healthy? Yes and no. Ultimately, everyone is responsible for making sure that the services necessary for organizational success are met. Getting mean time to resolution (MTTR) down requires cooperation, not hostility. Blaming any one individual or team will invariably lead to a room full of people pointing fingers. This is counterproductive and must be avoided. There is a better way: prevention via comprehensive IT monitoring.
Monitoring solutions come in all shapes and sizes. Furthermore, they come with all manner of targets. We can use solutions specific to vendors or specific to infrastructure layers. A storage administrator may use one solution while a virtualization and server administrator may use another and the team handling website performance a third solution. And, of course, none of these tools may be applicable to the database administrators.
At best, monitoring infrastructure with disparate systems can be confusing, at worst, it can be downright dangerous. Consider the simple example of a network monitoring solution seeing traffic moving to a server at 50 megs/second, but the server monitoring solution sees incoming traffic at 400 megs/second. Which one is right? Maybe both of them, depending on if they mean 50 MBps and 400 Mbps. This is just the start of the confusion. What happens if your virtualization monitoring tool reports in Kb/sec and your storage solution reports in MB/sec? Also, when talking about kilos, does it mean 1,000 or 1,024?
You can see how the complexity of analyzing disparate metrics can very quickly grow out of hand. In the age of hybrid IT, this gets even more complex since cloud monitoring is inherently different than monitoring on-premises resources. You shouldn't have to massage the monitoring data you receive when troubleshooting a problem. That only serves to lengthen MTTR.
In the past, I've worked in environments with multiple monitoring solutions in place. During multi-team troubleshooting sessions, we've had to handle the above calculations on the fly. Was it successful? Yes, we were able to get the issue remedied. Was it as quick as it should have been? No, because we were moving data into spreadsheets, trying to align timestamps, and calculating differences in scale (MB, Mb, KB, Kb, etc.). This is what I mean by data normalization: making sure everyone is on the same page with regard to time and scale.
Single pane of glass
Having everything you need in one place with the timestamps lined up and everything reporting with the same scale — a single pane of glass through which you see your entire environment — is critical to effective troubleshooting. Remember, our job is to provide services to our end-users and resolve issues as quickly as possible. If we spend the first half of our troubleshooting time trying to line up data, are we really addressing the problem?
About the Post
This is a cross-post from my personal blog @ blog.kmsigma.com. [Link].