I don’t think there’s anyone out there who truly loves systems monitoring. I may be wrong, but traditionally, it’s not the most exciting thing to work on. When you think of systems monitoring, what comes to mind? For me, as an admin, it’s a vision of reading through logs and looking at events and timestamps, and that keeps me up at night. I definitely see the need for monitoring, and your CIO or IT manager definitely wants you to set up the tooling to get any and all metrics and data you can dig up on performance. Then there’s the ‘root cause’ issue. The decision makers want to know what the root cause was when their application crashed and went down for four hours. You get that data from a good monitoring tool. Well, time to put on a happy face and implement one. Not just any tool will do, though. You want a tool that isn’t just going to show you a bunch of red and green lights. For it to be successful, there has to be something in it for you! I’m going to lay out my top three things that a good monitoring tool can do for you, the admin or engineer in the trenches day in and day out!
Find the Root Cause
Probably the single best thing a (good) systems monitoring tool can do is find the root cause of an issue that has become seriously distressing for your team. If you’ve been in IT long enough, the experience of an unexplained outage is all too familiar. After the outage is finally fixed and things are back online, the first thing the higher-ups want to know is “why?” or “what was the root cause?” I cringe whenever I hear this. It means I need to dig through system logs, application event logs, networking logs, and any other avenue I might have to find the fabled root cause. Most great monitoring tools today have root cause analysis (RCA) built in. RCA can save you hours or even days of poring over logs. In discussions about implementing a systems monitoring tool, make sure RCA is high on your list of requirements.
Establish a Performance Baseline
How are you supposed to know whether an alert is an actual event or just a false positive? How could you point out something that’s out of the norm for your environment? Well, you can’t, unless you have a monitoring tool in place that learns what normal activity looks like and which events are genuine anomalies. That matters, because false positives can eat up a lot of resources for nothing. With some tools that offer high-frequency polling, you can pull baseline statistics for behavior down to the second. Any good monitoring tool will take a while to collect and analyze data before producing metrics that have meaning to your organization. Over time, the tool keeps adapting its baseline as it collects more data, so the metrics it gives you stay up to date and accurate.
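To make the baseline idea a little more concrete, here is a minimal sketch in Python. It is not how any particular monitoring product works. It simply assumes "normal" can be summarized as a mean and standard deviation of past samples, and that a reading more than three standard deviations away is worth a look. The function names, the threshold, and the CPU numbers are all made up for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return mean(samples), stdev(samples)


def is_anomaly(value, baseline, threshold=3.0):
    """Return True when value is more than `threshold` standard
    deviations away from the baseline mean."""
    avg, spread = baseline
    if spread == 0:
        # Perfectly flat history: any different value is out of the norm.
        return value != avg
    return abs(value - avg) / spread > threshold


# Hypothetical CPU utilization samples (percent), not real data.
history = [41, 43, 40, 42, 44, 39, 41, 43, 42, 40]
baseline = build_baseline(history)

print(is_anomaly(45, baseline))  # slightly high, still within the norm
print(is_anomaly(95, baseline))  # far outside the norm
```

A real tool would rebuild the baseline continuously and account for things like time of day, but the principle is the same: without a record of what normal looks like, you have nothing to compare an alert against.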
Reports, Reports, Reports
When issues arise, or RCA needs to be done, you want the systems monitoring tool to be capable of producing reports. Reports can come in the form of an exportable .csv, .xls, or .pdf file. Some managers like a printout, a hard copy they can write on and mark up. With the ability to produce reports, you can build a solid history of network or systems behavior that you can store in SharePoint or whatever file share you have. Most tools keep an archive or history of reports, but it’s always good to have the option of exporting for backup and recovery purposes. I’ve found that a sortable Excel file that I can search through comes in very handy when I need to really dig in and find an issue that might be hiding in the metrics.
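If your tool only gives you raw data, a few lines of Python can turn it into a sortable .csv that opens in Excel. The sketch below uses only the standard library. The field names and the sample events are hypothetical, so swap in whatever your own tool actually exports.

```python
import csv
import io

# Hypothetical column layout for an exported report.
FIELDS = ["timestamp", "host", "metric", "value"]


def write_report(events, out):
    """Write events to the file-like object `out` as CSV, oldest first."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    # ISO 8601 timestamps in the same time zone sort correctly as plain strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        writer.writerow(event)


# Made-up sample events, for illustration only.
events = [
    {"timestamp": "2024-01-01T09:05:00", "host": "web01", "metric": "cpu_pct", "value": 97},
    {"timestamp": "2024-01-01T09:00:00", "host": "db01", "metric": "disk_pct", "value": 88},
]

buffer = io.StringIO()
write_report(events, buffer)
report_text = buffer.getvalue()
print(report_text)
```

To save the report to disk instead of a string, pass a real file, for example `open("report.csv", "w", newline="")`, in place of the `StringIO` buffer.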
Systems monitoring tools can do so much for your organization, and more importantly, for you! When you are looking for a systems monitoring tool, sift through all the bells and whistles and make sure that at least these three features are built in. It might save your hide one day, trust me!