Implementing a Systems Monitoring Tool: What’s in it for You?

I don’t think there’s anyone out there who truly loves systems monitoring. I may be wrong, but traditionally, it’s not the most awesome tool to play with. When you think of systems monitoring, what comes to mind? For me, as an admin, a vision of reading through logs and looking at events and timestamps keeps me up at night. I definitely see the need for monitoring, and your CIO or IT manager definitely wants you to set up the tooling to get any and all metrics and data you can dig up on performance. Then there’s the ‘root cause’ issue. The decision makers want to know what the root cause was when their application crashed and went down for four hours. You get that data from a good monitoring tool. Well, time to put on a happy face and implement one. Not just any tool will do, though. You want a tool that isn’t just going to show you a bunch of red and green lights. For it to be successful, there has to be something in it for you! I’m going to lay out my top three things that a good monitoring tool can do for you, the admin or engineer in the trenches day in and day out!

Find the Root Cause

Probably the single best thing a (good) systems monitoring tool can do is find the root cause of an issue that has become seriously distressing for your team. If you’ve been in IT long enough, the experience of having an unexplained outage is all too familiar. After the outage is finally fixed and things are back online, the first thing the higher-ups want to know is “why?” or “what was the root cause?” I cringe whenever I hear this. It means I need to dig through system logs, application event logs, networking logs, and any other avenue I might have to find the fabled root cause. Most great monitoring tools today have root cause analysis (RCA) built in. RCA can literally save you hours or even days of poring over logs. In discussions about implementing a systems monitoring tool, make sure RCA is high on your list of requirements.

Establish a Performance Baseline

How are you supposed to know whether something is an actual event or just a false positive? How could you point out something that’s out of the norm for your environment? Well, you can’t, unless you have a monitoring tool in place that learns what normal activity looks like and which events are anomalies. With some tools that offer high-frequency polling, you can pull baseline statistics for behavior down to the second. Any good monitoring tool will take a while to collect and analyze data before producing metrics that have meaning to your organization. Over time, the tool adapts to the data it collects and keeps the baseline up to date and accurate. That matters, because false positives can eat up a lot of resources for nothing.
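To make the baseline idea a little more concrete, here is a minimal sketch in Python of one simple way a tool could separate a real event from normal noise: compare each new sample against the mean and standard deviation of the samples just before it. This is an illustration only, not how any particular monitoring product works. The function name `find_anomalies`, the window size, the three-sigma threshold, and the sample CPU numbers are all assumptions made up for the example.

```python
from statistics import mean, stdev

def find_anomalies(samples, window=10, threshold=3.0):
    """Return (index, value) pairs that stray far from the rolling baseline.

    The baseline for each sample is the `window` samples before it.
    Both `window` and `threshold` are illustrative defaults, not
    recommendations from any vendor.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is out of the norm.
            if samples[i] != mu:
                anomalies.append((i, samples[i]))
            continue
        if abs(samples[i] - mu) > threshold * sigma:
            anomalies.append((i, samples[i]))
    return anomalies

# Hypothetical CPU utilization samples (percent), invented for illustration.
cpu_percent = [22, 25, 23, 24, 26, 22, 25, 24, 23, 25, 24, 91, 23, 25]
print(find_anomalies(cpu_percent))
```

Notice that nothing can be flagged until a full window of samples has been collected, which is the same reason a real tool needs time in your environment before its alerts mean anything.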

Reports, Reports, Reports

When issues arise, or RCA needs to be done, you want the systems monitoring tool to be capable of producing reports. Reports can come in the form of an exportable .csv file, .xls file, or .pdf. Some managers like a printout, a hard copy they can write on and mark up. With the ability to produce reports, you can have a solid history of network or systems behavior that you can store in SharePoint or whatever file share you have. Most tools keep an archive or history of reports, but it’s always good to have the option of exporting for backup and recovery purposes. I’ve found that a sortable Excel file that I can search through comes in very handy when I need to really dig in and find an issue that might be hiding in the metrics.
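As a rough sketch of what a sortable export looks like under the hood, the Python snippet below writes a handful of metrics out as CSV text with the worst offenders at the top. It uses only the standard library. The function name `export_report`, the column names, and the sample hosts and numbers are hypothetical and exist only for this example; a real tool will have its own export format.

```python
import csv
import io

def export_report(rows, fieldnames, sort_key):
    """Return the rows as CSV text, highest `sort_key` value first."""
    ordered = sorted(rows, key=lambda row: row[sort_key], reverse=True)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()

# Hypothetical sample metrics, invented for illustration only.
metrics = [
    {"host": "web01", "avg_response_ms": 120},
    {"host": "db01", "avg_response_ms": 870},
    {"host": "app01", "avg_response_ms": 240},
]
report = export_report(metrics, ["host", "avg_response_ms"], "avg_response_ms")
print(report)
```

The resulting text can be saved with a .csv extension and opened in Excel, where it stays sortable and searchable.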

Systems monitoring tools can do so much for your organization, and more importantly, for you! When you are looking for a systems monitoring tool, sift through all the bells and whistles and make sure that at least these three features are built in. It might save your hide one day, trust me!

  • I am trying to hone our process for major incidents to focus on the post mortem stage, especially the AAR. I am stressing the question, "Could monitoring have prevented this outage from occurring?" I don't think my service owners are liking my approach too much. From an Infrastructure perspective we are pretty set...

  • I quite enjoy setting up good monitoring as I know it will help me when one of the 1000+ servers 'has an issue'. It's great when I can quickly find out what has happened and prove to the network people that it really is their problem. I've recently used AppInsight for SQL to show our DBAs that it isn't the server that is causing database slowness, it is the badly configured database. That was particularly satisfying.

  • I actually enjoy my position in monitoring quite well. I once was a Cisco engineer at a large retailer and it was a great gig for many years. Eventually it just became boring to me to do the same thing over and over. Year 1, research and design a new network (fun). Year 2, a full (boring) year of project planning. Year 3, a very stressful year of implementation, down times, working 2nd and 3rd shift. Wash, rinse, repeat. (Oh, and study and re-certify every 3 years as well.)

    A few years back I decided I would try something different: Network Management, as it used to be called. I have enjoyed that change a lot. Now instead of being an expert in just one thing I'm good at a lot of things and even not very good at a lot as well.

    Now when I show up to work, who knows what I will be working on? Could be any one of many tools, operating systems, or programming languages and how much more fun can that be?

    I've been pretty happy with it!

  • Seems like a good article for mirroring of the new SCM module!
