Implementing a Systems Monitoring Tool: What’s in it for You?

Level 11

I don’t think there’s anyone out there who truly loves systems monitoring. I may be wrong, but traditionally, it’s not the most awesome tool to play with. When you think of systems monitoring, what comes to mind? For me, as an admin, a vision of reading through logs and looking at events and timestamps keeps me up at night. I definitely see that there is a need for monitoring, and your CIO or IT manager definitely wants you to set up the tooling to get any and all metrics and data you can dig up on performance. Then there’s the ‘root cause’ issue. The decision makers want to know what the root cause was when their application crashed and went down for four hours. You get that data from a good monitoring tool. Well, time to put on a happy face and implement a good tool. Not just any tool will do, though. You want a tool that isn’t just going to show you a bunch of red and green lights. For it to be successful, there has to be something in it for you! I’m going to lay out my top three things that a good monitoring tool can do for you, the admin or engineer in the trenches day in and day out!

Find the Root Cause

Probably the single best thing a (good) systems monitoring tool can do is find the root cause of an issue that has become seriously distressing for your team. If you’ve been in IT long enough, the experience of having an unexplained outage is all too familiar. After the outage is finally fixed and things are back online, the first thing the higher-ups want to know is “why?” or “what was the root cause?” I cringe whenever I hear this. It means I need to dig through system logs, application event logs, networking logs, and any other avenue I might have to find the fabled root cause. Most great monitoring tools today have root cause analysis (RCA) built in. RCA can literally save you hours or even days of poring over logs. In discussions about implementing a systems monitoring tool, make sure RCA is high on your list of requirements.
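To make the log-digging step concrete, here is a minimal sketch of the manual chore that RCA features try to automate: pulling events from several log sources into one timeline for the minutes before an outage. It is not how any particular product implements RCA. The function name `events_before`, the source names, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta

def events_before(outage_time, sources, window_minutes=15):
    """Merge events from all sources that fall inside the window
    leading up to the outage, oldest first."""
    start = outage_time - timedelta(minutes=window_minutes)
    merged = [
        (timestamp, name, message)
        for name, events in sources.items()
        for timestamp, message in events
        if start <= timestamp <= outage_time
    ]
    return sorted(merged)

# Hypothetical events, already parsed into (timestamp, message) pairs.
sources = {
    "system": [
        (datetime(2024, 1, 10, 1, 0), "scheduled backup started"),
        (datetime(2024, 1, 10, 2, 50), "disk latency warning"),
    ],
    "application": [
        (datetime(2024, 1, 10, 2, 58), "database connection timeouts"),
    ],
    "network": [
        (datetime(2024, 1, 10, 2, 47), "interface errors on uplink"),
    ],
}

outage = datetime(2024, 1, 10, 3, 0)
timeline = events_before(outage, sources)
for timestamp, name, message in timeline:
    print(timestamp.isoformat(), name, message)
```

A single ordered timeline like this only narrows down where to look. Deciding which event actually caused the outage still takes judgment, or a tool that understands dependencies between systems.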

Establish a Performance Baseline

How are you supposed to know whether something is an actual event or just a false positive? How could you point out something that’s out of the norm for your environment? Well, you can’t, unless you have a monitoring tool in place that learns what normal activity looks like and which events are genuine anomalies. With some tools that offer high frequency polling, you can pull baseline statistics for behavior down to the second. Any good monitoring tool will take a while to collect data and analyze it before producing metrics that have meaning to your organization. Over time, the tool adapts as it collects more data, so the baseline stays up to date and accurate. That matters, because things like false positives can eat up a lot of resources for nothing.
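As a rough illustration of the baseline idea, the sketch below summarizes a history of samples as a mean and standard deviation, then flags new values that fall too far from that norm. This is one simple statistical approach, chosen for illustration. Real tools may use very different methods, and the function names, sample data, and threshold of three standard deviations are assumptions.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomaly(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    average, deviation = baseline
    if deviation == 0:
        # A perfectly flat history: anything different is unusual.
        return value != average
    return abs(value - average) / deviation > threshold

# Hypothetical CPU utilization history (percent), polled at a fixed interval.
history = [22, 25, 24, 23, 26, 24, 25, 23, 24, 24]
baseline = build_baseline(history)

print(is_anomaly(27, baseline))  # slightly high, but within normal variation
print(is_anomaly(90, baseline))  # far outside the baseline
```

Recomputing the baseline over a sliding window of recent samples is one way to keep it current as the environment changes.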

Reports, Reports, Reports

When issues arise, or RCA needs to be done, you want the systems monitoring tool to be capable of producing reports. Reports can come in the form of an exportable .csv file, .xls file, or .pdf. Some managers like a printout, a hard copy they can write on and mark up. With the ability to produce reports, you can have a solid history of network or systems behavior that you can store in SharePoint or whatever file share you have. Most tools keep an archive or history of reports, but it’s always good to have the option of exporting for backup and recovery purposes. I’ve found that a sortable Excel file that I can search through comes in very handy when I need to really dig in and find an issue that might be hiding in the metrics.
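If a tool exposes its data but not the report you want, a sortable .csv is easy to produce yourself. The sketch below uses Python's standard `csv` module. The field names and sample rows are hypothetical, not the export format of any specific product.

```python
import csv
import io

def export_report(rows, fieldnames):
    """Render a list of dict rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical metric samples pulled from a monitoring tool.
events = [
    {"timestamp": "2024-01-10T02:58:00", "host": "db01", "metric": "cpu", "value": 97},
    {"timestamp": "2024-01-10T02:47:00", "host": "sw03", "metric": "if_errors", "value": 412},
    {"timestamp": "2024-01-10T02:50:00", "host": "db01", "metric": "disk_latency_ms", "value": 180},
]

# Sort by timestamp so the file opens in chronological order.
ordered = sorted(events, key=lambda event: event["timestamp"])
report = export_report(ordered, ["timestamp", "host", "metric", "value"])
print(report)
```

To save the report to a file share instead of printing it, write the returned text to a file opened with `newline=""`, so the line endings the `csv` module produces are not translated.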

Systems monitoring tools can do so much for your organization, and more importantly, for you! When you are looking for a systems monitoring tool, sift through all the bells and whistles and make sure that at least these three features are built in… it might save your hide one day, trust me!

10 Comments
Level 14

I actually like system monitoring.  I like when someone asks about something that I don't have the answer to off the top of my head and requires a bit of digging and thinking. 

Level 13

It's funny - I was thinking the same thing re: enjoying monitoring.  I don't want it to happen every day, but I always find it interesting (maybe even fascinating) to be presented with a question about performance or some odd issue and I really have to go digging to find it.

That said, the tools you have at hand are huge.  One of the things that I see happening with better and better tools is that the base skills of the new practitioners seem to be lacking since they've never had to figure it out for themselves and don't want to learn how because it's too hard.  No matter how great the tools are you still need a good practical skill set to get into systems  (switches, routers, servers, storage, etc) and be able to know how to dig out issues and know what is going on inside.

What's in it for me?  Forgive me for being candid, but the list starts with:

  • Expense.
    • Buying the new monitoring tool
    • Purchasing / configuring the server or resources in which it will reside
    • Time spent setting up that monitoring--from SNMP strings or WMI credentials on the end devices, to learning the (millions of) tie-ins to our network apps and their hardware.
    • Time spent troubleshooting, researching, waiting for Support Tickets to be resolved, installing & testing/troubleshooting Beta products for the next version, etc. 
  • More work.
    • If an old tool is not retired when a new tool is implemented, you spend more time with the tools instead of less. 
    • There comes a time when tools are no longer the answer, and that's when obsolete or ineffective tools must be removed.
    • If you get more work on your shoulders, something else is sacrificed.
      • Your personal performance
      • Your stress / happiness
      • Your home life
      • Your team's budget--another body may need to be hired
      • Your work efficiency (you spend more time documenting, more time on Help Desk tickets, and less time on new projects)
      • Your reputation may decline as others get the idea you can't get the job done, no matter how many tools you are given.
  • Questions.
    • Management wants to know what benefit we get out of every dollar and every minute spent on a monitoring solution.  They want things to be powerful, trouble-free, and doorways into automation that allow us to do more with less funds/people/time/hardware/outages/etc.
    • SysAdmins, DBA's, Apps Analysts, Network Analysts--all want to know what the new monitor will do:
      • Will it cause problems by using too many resources on the target devices/apps?
      • Will it run fast and reliably?
      • How can they use it personally?
      • How can they automate it?
      • How can they get the right alerts, but not too many alerts?
    • End Users want to know how it will make their lives better.
  • Better understanding of what's happening.
    • Part of installing a new monitoring tool means learning the systems it monitors.  You can't set up a successful monitor if you don't know what parts NEED monitoring, or what parts are significant and which ones are not.
    • Learning the new tool itself.  There are so many permutations and tools (29 products at SolarWinds to purchase/install the last time I looked a couple of years ago--and there are more today!) that a person can't keep up with them all.  Monitoring truly becomes a team effort, and Management is keenly aware of the dollars/time spent watching and reacting instead of improving and growing and retiring and changing products and procedures.
  • Better or more sleep.
    • IF the right monitors are in place, and the right training and right documentation has been provided, and the right alerts are configured, I don't get called out of bed at 3 a.m. on a weekend by the Help Desk.  That's HUGE in my book.
    • Knowing the trends of utilization or errors or temperature--or whatever makes your network or systems work poorly or well--means you can move into Predictive mode instead of being stuck in Reactive mode.  It can even facilitate your ability to create SLA's with your customers.  And that makes for better sleep for me AND for the customers.
    • Great monitoring helps me verify whether SLA's between me and my Service Providers have been violated or kept, which can mean refunds, penalties, improved services, or a decision for me to drop a problem provider and start a new relationship with other Service Providers.
    • Fewer complaints by customers.  This assumes you've not only set up a new System Monitoring Tool, but you're actually using it to take action that improves reliability, performance, availability, etc.  And those all equate to fewer customer complaints, fewer Help Desk Tickets, improved MTTI (Mean Time To Innocence)--and more sleep for you or your team!
  • A happier future
    • When we do well with monitors, we act on what they report.  When we act appropriately we improve customer satisfaction and reduce downtime or slowed production costs.  That equates to fewer hours wasted on problems, more hours growing and improving everything.  And that ties directly into great employee performance reviews.
    • These items can mean job title improvements, bonuses, raises, better work environment, better toys, more staff, and all the good things that come with these.
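The point above about utilization trends and moving from Reactive to Predictive mode can be sketched with a simple straight-line fit. This is only an illustration under stated assumptions: it assumes evenly spaced daily samples and roughly linear growth, and the function names and disk-usage numbers are hypothetical.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced samples
    taken at x = 0, 1, 2, ... (needs at least two samples)."""
    count = len(values)
    x_mean = (count - 1) / 2
    y_mean = sum(values) / count
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(count))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def days_until(values, limit):
    """Days after the last sample until the fitted line reaches `limit`.
    Returns None when the trend is flat or falling."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))

# Hypothetical daily disk utilization (percent used).
disk_used = [60, 62, 64, 66, 68]
print(days_until(disk_used, 90))
```

A forecast like this is only as good as the assumption that growth stays linear, so it works better as a prompt to investigate than as a firm deadline.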
Level 16

Thanks for the write up! Digging through log files isn't much fun but there are tools out there now that can do it in a big hurry if you learn their search language. 

MVP

Nice write up

Level 20

Seems like a good article for mirroring of the new SCM module!

Level 13

Thanks, another good article.

Level 16

I actually enjoy my position in monitoring quite a bit. I once was a Cisco engineer at a large retailer and it was a great gig for many years. Eventually it just became boring to me to do the same thing over and over. Year 1, research and design a new network (fun). Year 2, a full (boring) year of project planning. Year 3, a very stressful year of implementation, down times, working 2nd and 3rd shift.  Wash, rinse, repeat. (Oh, and study and re-certify every 3 years as well.)

A few years back I decided I would try something different: Network Management, as it used to be called. I have enjoyed that change a lot. Now instead of being an expert in just one thing, I'm good at a lot of things, and not very good at a lot of others as well.

Now when I show up to work, who knows what I will be working on? Could be any one of many tools, operating systems, or programming languages and how much more fun can that be?

I've been pretty happy with it!

Level 14

I quite enjoy setting up good monitoring as I know it will help me when one of the 1000+ servers 'has an issue'.  It's great when I can quickly find out what has happened and prove to the network people that it really is their problem.  I've recently used AppInsight for SQL to show our DBAs that it isn't the server that is causing database slowness, it is the badly configured database.  That was particularly satisfying.

I am trying to hone our process for major incidents to focus on the post mortem stage, especially the AAR. I am stressing the question, "Could monitoring have prevented this outage from occurring?" I don't think my service owners are liking my approach too much. From an Infrastructure perspective we are pretty set...