Things can and do go wrong in IT. But with the advent of IT monitoring and automation, the future looks a little brighter.


After over a decade of implementing monitoring systems, I’ve become all too familiar with what might be called monitoring grief. It involves a series of behaviors I’ve grouped into five stages.


While agencies often go through these stages when rolling out monitoring for the first time, they can also occur when a group or department starts to seriously implement an existing solution, or when new capabilities are added to a current monitoring suite.


Stage One: Monitor Everything


This initial non-decision to “monitor everything” assumes that all the data is good and can be “tuned up” later. Everyone is in denial that an alert storm is about to hit.


Stage Two: The Prozac Moment


“All these things can’t possibly be going wrong!” This reaction ignores the fact that a computer defines “going wrong” only as it has been told to. So you ratchet things down, but “too much” is still showing red, and the reaction remains the same.


Monitoring is catching all the things that have been going up and down for weeks, months, or years without anyone noticing. This is the moment when you may have to ask the system owner to take a breath and recognize that knowing about outages is the first step to avoiding them.


Stage Three: Painting the Roses Green


The next stage occurs when too many things are still showing as “down” and no amount of tweaking is making them show “up” because, ahem, they are down.


System owners may ask you to change alert thresholds to impossible levels or to disable alerts entirely. I can understand the pressure to adjust reporting to senior management, but let’s not defeat the purpose of monitoring, especially on critical systems.


What makes this stage even more embarrassing is that the work involved in adjusting alerts is often greater than the work required to actually fix the issues causing them.


Stage Four: An Inconvenient Truth


If issues are suppressed for weeks or months, a critical error will eventually occur that can’t be glossed over. At that point, everything is analyzed, checked, and restarted in real time. For a system owner who has been avoiding the real issues, there is nowhere left to run or hide.


Stage Five: Finding the Right Balance


Assuming the system owner has survived stage four with their job intact, stage five involves getting it right. Agencies need to invest in setting alerting thresholds correctly and varying them based on the criticality of each system. Smart tools can also do a lot to correlate alerts and reduce the number the IT team has to manage. Beyond that, migrate your unreliable systems and fix the underlying network and systems management issues as time and budget allow.
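To make the threshold-and-correlation idea concrete, here is a minimal sketch in Python. The tier names, threshold values, and system names are hypothetical illustrations, not any particular monitoring product’s API:

```python
# Hypothetical per-tier CPU thresholds: critical systems alert sooner,
# best-effort systems only alert when things are truly bad.
CPU_THRESHOLDS = {
    "critical": 80,     # percent CPU that triggers an alert
    "standard": 90,
    "best-effort": 95,
}

def should_alert(tier: str, cpu_percent: float) -> bool:
    """Alert only when usage exceeds the threshold for this system's tier."""
    return cpu_percent > CPU_THRESHOLDS[tier]

def correlate(alerts: list, dependencies: dict) -> list:
    """Suppress alerts for systems whose upstream dependency is already
    alerting, so the team sees one root-cause alert instead of a storm.

    alerts: list of dicts like {"system": "web1"}
    dependencies: maps a system to the upstream systems it relies on.
    """
    down = {a["system"] for a in alerts}
    root_causes = []
    for a in alerts:
        upstreams = dependencies.get(a["system"], [])
        if any(u in down for u in upstreams):
            continue  # downstream symptom; the upstream alert covers it
        root_causes.append(a)
    return root_causes
```

With this shape, a router outage produces a single router alert rather than one alert per server behind it, which is the kind of noise reduction that makes stage five sustainable.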


Find the full article on Federal Technology Insider.