The Cost of Monitoring (but not Automating)

OR: "Don’t just sit there, DO something!"

If you have used a monitoring tool for any length of time, you are certainly comfortable setting up new devices such as servers, routers, and switches. Adding sub-elements (disks, interfaces, and the like) is probably a snap for you. There’s a good chance you’ve set up your fair share of reports and data exports. Alerts? Pssshhhh! It’s a walk in the park, right?

But what do you DO with those alerts?

If you are like most IT professionals who use a monitoring tool, you probably set up an email, a text message to a cellphone, or, if you are especially ambitious, an automated ticket in whatever incident system your company uses. And once that’s set up, you call it a day.

Your monitoring will detect an error, a notification will be sent out, a human will be launched into action of some kind, and the problem will (eventually) be resolved.

But why? Why disturb a living, breathing, working (or worse – sleeping) person if a computer could do something about the situation?

The fact is that many alerts have a simple response that can be automated, and taking that action automatically can save hours of human time.

Here are some examples of direct action you can take:

A monitor triggers when             | Have automation do this
A service is down                   | Attempt to restart the service
A disk is over xx% full             | Clear the standard TEMP folders
An IP address conflict is detected  | Shut down the port of the newer device
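To make one of these actions concrete, here is a minimal Python sketch of the "clear the standard TEMP folders" response. The function name, the one-hour age threshold, and the choice to skip files it cannot delete are all assumptions for illustration, not something any particular monitoring tool prescribes.

```python
import time
from pathlib import Path


def clear_temp_files(temp_dir, min_age_seconds=3600, now=None):
    """Delete regular files under temp_dir that are older than min_age_seconds.

    Returns the number of bytes freed. Files that cannot be removed
    (locked, permission denied, already gone) are skipped, not fatal.
    The age threshold is an assumed safety margin for files still in use.
    """
    now = time.time() if now is None else now
    freed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(Path(temp_dir).rglob("*")):
        try:
            if path.is_symlink() or not path.is_file():
                continue  # leave directories and links alone
            info = path.stat()
            if now - info.st_mtime < min_age_seconds:
                continue  # touched recently, may still be in use
            path.unlink()
            freed += info.st_size
        except OSError:
            continue
    return freed
```

A monitoring tool would typically call something like `clear_temp_files(tempfile.gettempdir())` as the alert's first action and record the returned byte count in the alert notes.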

If the action is not successful, most monitoring systems will trigger a secondary action (that email, text message, or ticket I mentioned earlier) after a second wait time. (Pro Tip: If your monitoring solution doesn’t support this, it may be time to re-think your monitoring solution).
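The act-first, escalate-later flow can be sketched in a few lines of Python. This is only an illustration of the pattern: `is_healthy`, `remediate`, and `notify` are hypothetical callables standing in for whatever checks and actions your monitoring tool exposes, and the 300-second wait is an assumed default.

```python
import time


def remediate_then_escalate(is_healthy, remediate, notify,
                            wait_seconds=300, sleep=time.sleep):
    """Try an automatic fix first; notify a human only if it did not work.

    All three callables are placeholders (assumptions), not a real
    product API. Returns "healthy", "auto-resolved", or "escalated".
    """
    if is_healthy():
        return "healthy"
    try:
        remediate()
    except Exception as exc:
        # The fix itself blew up, so there is no point in waiting.
        notify(f"Automatic remediation failed: {exc}")
        return "escalated"
    sleep(wait_seconds)  # give the fix time to take effect
    if is_healthy():
        return "auto-resolved"
    notify("Problem persists after automatic remediation")
    return "escalated"
```

The `sleep` parameter is injectable only so the flow can be exercised without actually waiting; in production the default `time.sleep` (or the monitoring tool's own delay setting) does the job.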

At worst, your alert will be delayed by a few minutes. BUT it will be delayed by having done (instantly) what the human technician was going to do once they logged in, so in a sense the situation is more than a few minutes ahead of where it would be if you had let the human process proceed as normal.

But that’s not all. Another action you can take is to gather information. Many monitoring tools will allow you to collect additional information at the time of the alert, and then “inject” it into the alert. For example:

A monitor triggers when                              | Have automation do this
CPU utilization is over xx%                          | Get the top 10 processes, sorted by CPU usage
RAM utilization is over xx%                          | Get the top 10 processes, sorted by RAM usage
A VM is using more than xx% of the host resources    | Include the VM name in the message
Disk is over xx% full (after clearing temp folders)  | Scan disk for top 10 files, sorted by size, that have been added or updated in the last 24 hours
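As an illustration of the information-gathering idea, the disk scan in the last row might look something like the Python sketch below. The function names and the output format are assumptions. How the resulting text gets injected into the alert depends entirely on your monitoring tool.

```python
import heapq
import os
import stat
import time


def largest_recent_files(root, limit=10, max_age_seconds=24 * 3600, now=None):
    """Return up to `limit` (size, path) pairs, largest first, for regular
    files under `root` modified within the last `max_age_seconds`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_mtime >= cutoff:
                candidates.append((info.st_size, path))
    return heapq.nlargest(limit, candidates)


def format_for_alert(files):
    """Render the scan result as plain text to append to an alert body."""
    lines = [f"{size:>12,d} bytes  {path}" for size, path in files]
    return "\n".join(lines) if lines else "(no files changed in the window)"
```

The top-10 process lists for the CPU and RAM rows follow the same shape: collect, sort, truncate, and append the text to the alert.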

Sounds lovely, but is this really going to impact the bottom line?

For a previous client, I implemented nothing more sophisticated than the disk actions (clearing the Temp drive, and alerting after another 15 minutes if the disks were still full) and adding the top 10 processes to the high CPU alert.

The results were anywhere from 30% to 70% fewer alerts compared to the same month in the previous year. In real numbers, this translated to anywhere from 43 to 175 fewer alerts per month. In addition, the support staff saw the results and responded FASTER to the remaining alerts because they knew the pre-actions had already been done.

The CPU alerts obviously didn’t reduce, but once again we saw the support staff response improve, since the ticket now included information about what specifically was going wrong. In one case, the client was able to go back to a vendor and request a patch because they were able to finally prove a long-standing issue with the software.

As virtualization and falling costs (coupled, thankfully, with expanding budgets) push the growth of IT environments, the need to leverage monitoring to ensure the stability of computing environments becomes ever more obvious. Less obvious, but just as critical (and valuable), is the need to ensure that the human cost of that monitoring remains low by leveraging automation.

NOTE: This is a continuation of my Cost of Monitoring series. The first installment can be found here: The Cost of (not) Monitoring.

Part two is posted here: The Cost of Monitoring - with the wrong tool (part 1 of 2) and here: The Cost of Monitoring - with the wrong tool (part 2 of 2)

  • What is funny about this is that "Automated Operations" was the new cool buzzword back in 1990. 

    Lights out data centers were the holy grail of the time. 

    AO was embraced by the mainframe community and, somewhat more slowly over time, in the Unix and Windows worlds.  Some of the Windows and Unix sysadmins embrace it today and some don't.

  • Great article Leon!  Automation is the way to go.  Thanks again.

  • Well said.  Automation also removes the human mistake factors. 

  • Thanks Leon for some good starting points as usual - noted for when we get back to incident automation.

    Our OSS team used to sit off to the side of Ops and was thought of only as the guys who fix the alarm browser and make sure the network reports are available. It has since turned into the 'It' team that is driving automation into other areas of the business.

    We've always been running scripts in the background for ourselves to help get the job done, but yes the time has come to propel our work to assist our colleagues - especially when there is the whole upper management ethos of keeping headcount lean.

    Currently we're working on the inbound flow of devices into the monitoring and CMDB systems, where human error from copying information multiple times cascades immensely once it comes time to actually investigate an incident.
