The Cost of Monitoring (but not Automating)

OR: "Don’t just sit there, DO something!"

If you have used a monitoring tool for any length of time, you are certainly comfortable setting up new devices like servers, routers, switches and the like. Adding sub-elements (disks, interfaces, and the like) is probably a snap for you. There’s a good chance you’ve set up your fair share of reports and data exports. Alerts? Pssshhhh! It’s a walk in the park, right?

But what do you DO with those alerts?

If you are like most IT Professionals who use a monitoring tool, you probably set up an email, or a text message to a cellphone, or, if you are especially ambitious, an automated ticket in whatever incident system your company uses. And once that’s set up, you call it a day.

Your monitoring will detect an error, a notification will be sent out, a human will be launched into action of some kind, and the problem will (eventually) be resolved.

But why? Why disturb a living, breathing, working (or worse – sleeping) person if a computer could do something about the situation?

The fact is that many alerts have a simple response which can often be automated, and doing so (taking that automatic action) can save hours of human time.

Here are some examples of direct action you can take:

| A monitor triggers when | Have automation do this |
| --- | --- |
| A service is down | Attempt to restart the service |
| A disk is over xx% full | Clear the standard TEMP folders |
| An IP address conflict is detected | Shut down the port of the newer device |
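
To make one of those actions concrete, here is a minimal sketch of the "clear the TEMP folders" response. The function name `clear_old_files` and the 24-hour age cutoff are my own assumptions for illustration, not something any particular monitoring tool ships with; most tools will simply let you run a script like this when the alert fires.

```python
import os
import time
from pathlib import Path


def clear_old_files(folder: str, max_age_seconds: float = 24 * 3600) -> int:
    """Delete regular files under `folder` older than the cutoff.

    Returns the number of bytes freed. The age cutoff is an assumption:
    it avoids deleting temp files that a running program may still need.
    """
    cutoff = time.time() - max_age_seconds
    freed = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = Path(root) / name
            try:
                info = path.stat()
                if info.st_mtime < cutoff:
                    path.unlink()
                    freed += info.st_size
            except OSError:
                # File is locked, already gone, or not ours to delete: skip it.
                continue
    return freed
```

You would point it at whatever your systems treat as the standard temp location, for example `clear_old_files(tempfile.gettempdir())`, and log the returned byte count so the follow-up alert (if any) can say how much space was recovered.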

If the action is not successful, most monitoring systems will trigger a secondary action (that email, text message, or ticket I mentioned earlier) after a second wait time. (Pro Tip: If your monitoring solution doesn’t support this, it may be time to re-think your monitoring solution).
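
The "act first, escalate only if that fails" flow described above can be sketched in a few lines. Everything here is hypothetical scaffolding: `check`, `remediate`, and `escalate` stand in for whatever your monitoring tool calls its poller, its alert action, and its notification, and the five-minute default wait is just a placeholder.

```python
import time
from typing import Callable


def remediate_then_escalate(
    check: Callable[[], bool],          # returns True when the element is healthy
    remediate: Callable[[], None],      # the automatic action (restart, cleanup, ...)
    escalate: Callable[[str], None],    # the email, text message, or ticket
    wait_seconds: float = 300.0,        # assumed wait before re-checking
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if no human was needed, False if the alert was escalated."""
    if check():
        return True  # nothing is wrong, nothing to do

    try:
        remediate()
    except Exception as exc:
        # The automatic action itself blew up: tell a human right away.
        escalate(f"automatic action failed: {exc}")
        return False

    sleep(wait_seconds)  # give the action time to take effect
    if check():
        return True

    escalate("problem persists after automatic action")
    return False
```

The point of the design is that the human only hears about it after the obvious first step has already been tried, which is exactly the trade-off discussed next.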

At worst, your alert will be delayed by a few minutes. BUT it will be delayed by having done (instantly) what the human technician was going to do once they logged in, so in a sense the situation is more than a few minutes ahead of where it would be if you had let the human process proceed as normal.

But that’s not all. Another action you can take is to gather information. Many monitoring tools will allow you to collect additional information at the time of the alert, and then “inject” it into the alert. For example:

| A monitor triggers when | Have automation do this |
| --- | --- |
| CPU utilization is over xx% | Get the top 10 processes, sorted by CPU usage |
| RAM utilization is over xx% | Get the top 10 processes, sorted by RAM usage |
| A VM is using more than xx% of the host resources | Include the VM name in the message |
| Disk is over xx% full (after clearing temp folders) | Scan disk for top 10 files, sorted by size, that have been added or updated in the last 24 hours |
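
As a rough illustration of the disk example, the sketch below scans a folder for the largest recently changed files and appends them to the alert text. The function names `largest_recent_files` and `enrich_alert` are made up for this post, and how the result gets "injected" into the alert depends entirely on your monitoring tool; here it is just string building.

```python
import heapq
import os
import time


def largest_recent_files(root, limit=10, max_age_seconds=24 * 3600):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first,
    for files under `root` modified within the last `max_age_seconds`."""
    cutoff = time.time() - max_age_seconds
    recent = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # unreadable or vanished mid-scan: skip it
            if info.st_mtime >= cutoff:
                recent.append((info.st_size, path))
    return heapq.nlargest(limit, recent)


def enrich_alert(message, root, limit=10):
    """Append the scan results to the original alert message."""
    lines = [message, f"Largest files changed in the last 24 hours under {root}:"]
    for size, path in largest_recent_files(root, limit):
        lines.append(f"  {size:>12,d} bytes  {path}")
    return "\n".join(lines)
```

The CPU and RAM rows follow the same shape: collect a small, sorted list at the moment the alert fires, then attach it to the message so the technician starts with evidence instead of a login prompt.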

Sounds lovely, but is this really going to impact the bottom line?

For a previous client, I implemented nothing more sophisticated than the disk actions (clearing the Temp drive, and alerting after another 15 minutes if the disks were still full) and adding the top 10 processes to the high CPU alert.

The results were anywhere from 30% to 70% fewer alerts compared to the same month in the previous year. In real numbers, this translated to anywhere from 43 to 175 fewer alerts per month. In addition, the support staff saw the results and responded FASTER to the remaining alerts because they knew the pre-actions had already been done.

The number of CPU alerts obviously didn’t drop, but once again we saw the support staff response improve, since the ticket now included information about what specifically was going wrong. In one case, the client was able to go back to a vendor and request a patch because they were finally able to prove a long-standing issue with the software.

As virtualization and falling costs (coupled, thankfully, with expanding budgets) push the growth of IT environments, the need to leverage monitoring to ensure the stability of computing environments becomes ever more obvious. Less obvious, but just as critical (and valuable), is the need to ensure that the human cost of that monitoring remains low by leveraging automation.

NOTE: This is a continuation of my Cost of Monitoring series. The first installment can be found here: The Cost of (not) Monitoring.

Part two is posted here: The Cost of Monitoring - with the wrong tool (part 1 of 2) and here: The Cost of Monitoring - with the wrong tool (part 2 of 2)
