The Cost of Monitoring (but not Automating)

OR: "Don’t just sit there, DO something!"

If you have used a monitoring tool for any length of time, you are certainly comfortable setting up new devices: servers, routers, switches, and so on. Adding sub-elements (disks, interfaces, and the like) is probably a snap for you. There’s a good chance you’ve set up your fair share of reports and data exports. Alerts? Pssshhhh! It’s a walk in the park, right?

But what do you DO with those alerts?

If you are like most IT professionals who use a monitoring tool, you probably set up an email, or a text message to a cellphone, or, if you are especially ambitious, an automated ticket in whatever incident system your company uses. And once that’s set up, you call it a day.

Your monitoring will detect an error, a notification will be sent out, a human will be launched into action of some kind, and the problem will (eventually) be resolved.

But why? Why disturb a living, breathing, working (or worse – sleeping) person if a computer could do something about the situation?

The fact is that many alerts have a simple response which can often be automated, and doing so (taking that automatic action) can save hours of human time.

Here are some examples of direct action you can take (a rough script sketch follows the table):

| A monitor triggers when... | Have automation do this |
| --- | --- |
| A service is down | Attempt to restart the service |
| A disk is over xx% full | Clear the standard TEMP folders |
| An IP address conflict is detected | Shut down the port of the newer device |
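To make the first two rows concrete, here is a minimal sketch of what such an alert action could look like. This is my own hypothetical example, not a feature of any particular product: the service name and temp folder paths are placeholders, and it assumes a Linux host with systemd (a Windows version would restart the service and empty %TEMP% instead). Most monitoring tools can run something like this as an "alert action," passing in the affected node or service as an argument.

```python
# remediate.py - hypothetical alert action sketch (names and paths are placeholders)
import shutil
import subprocess
from pathlib import Path

def restart_service(name: str) -> bool:
    """Try to restart a systemd service and report whether it is running again."""
    subprocess.run(["systemctl", "restart", name], check=False)
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0

def clear_temp_folders(folders=("/tmp", "/var/tmp")) -> int:
    """Delete the contents of the standard temp folders; return the number of items removed."""
    removed = 0
    for folder in folders:
        for item in Path(folder).iterdir():
            try:
                if item.is_dir() and not item.is_symlink():
                    shutil.rmtree(item, ignore_errors=True)
                else:
                    item.unlink()
                removed += 1
            except OSError:
                continue  # file in use or permission denied; skip it and move on
    return removed

if __name__ == "__main__":
    if not restart_service("myservice"):  # placeholder service name
        print("Restart failed; let the normal alert escalate to a human.")
    print(f"Removed {clear_temp_folders()} items from the temp folders.")
```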

If the action is not successful, most monitoring systems will trigger a secondary action (that email, text message, or ticket I mentioned earlier) after a second wait time. (Pro Tip: If your monitoring solution doesn’t support this, it may be time to re-think your monitoring solution).

At worst, your alert will be delayed by a few minutes. BUT in that time the automation will already have done (instantly) what the human technician was going to do once they logged in, so in a sense the situation is more than a few minutes ahead of where it would be if you had let the human process proceed as normal.
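If your tool only gives you a single "run this on alert" hook, the same remediate-wait-recheck-escalate pattern can be approximated inside the action script itself. Again, this is a hedged sketch rather than any product's built-in feature; the wait time and the notify() call stand in for whatever re-check interval and email/SMS/ticket mechanism you actually use.

```python
# escalate.py - hypothetical "remediate, wait, re-check, then escalate" pattern
import time

def remediate() -> None:
    """Placeholder for the automatic fix (restart the service, clear temp folders, etc.)."""

def still_failing() -> bool:
    """Placeholder for re-running the original check (service status, disk %, and so on)."""
    return False

def notify(message: str) -> None:
    """Placeholder for the email, text message, or ticket the monitoring tool would send."""
    print(message)

def handle_alert(wait_seconds: int = 900) -> None:
    remediate()
    time.sleep(wait_seconds)  # e.g. wait 15 minutes before re-checking
    if still_failing():
        notify("Automatic remediation did not resolve the issue; a human is needed.")

if __name__ == "__main__":
    handle_alert()
```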

But that’s not all. Another action you can take is to gather information. Many monitoring tools will allow you to collect additional information at the time of the alert, and then “inject” it into the alert message. For example (a gather-and-inject sketch follows the table):

| A monitor triggers when... | Have automation do this |
| --- | --- |
| CPU utilization is over xx% | Get the top 10 processes, sorted by CPU usage |
| RAM utilization is over xx% | Get the top 10 processes, sorted by RAM usage |
| A VM is using more than xx% of the host resources | Include the VM name in the message |
| Disk is over xx% full (after clearing temp folders) | Scan the disk for the top 10 files, sorted by size, that have been added or updated in the last 24 hours |
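Here is a hedged sketch of the "gather" half for the first two rows, written in Python against the third-party psutil package (my assumption; your tool may ship its own equivalent, or you could shell out to ps or Get-Process). The output is a plain text block the monitoring tool can append to the alert or ticket.

```python
# top_procs.py - hypothetical "gather and inject" helper (assumes the psutil package)
import psutil

def top_processes(sort_key: str = "cpu_percent", count: int = 10) -> str:
    """Return a text block listing the top processes, ready to paste into an alert."""
    procs = []
    # Note: the first cpu_percent sample per process can read 0.0; real use might
    # take two samples a second apart for more meaningful CPU numbers.
    for proc in psutil.process_iter(attrs=["pid", "name", "cpu_percent", "memory_percent"]):
        procs.append(proc.info)
    procs.sort(key=lambda p: p.get(sort_key) or 0.0, reverse=True)

    lines = []
    for p in procs[:count]:
        value = p.get(sort_key) or 0.0
        lines.append(f"{p['pid']:>7}  {value:6.1f}  {p['name']}")
    return "\n".join(lines)

if __name__ == "__main__":
    # The monitoring tool would append this output to the high-CPU or high-RAM alert text.
    print(top_processes("cpu_percent"))
    print(top_processes("memory_percent"))
```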

Sounds lovely, but is this really going to impact the bottom line?

For a previous client, I implemented nothing more sophisticated than the disk actions (clearing the Temp drive, and alerting after another 15 minutes if the disks were still full) and adding the top 10 processes to the high CPU alert.

The results were anywhere from 30% to 70% fewer alerts compared to the same month in the previous year. In real numbers, this translated to anywhere from 43 to 175 fewer alerts per month. In addition, the support staff saw the results and responded FASTER to the remaining alerts because they knew the pre-actions had already been done.

The number of CPU alerts obviously didn’t go down, but once again we saw the support staff response improve, since the ticket now included information about what, specifically, was going wrong. In one case, the client was able to go back to a vendor and request a patch, because they were finally able to prove a long-standing issue with the software.

As virtualization and falling costs (coupled, thankfully, with expanding budgets) push the growth of IT environments, the need to leverage monitoring to ensure the stability of computing environments becomes ever more obvious. Less obvious, but just as critical (and valuable), is the need to ensure that the human cost of that monitoring remains low by leveraging automation.

NOTE: This is a continuation of my Cost of Monitoring series. The first installment can be found here: The Cost of (not) Monitoring.

Part two is posted here: The Cost of Monitoring - with the wrong tool (part 1 of 2) and here: The Cost of Monitoring - with the wrong tool (part 2 of 2)

  • What is funny about this is that "Automated Operations" was the new cool buzzword back in 1990. 

    Lights out data centers were the holy grail of the time. 

    AO was embraced by the mainframe community, and somewhat more slowly over time in the unix and windows worlds.  Some of the windows and unix sysadmins embrace it today and some don't.

  • Great article Leon!  Automation is the way to go.  Thanks again.

  • Well said.  Automation also removes the human mistake factors. 

  • Thanks Leon for some good starting points as usual - noted for when we get back to incident automation.

    Working as part of an OSS team that sat off to the side of Ops, and was thought of only as the guys who fix the alarm browser and make sure the network reports are available, we have turned into the 'It' team that is driving automation into other areas of the business.

    We've always been running scripts in the background for ourselves to help get the job done, but yes the time has come to propel our work to assist our colleagues - especially when there is the whole upper management ethos of keeping headcount lean.

    Currently we're working on the onboarding of devices into said monitoring and CMDB systems, where human error from copying information multiple times cascades immensely once it comes time to actually investigate an incident.
