The Cost of (not) Monitoring
What does a wireless thermometer have in common with ping? Both can keep a business from losing cash.
One of the ways businesses stay in business is by keeping a tight rein on costs. So it should come as no surprise that convincing executives to allocate budget toward IT monitoring software can be a challenge. To the average executive (and, let’s be honest, the mid-level manager listening to the technical ramblings of an excited but fiscally vague IT pro), monitoring seems like a pure sunk cost with no possibility of return.
However, IT pros know this couldn’t be further from the truth. All that’s required to help others understand why is to answer a single question: how much will not monitoring cost?
Case in point: recently, a 300-bed hospital considered implementing a $5,000 automated temperature monitoring system for the freezers where the hospital’s supply of food was stored. The system would have saved staff time by measuring the current temperature in each of the coolers and freezers, and sending notifications if the temperature was out of acceptable range.
Hospital administration declined, deeming the solution too expensive just to know that a freezer was five degrees too cold. Needless to say, one of the staff members eventually left the door to the main cooler open, which caused the compressor to run all evening until it failed completely. The next morning, staff arrived only to find all the food in that cooler had spoiled. Recovering from this failure required emergency food orders, extra staff, repair services and a lot of overtime.
The total cost of the outage came to a cool $1 million—200 times more than the cost of the monitoring system deemed to be “too expensive.” This kind of scenario, where a small upfront investment could have prevented costly problems down the road, should sound hauntingly familiar to IT pros.
With this example in mind, it behooves us as IT professionals to be able to explain—in clear terms that non-technical staff can understand—what is intuitively obvious to those of us in the trenches: the cost of not monitoring is often far greater than the tools that could help us avoid failures in the first place.
Convincing non-IT staff of the need for monitoring tools after a critical system failure is probably a little easier, as outages tend to remain fresh in people’s minds for a long time. But just how can IT pros make the case for monitoring without first experiencing an actual IT resource failure? Or, if an organization has experienced a failure with a particular system, how can IT pros make the case for purchasing monitoring tools to protect other mission-critical systems?
It really comes down to identifying the potential costs of a failure. Every management team feels differently; what leadership at one organization feels is catastrophic, others might simply consider the cost of doing business. Therefore, IT pros need to highlight costs that are eminently avoidable. Some things to consider are:
- The ultimate end result of a problem if it goes undetected
- The amount of time a particular failure could go unreported
- The amount of time it would take to fix the system as a result of a failure
- Regular hourly staff cost for the system in question
- Emergency and overtime staff cost for the system in question
- Planned vendor maintenance costs versus emergency vendor repair costs
- Lost sales or other income per hour if the system in question is unavailable
To understand how all this fits together, consider the simple example of a hard drive failure on a primary email server.
To begin with, no self-respecting IT pro would be caught dead without some form of fault-tolerance for a critical system such as email. So, in this example, let’s say a mirrored drive was in place, but it failed a couple days prior to the second drive’s failure. Since there was no monitoring solution in place, nobody noticed, effectively making it a single drive system.
The end result is that the system would crash. You would think an email system crash would be immediately noticeable, but email clients like Outlook do a great job of offline caching, so it can actually take a while before anyone notices. In this example, let’s say it takes 30 minutes.
Recovering from a hard drive failure takes time unless there are spare parts immediately on hand and some kind of instant recovery option. Let’s estimate that replacing the drive itself takes about an hour, and restoring from backup takes another hour. However, this is a vendor repair, which adds a lead time before work can start: four hours for standard service or one hour for emergency service.
Now let’s look at the costs. Let’s say regular staff time is $53 per hour while overtime is $75 per hour. Standard vendor repair is free, but remember that four-hour lead time. Emergency vendor repair is $150 per hour with a two-hour minimum.
This means email will be offline for between three and a half and six and a half hours, at a cost of between $106 and $450. This may not seem like a big deal. However, that is the cost of just one drive failure. Consider a company that experiences 350 drive failures a year (something I have personally witnessed). Now we’re talking about between $37,000 and $157,000 per year, not counting company revenue lost while email is down and productivity plummets as a result.
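For anyone who wants to check the math, here is a minimal Python sketch of the example. The article gives the rates and durations but does not itemize the totals, so the breakdown is an assumption: staff and vendor are each billed for the two hours of hands-on work (drive swap plus restore), and the `Scenario` class and function names are made up for illustration.

```python
from dataclasses import dataclass

# Figures taken from the email server example
DETECTION_HOURS = 0.5     # time before anyone notices the crash
REPLACE_HOURS = 1.0       # swapping the failed drive
RESTORE_HOURS = 1.0       # restoring from backup
FAILURES_PER_YEAR = 350

@dataclass
class Scenario:
    name: str
    vendor_lead_hours: float   # wait before the vendor starts work
    vendor_rate: float         # vendor dollars per hour
    vendor_min_hours: float    # minimum billable vendor hours
    staff_rate: float          # in-house staff dollars per hour

STANDARD = Scenario("standard", 4.0, 0.0, 0.0, 53.0)     # free repair, regular pay
EMERGENCY = Scenario("emergency", 1.0, 150.0, 2.0, 75.0)  # paid repair, overtime pay

def downtime_hours(s: Scenario) -> float:
    """Hours email is offline: detection + vendor lead time + repair work."""
    return DETECTION_HOURS + s.vendor_lead_hours + REPLACE_HOURS + RESTORE_HOURS

def cost_per_failure(s: Scenario) -> float:
    """Assumed breakdown: vendor and staff both bill for the hands-on hours."""
    work_hours = REPLACE_HOURS + RESTORE_HOURS
    vendor = s.vendor_rate * max(work_hours, s.vendor_min_hours)
    staff = s.staff_rate * work_hours
    return vendor + staff

for s in (STANDARD, EMERGENCY):
    per_failure = cost_per_failure(s)
    print(f"{s.name}: {downtime_hours(s)} h offline, "
          f"${per_failure:,.0f} per failure, "
          f"${per_failure * FAILURES_PER_YEAR:,.0f} per year")
```

Under those assumptions the standard path works out to 6.5 hours and $106 per failure, and the emergency path to 3.5 hours and $450, which lines up with the range quoted above. Swap in your own rates and failure counts to see how quickly the annual number moves.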
Now, of course, drives fail whether they are monitored or not. However, in the above example, catching the first drive failure, replacing it at a convenient time and avoiding both the outage and the time spent performing data recovery could save between $18,500 and almost $140,000 over the course of a year.
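The savings range can be roughed out the same way. This is only one reading that happens to reproduce the figures above, not a breakdown the article spells out: it assumes that with monitoring, each failed drive is swapped during regular hours under free standard vendor service, costing one hour of regular staff time and requiring no restore. The variable and function names are illustrative.

```python
FAILURES_PER_YEAR = 350
REGULAR_RATE = 53  # dollars per hour, regular staff time

# Per-failure cost without monitoring, from the email server example
UNMONITORED_COST = {"standard": 106, "emergency": 450}

# Assumption: a monitored failure costs one regular-hours drive swap, no restore
MONITORED_COST = REGULAR_RATE * 1

def annual_savings(unmonitored_cost: float) -> float:
    """Yearly savings from catching each failure before it becomes an outage."""
    return (unmonitored_cost - MONITORED_COST) * FAILURES_PER_YEAR

for name, cost in UNMONITORED_COST.items():
    print(f"{name}: ${annual_savings(cost):,.0f} saved per year")
```

With these inputs the low end comes to $18,550 and the high end to $138,950 per year.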
It’s important to go through a similar exercise for all mission-critical systems in the IT environment (including email, CRM and Web services), combined with different types of outages, such as disk failures, application crashes and network failures.
To avoid becoming overwhelmed, prioritize. Take a hard look at the IT environment and honestly assess what systems are rock-solid and which are a bit shakier. Also, leverage other team members where necessary by asking them how long it takes to identify when their systems are offline, and how long it takes to bring them back up.
This process may seem tedious, but all too often it’s what it takes to help non-IT executives and other decision makers understand that proper monitoring is crucial, and that the cost of not monitoring can far exceed that of doing so. Simply put: speak their language, which is the language of money.
BONUS: To help you get started, I've uploaded a spreadsheet that collects this information and does some simple calculations: Monitoring_Cost_Estimator.xlsx
Note: This article originally appeared on InfoTech Spotlight. Click here to read that version.