The Cost of Monitoring - with the wrong tool (part 1 of 2)
or “It’s not just the cost of buying the puppy, it’s the cost of feeding the puppy that adds up”
This is a story I've told out loud many times, on conference calls and around water coolers, but I've never written it down fully until now. It's the story of how using the wrong tool for the job can cost a company so much money it boggles the mind. It's a story I've witnessed more than once in my career, and heard anecdotally from colleagues over a dozen times.
Before I go into the details, I want to offer my thoughts on how companies get into this situation.
The discipline of monitoring has existed since the first server came online and someone wanted to know if it was "still up." Sophisticated monitoring tools have been around for over two decades, yet at most companies they are still implemented as if for the first time. Some of that has to do with inexperience. For example, the monitoring team may be young or new and may not have experienced monitoring at other companies, or the company itself may be new and have just grown to the point where it needs monitoring. Or there has been enough turnover that the people on the job now are so far removed from those who implemented the previous system that, for all intents and purposes, the solution or situation at hand is effectively "new."
In those cases, organizations end up buying the wrong tool because they simply don't have the experience to know what the right one is. Or more to the point…the right ONES. Because monitoring in all but the smallest organizations is a heterogeneous affair. There is no one-stop shop, no one-size-fits-all solution.
But that's only part of it. In many cases, the cost of monitoring has bloated beyond all reason due to an effect known as a "dollar auction." Simply put, the barrier to using better tools is an unwillingness to walk away from all the money already sunk into purchasing, deploying, developing, and maintaining the first one.
And that leads me back to my story. A company hired me to improve their monitoring. Five years earlier, they had invested in a monitoring solution from one of the "big three" solution providers. Implementing that solution took 18 months and 5 contractors, at a cost of $1 million in contractor fees plus $1.5 million for the actual software and hardware. After that, a team of 9 employees supported the solution: setting up new monitors and alerts, installing patches, and just keeping the toolset up and running. Aside from the staff cost, the company paid about $800,000 a year in maintenance.
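To put those numbers side by side, here is a rough back-of-the-envelope sketch. It assumes the $800,000 maintenance fee was paid in each of the five years since the purchase, which the story implies but I haven't broken down year by year. The function name and the annual_staff_cost parameter are hypothetical; staff cost is left at zero because I'm not quoting salaries for the 9-person team.

```python
def cumulative_tool_cost(contractors, software_hardware,
                         annual_maintenance, years, annual_staff_cost=0):
    """Rough total cost of a monitoring tool over a number of years.

    annual_staff_cost is a hypothetical placeholder; the story gives a
    headcount (9 employees) but no salary figures, so it defaults to 0.
    """
    upfront = contractors + software_hardware
    recurring = (annual_maintenance + annual_staff_cost) * years
    return upfront + recurring


# Figures from the story; the 5-year maintenance span is an assumption.
tool_cost = cumulative_tool_cost(
    contractors=1_000_000,
    software_hardware=1_500_000,
    annual_maintenance=800_000,
    years=5,
)
print(f"${tool_cost:,}")  # $6,500,000
```

Under those assumptions the tool had cost roughly $6.5 million before counting a single salary. Plugging in whatever fully loaded figure you like for nine people only pushes the number higher.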
With this solution they were able to monitor most of the 6,000 servers in the environment (a blend of Windows, Unix, Linux, and AS/400 systems), and they could perform up/down (ping) monitoring for the 4,000 network devices. But they encountered serious limitations when monitoring network hardware, non-routable interfaces, and other elements.
Meanwhile, the server and application monitoring inventory—the actual monitors, reports, triggers, and scripts—showed signs of extreme "bloat." They had over 7,000 individual monitoring situations, and around 3,000 alert triggers.
This was the first company I'd encountered where the monitoring and network teams weren't practically best friends, and even the server monitoring was showing signs of strain. Some applications weren't well monitored, either because the team was unfamiliar with them or because the tool couldn't get the data needed.
Part of the problem, as I mentioned earlier, was that the company had invested a lot in the tool, and wanted to "get their money's worth." So they attempted to implement it everywhere, even in situations where it was less than optimal. Because it was shoehorned into awkward situations, the monitoring team spent inordinate amounts of time not only making it fit, but keeping it from breaking.
KEY IDEA: You don't get your money's worth out of an expensive tool by putting it into as many places as you can, thereby making it more expensive. You get your money's worth by using each tool in a way that maximizes the things it does well and avoids the things it does not do well.
NOTE: This is a continuation of my Cost of Monitoring series. The first installment can be found here: The Cost of (not) Monitoring
Stay tuned for part 2, which I will post on January 20th, to see how we resolved this situation.
edit LJA20150116: forgot to include the link to explain a dollar auction.