You Can Only Fix What You Measure (So Measure What You Want to Fix)

Recently, my colleagues Pete Di Stefano, Ashley Adams, and I hosted a webcast on the topic of capacity planning and optimization. 

As part of the discussion, we talked about the need to measure the right things to get the correct outcome. Keep in mind my oft-repeated mantra: monitoring is simply the collection of data. You need a robust, mature tool to add context, which transforms those metrics into information. Only when you have contextually accurate information can IT folks hope to act—fixing the right problem to achieve the best result.

But do you know the result you want? Because if you aren’t clear on this point, you’ll end up fixing the wrong thing. And often, this leads you to emphasize measuring the wrong thing. Let me start off with a true story to illustrate my point:

As I’ve mentioned in the past, my dad was a musician, a percussionist (drummer) with the Cleveland Orchestra for almost 50 years. Because of the sheer variety of percussion instruments a piece might require—snare drum, kettle drum, bass drum, gong, xylophone, marimba, cymbals, and more—the folks in the section would give “ownership” of specific instruments to each team member. My dad always picked cymbals.

Here’s the punchline: I once asked him why, and he told me, “My pay per note is way higher than the guy playing snare drum.”

<rimshot>

The ridiculousness of Dad’s comment underscores something I see in IT (and especially in monitoring): two perfectly valid metrics (pay, and the number of notes you play) don’t necessarily produce a valid insight when you relate them to each other. It’s yet another expression of the old XKCD comic on correlation versus causation.

Here’s a less fanciful but equally ridiculous example I saw at several past jobs: ticket queue management. It’s fair to say if you work on a helpdesk, closing tickets is important. Ostensibly, it’s an indication of completed work. But anyone who has actually worked on a helpdesk knows this isn’t completely true. I can close a ticket without completing the work. Heck, some ticket systems let me mass-close a whole bunch of tickets at the same time, never having done any of the tasks requested in them.

Nevertheless, it’s common to see a young, hungry, eager-to-prove-themselves helpdesk manager implement a policy emphasizing ticket closure rates. These policies use a wide range of carrots (or sticks), all pointing in the same direction: “close ALL your tickets, or else!” The inevitable result is support folks dutifully closing every one of their tickets, whether the work is complete or not, sometimes before the customer has even hung up the phone.

The problem stems from using ticket closure as a key metric. Tickets aren’t, in and of themselves, a work product. They’re merely indicators of work. It’s possible to close 100 tickets a day and not help a single person or accomplish a single task. Closed tickets don’t speak to quality, satisfaction, or expertise. They also don’t speak to the durability of the solution. “Have you tried turning it off and on again?” will probably make the problem go away (and let you close the ticket), but it’s highly likely the problem will come back, since nothing was actually fixed, only deferred.

We who make our career (or at least spend part of our day) with monitoring solutions are also familiar with how this plays out day to day. Measure CPU (or RAM, or disk usage) with no other context or input, so that any spike over 80% triggers an alert, and you’ll quickly end up with technicians who over-provision the system. Measure uptime on individual systems without the context of load balancing, and you’ll quickly end up with managers who demand five-nines even though the customer experience was never impacted. And so on.
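
To make the CPU example concrete, here’s a minimal sketch in plain Python. The sample data, the 80% threshold, and the four-minute “sustained” window are all hypothetical values chosen for illustration, and the sketch isn’t tied to any particular monitoring product. It simply contrasts a raw “any sample over 80%” rule with one that only alerts when the threshold holds for a sustained stretch, which is one simple way to add context before acting.

```python
# Hypothetical CPU samples, one per minute. Values and thresholds are
# illustrative only, not taken from any real system.
CPU_SAMPLES = [35, 42, 81, 38, 40, 85, 88, 91, 87, 90, 44, 39]
THRESHOLD = 80          # percent
SUSTAINED_MINUTES = 4   # only alert if the threshold holds this long


def naive_alerts(samples):
    """Alert on every individual sample over the threshold (no context)."""
    return [i for i, cpu in enumerate(samples) if cpu > THRESHOLD]


def sustained_alerts(samples):
    """Alert only when the threshold is exceeded for SUSTAINED_MINUTES in a row."""
    alerts, run = [], 0
    for i, cpu in enumerate(samples):
        run = run + 1 if cpu > THRESHOLD else 0
        if run == SUSTAINED_MINUTES:
            alerts.append(i)  # fires once per sustained breach
    return alerts


if __name__ == "__main__":
    print("naive alerts at minutes:    ", naive_alerts(CPU_SAMPLES))
    print("sustained alerts at minutes:", sustained_alerts(CPU_SAMPLES))
```

Against this sample data the naive rule fires six times (every momentary blip), while the sustained rule fires once, for the stretch that actually looks like a problem worth investigating rather than over-provisioning away.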

The point, both of this blog and of my original discussion with Ashley and Pete, is that optimization can’t happen without first gathering metrics, and that gathering raw metrics without also including context leads to all manner of tomfoolery.
