Jogging is my exercise. I use it to tune out noise, focus on a problem at hand, avoid interruptions, and stay healthy. Recently, I was cruising at a comfortable nine-minute pace when four elite runners passed me, and it felt like I was standing still. It got me thinking about the relationship between health and performance. I came to the conclusion that they are related, but more like distant cousins than siblings.
I can provide you with data that indicates health status: blood pressure, resting heart rate, BMI, body fat percentage, current illnesses, etc. Given all that, tell me: can I run a four-minute mile? That question can’t be answered solely with the data I provided. That’s because I’m now talking about performance rather than health.
We can also look at health metrics with databases: CPU utilization, I/O stats, memory pressure, etc. However, those also can’t answer the question of how your databases and queries are performing. I’d argue that both health AND performance monitoring and analysis are important. They can impact each other but answer different questions.
“What gets measured gets done.” I love this saying and believe it to be true. The tricky part is making sure we’re measuring the right thing so that we drive the behavior we want.
Health is a very mature topic and pretty much all database monitoring solutions offer visibility into it. Performance is another story. I love this definition of performance from Craig Mullins as it relates to databases: “the optimization of resource use to increase throughput and minimize contention, enabling the largest possible workload to be processed.”
Interestingly, I believe this definition would be widely accepted, yet approaches to achieving it with monitoring tools vary widely. While I agree with this definition, I’d add “in the shortest possible time” to the end of it. If you agree that you need to consider a time component with regard to database performance, now we’re talking about wait-time analysis. Here’s a white paper that goes into much more detail on this approach and why it is the correct way to think about database performance.
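To make the idea of wait-time analysis a little more concrete, here is a minimal Python sketch. It assumes you already have samples of how long each query spent waiting and on what; the sample data, the wait event names (`cpu`, `io_read`, `lock`), and the `rank_by_wait_time` function are all hypothetical and made up for illustration, not taken from any particular monitoring tool.

```python
from collections import defaultdict

# Hypothetical samples: (query_id, wait_event, seconds_waited).
# In practice these would come from whatever your database exposes.
samples = [
    ("q1", "cpu", 0.5),
    ("q1", "io_read", 2.0),
    ("q2", "lock", 4.0),
    ("q2", "cpu", 0.25),
    ("q3", "cpu", 0.75),
]


def rank_by_wait_time(samples):
    """Return [(query_id, total_seconds, {wait_event: seconds})], slowest first."""
    by_query = defaultdict(lambda: defaultdict(float))
    for query_id, wait_event, seconds in samples:
        by_query[query_id][wait_event] += seconds

    ranked = [
        (query_id, sum(events.values()), dict(events))
        for query_id, events in by_query.items()
    ]
    # Sort by total time spent waiting, largest first.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


ranking = rank_by_wait_time(samples)
for query_id, total, events in ranking:
    # The dominant wait event hints at where to look for root cause.
    top_event = max(events, key=events.get)
    print(f"{query_id}: {total:.2f}s total, mostly {top_event}")
```

The point of the sketch is the shape of the question, not the code itself: instead of asking how busy the server is, it asks which queries cost the most time and what they were waiting on.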
We can only get to the right answer regarding root cause if we’re collecting (measuring) the right data in the first place. Below is a chart with some thoughts on data collection requirements. Adapt as needed, but I hope it provides a workable framework.
Remember: don’t stop with asking “What can we do?” Take it to the next level and instead ask, “What should we do?”