Choosing What to Monitor: Understanding Key Metrics

Monitoring has always been a loosely defined and somewhat controversial term in IT organizations. IT professionals have very strong opinions about the tools they use, because monitoring and alerting are key components of keeping systems online, performing well, and delivering IT's value proposition to the business. However, as the world has shifted toward a DevOps mindset, especially in larger organizations, the mindset around monitoring systems has shifted as well. You are not going to stop monitoring systems, but you might want to rationalize which metrics you are tracking in order to maintain better focus.

What To Monitor?

While operating systems, hardware, and databases all expose a litany of metrics that can be tracked, collecting that many performance metrics can make it hard to pay attention to the critical things that may be happening in your systems. That kind of deep-dive analysis is best reserved for troubleshooting one-off problems, not day-to-day monitoring. One approach to consider is classifying systems into broad categories and applying a small set of indicators that allow you to evaluate the health of each system.

User-facing systems like websites, e-commerce platforms, and of course everyone's most important system, email, have availability as their most important metric. Latency and throughput are secondary metrics, though for customer-facing systems they can be just as important.

Storage and Network Infrastructure should emphasize latency, availability, and durability. How long does a read or write take to complete, or how much throughput is a given network connection seeing?

Database systems, much like front-end systems, should be focused on end-to-end latency, but also on throughput: how much data is being processed and how many transactions happen per time period.
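
As a rough illustration of what that classification might look like in code, here is a minimal sketch; the category names, indicator names, and structure are assumptions made for this example, not the model of any particular monitoring tool.

```python
# Illustrative sketch: mapping broad system categories to the handful of
# indicators worth watching day to day. Names are invented for the example.
SYSTEM_CATEGORIES = {
    "user_facing": {
        "primary": ["availability"],
        "secondary": ["latency_ms", "throughput_rps"],
    },
    "storage_network": {
        "primary": ["latency_ms", "availability", "durability"],
        "secondary": ["throughput_mbps"],
    },
    "database": {
        "primary": ["end_to_end_latency_ms", "throughput_tps"],
        "secondary": ["rows_processed_per_min"],
    },
}

def indicators_for(category):
    """Return the indicators to collect and review for a given category."""
    spec = SYSTEM_CATEGORIES[category]
    return spec["primary"] + spec["secondary"]

print(indicators_for("database"))
```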

It is also important to think about which aspects of each metric should trigger an alert (a page to an on-call operator). I like to approach this with two key rules: any page should be for something actionable (service down, hardware failure), and always remember there is a human cost to paging, so if automation can respond to a page and fix the problem with a shell script, that's all the better.
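
Here is a minimal sketch of those two rules in practice. The remediation script paths are invented, and `page_on_call` is a stand-in for whatever paging tool you actually use.

```python
import subprocess

# Hypothetical remediation scripts keyed by alert name, purely illustrative.
REMEDIATIONS = {
    "service_down:web": ["/usr/local/bin/restart_web.sh"],
    "disk_full:/var/log": ["/usr/local/bin/rotate_logs.sh"],
}

def page_on_call(alert_name):
    print(f"PAGE: {alert_name}")  # stand-in for a real paging integration

def handle_alert(alert_name, actionable):
    """Rule 1: only actionable alerts reach a human.
    Rule 2: automation gets the first attempt at a fix."""
    if not actionable:
        return  # record it on a dashboard, but never wake anyone up
    script = REMEDIATIONS.get(alert_name)
    if script:
        try:
            result = subprocess.run(script, timeout=60)
            if result.returncode == 0:
                return  # the script fixed it; no page needed
        except (OSError, subprocess.TimeoutExpired):
            pass  # remediation itself failed; fall through to paging
    page_on_call(alert_name)

handle_alert("cpu_above_80_percent", actionable=False)  # never pages
handle_alert("service_down:web", actionable=True)       # script first, then page
```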

It is important to think about the granularity of your monitoring. For example, the availability of a database system might only need to be checked every 15 seconds or so, but the latency of the same system should be sampled at least every second to capture all query activity. You will want to think about this at each layer of your monitoring. It is a classic tradeoff: a higher volume of data collection in exchange for more detailed information.
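
A small sketch of what different check granularities might look like; the check names, intervals, and simple polling loop are illustrative assumptions, not a recommendation for any specific monitoring agent.

```python
import time

# Availability polled coarsely, latency sampled every second.
CHECKS = [
    {"name": "db_availability", "interval_s": 15, "last_run": 0.0},
    {"name": "db_query_latency", "interval_s": 1, "last_run": 0.0},
]

def run_check(name):
    print(f"running {name}")  # replace with a real probe

def scheduler_loop(duration_s=5):
    """Tiny polling loop: each check fires when its own interval elapses."""
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        now = time.monotonic()
        for check in CHECKS:
            if now - check["last_run"] >= check["interval_s"]:
                run_check(check["name"])
                check["last_run"] = now
        time.sleep(0.1)

scheduler_loop()
```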

Aggregation

In addition to doing real-time monitoring, it is important to understand how your metrics look over time. This can give you key insights into things like SAN capacity and the health of your systems over the long term. It also lets you identify anomalies and hot spots (e.g., end-of-month processing) and plan for peak loads. This leads to another point: you should treat collected metrics as distributions of data rather than averages. For example, if most of your storage requests are answered in less than 2 milliseconds but several take over 30 seconds, those anomalies will be masked in an average. By using histograms and percentiles in your aggregation, you can quickly identify out-of-bounds values in an otherwise well-performing system.
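
The sketch below illustrates the point with invented latency samples: the mean blurs typical requests and the slow tail into a single number, while a p50/p99 pair from the standard library's `statistics` module shows that most requests are fast and that a severe tail exists.

```python
import statistics

# Synthetic storage latencies: mostly 1-2 ms, with a few requests over 30 s.
samples_ms = [1.2, 1.5, 1.8, 1.1, 1.6, 1.4, 1.3, 1.7, 1.9, 1.05] * 20
samples_ms += [30500.0, 31000.0, 34000.0]

mean_ms = statistics.mean(samples_ms)
percentiles = statistics.quantiles(samples_ms, n=100)  # 99 cut points
p50_ms = percentiles[49]
p99_ms = percentiles[98]

print(f"mean: {mean_ms:8.1f} ms")  # one ambiguous number
print(f"p50:  {p50_ms:8.1f} ms")   # the typical request is still fast
print(f"p99:  {p99_ms:8.1f} ms")   # the out-of-bounds tail is obvious
```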

Standardize

Defining a few categories of systems and standardizing your data collection allows for common definitions and can drive toward common service level indicators. That, in turn, lets you build a common template for each indicator and a shared goal of higher levels of service.
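
As one possible way to standardize, the sketch below defines a common indicator template as a dataclass; the field names and example values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevelIndicator:
    """A common template for an indicator, reusable across system categories."""
    name: str              # e.g. "availability", "p99_query_latency"
    category: str          # "user_facing", "storage_network", "database"
    unit: str              # "percent", "ms", "tps"
    target: float          # the agreed goal, e.g. 99.9 or 50.0
    check_interval_s: int  # how often the indicator is sampled

# The same template applied to two very different systems:
web_availability = ServiceLevelIndicator(
    name="availability", category="user_facing", unit="percent",
    target=99.9, check_interval_s=15,
)
db_latency = ServiceLevelIndicator(
    name="p99_query_latency", category="database", unit="ms",
    target=50.0, check_interval_s=1,
)
print(web_availability, db_latency, sep="\n")
```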

Anonymous

Top Comments

  • I think you are completely right byrona!  Our audits are taking longer and longer for the auditors to complete every week.  With the size growing, everything takes longer to process.

  • ecklerwr1 this is definitely a problem and one we are dealing with more and more.  Not only is the data getting more difficult to sift through, but it's getting more and more difficult for people to maintain the alert, dashboard, etc. definitions, as what you need to be looking for is ever changing.

    I think this is a place where technology is going to have to evolve to solve the problem.  AI parsing the data in real time, in memory, looking for anomalies, storing the anomaly data in one data store for quick access by analysts, and then taking all of the other data and archiving it so that it's available but not necessarily as quickly accessible.  Something like that will be necessary.

  • lolol I read that too... I'm surprised the filter didn't catch it!

  • Thanks for all of the comments! This is part of a series where we will talk about some more modern approaches to monitoring. What I was trying to illustrate in the post is that the ideal way to monitor is through key business processes. This is a big shift from monitoring from a purely system-metrics perspective; however, you still need to collect a reasonable amount of system data to support the underlying monitoring you are doing and to aid troubleshooting. That said, as I mentioned, it's also important to think about those metrics and not collect too much superfluous data. My background is as a DBA, and I love to have all the data about a problem; however, 90% of the troubleshooting I've ever done can be handled with a log file and IO and CPU usage stats.

    The alerting component is very much specific to your RTO/RPO needs and your level of automation. In a future post, we'll talk about how you can start to automate away some of the paging events that happen in your environment. And one last message: don't underestimate the human element of paging, especially when you have full-time staff who are getting paged out of bed and aren't getting bonus pay for solving problems.