Building a monitoring and alerting system should always be driven by your business needs. This is always a debate between the IT organization which tends to focus on granular measures, whereas the business users would like to see more of an end to end picture of the organization. An example of this would be uptime--as a DBA, if my database is available and servicing requests, I feel as though I’ve met my uptime goals, whatever they may be. However, if a load balancer goes down taking away access to the application tier, the application is unavailable to users, and that is all that matters. Building a monitoring solution that looks at systems holistically is challenging, and sometimes requires working backwards from desired monitoring objectives (is the system up) to the choosing indicators (is the database service available and writeable), and then building a target.

 

Defining Service Level Objectives

 

You want to focus on what your users care about, and not necessarily what is easy to measure. There are two main areas you will want to use to define these objects--performance and uptime. One notion that comes from Google’s Site Reliability Engineering is the notion of an error budget--a rate at which these service level objects can be missed. Additionally, having an error budget can allow you to be more aggressive with upgrades and resolving technical debt. While evaluating projects and change control efforts you can know that if you are well ahead of your SLO budget you can be more aggressive with rollout. If you are behind the curve, you may curtail some migration efforts.

 

Target Values for SLOs

 

Target values will be a negotiation between IT and the business. From an IT perspective it is important to not overpromise--for example if you only have one physical server in your stack, you probably aren’t going to reach 99.99% uptime. This is important for a few reasons, but in my opinion the biggest is helping the business users understand the correlation between resource cost and availability. In the above one server example, if the business wants that application to deliver 99.99% uptime, it is going to have to invest in redundancy at several levels. There are a few other tenants to think about:

 

  • Past performance isn’t a predictor of future performance--While building a performance target off of your historic baseline is a good start, it does not address the problem of a system that performs well at its current level, but that will fall off a cliff without a major reengineering effort.
  • Don’t Overthink Your Targets--While it may be tempting to bring in someone from the data science team to create your new targets using a complex machine learning K-means clustering algorithm, you are better of creating simple targets like percentage uptime and throughput. If you can’t explain your target in a sentence it is likely too complex.
  • Absolutes are bad--The notion of a system that is always available and can scale infinitely is completely unrealistic. Even hyperscale cloud providers have difficulties delivering 99.999% uptime. It’s better to promise what you can deliver and make the business understand what the cost of delivering more is.

 

This process allows you to set clear expectations with your business and reduces some of the finger pointing during outages. It does require a strong relationship between IT management and senior leadership of your organization, but in the end delivers IT that can be kept up to date while meeting the business needs of the organization.