Defining an SLO: Understanding Business Needs

jdanton over 5 years ago 2 minute read time

Building a monitoring and alerting system should always be driven by your business needs. There is often tension here: the IT organization tends to focus on granular measures, whereas business users would rather see an end-to-end picture of the organization. Uptime is a good example. As a DBA, if my database is available and servicing requests, I feel as though I’ve met my uptime goals, whatever they may be. However, if a load balancer goes down and cuts off access to the application tier, the application is unavailable to users, and that is all that matters. Building a monitoring solution that looks at systems holistically is challenging, and sometimes requires working backwards from the desired monitoring objective (is the system up?) to choosing indicators (is the database service available and writeable?), and then setting a target.
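To make the load balancer example concrete, here is a minimal sketch of how end-to-end availability falls below the availability of any single tier when every tier must be up for users to reach the application. The tier names and availability figures are hypothetical, and the calculation assumes the tiers fail independently, which is a simplification.

```python
from math import prod


def end_to_end_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    return prod(component_availabilities)


# Hypothetical figures for illustration only.
tiers = {
    "load_balancer": 0.999,
    "app_tier": 0.999,
    "database": 0.9995,
}

overall = end_to_end_availability(tiers.values())
print(f"Database alone: {tiers['database']:.4%}")
print(f"End to end:     {overall:.4%}")
```

Even though the database on its own looks healthy, the figure users experience is lower than any individual tier, which is why the objective is better measured at the point where users reach the system.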

Defining Service Level Objectives

You want to focus on what your users care about, and not necessarily what is easy to measure. There are two main areas you will want to use to define these objectives: performance and uptime. One notion that comes from Google’s Site Reliability Engineering practice is the error budget, the rate at which these service level objectives can be missed. Having an error budget also allows you to be more aggressive with upgrades and with resolving technical debt. When evaluating projects and change control efforts, if you are well ahead of your error budget you can be more aggressive with rollouts. If you are behind the curve, you may need to curtail some migration efforts.
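The error budget idea can be sketched in a few lines. The example below assumes an uptime SLO measured over a 30-day window; the function names and the 25% caution threshold are hypothetical choices for illustration, not part of any standard.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining_fraction(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (zero or below means it is used up).

    A target of exactly 1.0 leaves no budget at all and is not supported here.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def rollout_posture(remaining_fraction, cautious_below=0.25):
    """Map remaining budget to a change posture; the threshold is an arbitrary example."""
    if remaining_fraction <= 0:
        return "freeze"
    if remaining_fraction < cautious_below:
        return "cautious"
    return "aggressive"


# Hypothetical month: a 99.9% target with 10 minutes of downtime so far.
remaining = budget_remaining_fraction(0.999, downtime_minutes=10)
print(f"Budget: {error_budget_minutes(0.999):.1f} min")
print(f"Remaining: {remaining:.0%}, posture: {rollout_posture(remaining)}")
```

The point of the posture function is only to show the decision described above: plenty of budget left supports faster rollouts, while a spent budget argues for holding back changes.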

Target Values for SLOs

Target values will be a negotiation between IT and the business. From an IT perspective it is important not to overpromise. For example, if you only have one physical server in your stack, you probably aren’t going to reach 99.99% uptime. This is important for a few reasons, but in my opinion the biggest is helping business users understand the correlation between resource cost and availability. In the one-server example above, if the business wants that application to deliver 99.99% uptime, it is going to have to invest in redundancy at several levels. There are a few other tenets to think about:

Past performance isn’t a predictor of future performance--While building a performance target from your historical baseline is a good start, it does not address the problem of a system that performs well at its current level but will fall off a cliff without a major reengineering effort.
Don’t Overthink Your Targets--While it may be tempting to bring in someone from the data science team to create your new targets using a complex machine learning algorithm such as K-means clustering, you are better off creating simple targets like percentage uptime and throughput. If you can’t explain your target in a sentence, it is likely too complex.
Absolutes are bad--The notion of a system that is always available and can scale infinitely is completely unrealistic. Even hyperscale cloud providers have difficulty delivering 99.999% uptime. It’s better to promise what you can deliver and help the business understand what delivering more would cost.
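To put the uptime percentages discussed above into terms that fit in a sentence, the sketch below converts a target into allowed downtime per year and shows how redundancy changes the picture. The 99.5% single-server figure is a hypothetical input, and the redundancy formula assumes the copies fail independently and that any one surviving copy can carry the load, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes_per_year(target):
    """Minutes per (non-leap) year a system may be down and still meet the target."""
    return (1.0 - target) * MINUTES_PER_YEAR


def redundant_availability(single_availability, copies):
    """Availability of N redundant copies when any one copy is enough.

    Assumes independent failures, so the system is down only if all copies are down.
    """
    return 1.0 - (1.0 - single_availability) ** copies


for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes_per_year(target)
    print(f"{target:.3%} uptime allows {minutes:.1f} minutes of downtime per year")

# Hypothetical single-server availability, for illustration only.
single_server = 0.995
for copies in (1, 2, 3):
    availability = redundant_availability(single_server, copies)
    print(f"{copies} server(s): {availability:.4%}")
```

Each additional nine shrinks the allowance by a factor of ten, and each redundant copy adds cost, which is the trade-off the business needs to see when it asks for a higher target.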

This process allows you to set clear expectations with your business and reduces some of the finger-pointing during outages. It does require a strong relationship between IT management and the senior leadership of your organization, but in the end it delivers IT that can be kept up to date while meeting the business needs of the organization.

Top Comments

tinmann0715 over 5 years ago

When I took over monitoring I was in the unique position of defining our SLOs. Before me they did not exist. The user community defined our current levels, application availability, and overall UX. I had a hard time selling my proposed SLOs because my execs didn't want to commit without fully understanding the level of commitment... $$$, resources, etc.
petergwilson over 5 years ago

Just had an instance this morning where one of the HR muppets logged a ticket saying the HR database server was down. Using a combination of monitoring and a very small amount of brain power I was able to prove that it wasn't. SAM monitoring showed that it was up as was the database and Appinsight for SQL showed good stats. Trial of LEM showed last reboot a few days ago. Brainpower allowed me to log into the system and retrieve data. Also, no one else had a problem. He couldn't really argue with nice graphs and screenshots.
bobmarley over 5 years ago

Ha! So true!
mtgilmore1 over 5 years ago

Nice article.
david.botfield over 5 years ago

Good Article