The Myth of the 5 9's

I just posted this over on my personal blog (The Myth of Five-Nines | AdatoSystems Monitoring), but I'm sharing it here too. Hopefully you can use it to help explain to colleagues, customers or management why five-nines reliability may NOT be a goal to strive for nor a promise to make.


“Five-Nines” refers to something that is available 99.999% of the time. It’s become a catchphrase in various parts of the IT industry.

It’s also complete bullshit.

Sean Hull did a great job explaining why five-nines is overrated in this post. But my point is NOT that this level of reliability is expensive. It’s that it’s nearly impossible to functionally achieve.

I’m also saying that the demands for (or the claims of) “reliability in the five-nines” are highly overblown.

Let’s do the math.

  • In a single minute, 5-9s means you could be unavailable for just 0.0006 seconds.
  • In an hour, you could have 0.036 seconds of downtime.
  • In a day, your system would get 0.864 seconds of breathing room.
  • In a week, you could take a 6.048-second break before dropping below 5-9s.
  • In a 30-day month, you’d get only 25.92 seconds of downtime.
  • In any given fiscal quarter, you could expect just over a minute (roughly 78.8 seconds) of outage.
  • Half a year? You get about two and a half minutes (roughly 157.7 seconds) where that system was not available.
  • And in a whole 365-day year, your 5-9s system could experience just 5.26 minutes (315.36 seconds) of outage. Total. Over the entire year.
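If you want to redo that math yourself (or for a different target, like four nines), here’s a minimal sketch. The window lengths are my own choices, not anything sacred:

```python
# Allowed downtime for a given availability target over common windows.
TARGET = 0.99999  # "five nines"

WINDOWS = {
    "minute": 60,
    "hour": 3_600,
    "day": 86_400,
    "week": 604_800,
    "30-day month": 2_592_000,
    "365-day year": 31_536_000,
}

for name, seconds in WINDOWS.items():
    allowed = seconds * (1 - TARGET)  # downtime budget in seconds
    print(f"{name:>14}: {allowed:.4f} seconds of downtime allowed")
```

Swap in `TARGET = 0.9999` and the yearly budget jumps from about 5 minutes to about 53 — which is why the difference between four and five nines matters so much.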

You seriously expect any device, server or service to be available to all users for all but about five minutes in an entire year?

This has implications for us as monitoring professionals. After all, we’re the ones tasked with watching the systems, raising the flag when things become unavailable. When someone is less than 99.999% available, we’re the ones the system owners come to, asking us to “paint the roses green”. We’re the ones that will have to re-check our numbers, re-calculate our thresholds, and re-explain for the thousandth time that “availability” always carries with it observational bias.

Yes, Mr. CEO, the server was up. It was up and running in a datacenter where a gardener with a backhoe had severed the WAN circuit; it was up and running and everyone in the country could see it except for you, because wifi was turned off on your laptop; it was up and running but it showed “down” in monitoring because someone changed the firewall rules so that my monitoring server could no longer reach it…

There’s an even more pressing fact-on-the-ground: polling cycles are neither constant nor instant. Realistic polling intervals sit at around 1–2 minutes for “ping” type checks, and 5 minutes for data collection. If I’m only checking the status of a server every minute, and my monitoring server is dealing with more than that one machine, the reality is that I won’t be cutting a ticket for a “down” device for 3–5 minutes. That blows your 5-9s out of the water right there.
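To put a number on that, here’s a back-of-the-napkin sketch of worst-case detection delay. The interval and confirmation-poll values are illustrative, not from any particular product:

```python
# Worst-case time before monitoring even NOTICES an outage, assuming the
# device is polled every `interval_s` seconds and an alert fires only after
# `confirm_polls` consecutive failed polls (a common anti-flap setting).
def worst_case_detection(interval_s: float, confirm_polls: int) -> float:
    # Worst case: the outage begins just after a successful poll, so nearly
    # a full interval passes before the first failed poll, followed by
    # (confirm_polls - 1) more intervals to confirm.
    return interval_s * confirm_polls

# 60-second ping checks, alert after 3 consecutive failures:
delay = worst_case_detection(60, 3)
print(delay / 60, "minutes")
```

That one undetected window is already more than half of the entire yearly five-nines budget, before anyone has even opened a ticket.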

But all of that is beside the point. First you need to let me know if your corporate server team is on board with 5-9s availability guarantees. Are they promising that the monthly patch-and-reboot process will take less than half a minute, end to end?

I’m thinking ‘no’.

Leon Adato | Head Geek
"Measure what is measurable,
and make measurable what is not so." - Galileo

1 Solution

Instead of thinking about the availability of a switch, router, or web server, think about the availability of the service.

If you're running multi-active data centers, where user requests can be serviced out of any of them, then you can do quite substantial work in a data center without affecting users. In that case you can quite easily get 5-9s operationally for a service, even while components are taken out of service. The difficulty is in the application design: the designer has to assume the infrastructure is built of unreliable equipment and subject to human error, and make sure the application can operate in that mode.
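The arithmetic behind that claim is simple, with one big caveat: it assumes the sites fail independently, which shared dependencies (DNS, a bad code push to every site at once) can quietly break. A minimal sketch:

```python
# If a service can be served from ANY of N data centers, the service is down
# only when ALL of them are down simultaneously. With per-site availability
# `a` and independent failures, service availability is 1 - (1 - a)^N.
# CAVEAT: correlated failures (shared DNS, simultaneous bad deploys) make
# real numbers worse than this model suggests.
def service_availability(site_availability: float, sites: int) -> float:
    return 1 - (1 - site_availability) ** sites

# Three sites at a modest three nines (99.9%) each:
print(f"{service_availability(0.999, 3):.9f}")
```

Three unremarkable sites clear five nines with room to spare — on paper. Which is exactly why the hard part is the application design, not the hardware.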

(as an aside I've found that computer science graduates often start with "Let's assume we have infinite storage, bandwidth, and the hardware has 5-9s of reliability ", whereas engineering graduates start with "let's assume that the network is down, the power supply is uncertain, and the code was written by a computer scientist..." )

