I just posted this over on my personal blog (The Myth of Five-Nines | AdatoSystems Monitoring), but I'm sharing it here too. Hopefully you can use it to help explain to colleagues, customers or management why five-nines reliability may NOT be a goal to strive for nor a promise to make.
“Five-Nines” refers to something that is available 99.999% of the time. It’s become a catch phrase in various parts of the IT industry.
It’s also complete bullshit.
Sean Hull did a great job explaining why five-nines is overrated in this post. But my point is NOT that this level of reliability is expensive. It’s that it’s nearly impossible to functionally achieve.
I’m also saying that the demands for (or the claims of) “Reliability in the five-nines” are highly overblown.
Let’s do the math.
- In a single minute, 5-9s means you would be unavailable just .0006 of a second.
- In an hour, you could have .036 seconds of downtime
- In a day, your system would get .86 seconds of breathing room
- In a week, you could take a 6.05 second break before being less than 5-9s
- In an entire month (30 days), you’d only get 25.92 seconds of downtime
- In any given fiscal quarter, you could expect just over a minute (78.84 seconds, to be precise) of an outage
- Half a year? You get over two and a half minutes (157.68 seconds) where that system was not available
- And in a whole year, your 5-9s system would experience just over 5 minutes (315.36 seconds) of outage. Total. Over the entire year.
You seriously expect any device, server or service to be available to all users for all but 5 minutes in an entire year?
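If you want to check the arithmetic above or rerun it for a different target, here is a minimal sketch. The function name and the period lengths (a 30-day month, a 365-day year) are my own assumptions for illustration; pick different period lengths and the longer-period numbers shift a little.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in seconds. The 30-day month and 365-day year are assumptions.
PERIODS = {
    "minute": 60,
    "hour": 60 * 60,
    "day": SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "month": 30 * SECONDS_PER_DAY,
    "quarter": 365 * SECONDS_PER_DAY / 4,
    "half year": 365 * SECONDS_PER_DAY / 2,
    "year": 365 * SECONDS_PER_DAY,
}


def downtime_budget_seconds(availability_percent, period_seconds):
    """Allowed downtime in seconds for one period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_seconds * unavailable_fraction


# Print the five-nines allowance for every period in the table.
for name, seconds in PERIODS.items():
    budget = downtime_budget_seconds(99.999, seconds)
    print(f"{name:>9}: {budget:.4f} seconds")
```

Swap 99.999 for 99.9 or 99.99 to see how quickly the allowance grows once you drop a nine or two.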
This has implications for us as monitoring professionals. After all, we’re the ones tasked with watching the systems, raising the flag when things become unavailable. When something is less than 99.999% available, we’re the ones the system owners come to, asking us to “paint the roses green”. We’re the ones that will have to re-check our numbers, re-calculate our thresholds, and re-explain for the thousandth time that “availability” always carries with it observational bias.
Yes, Mr. CEO, the server was up. It was up and running in a datacenter where a gardener with a backhoe had severed the WAN circuit; it was up and running and everyone in the country could see it except for you, because wifi was turned off on your laptop; it was up and running but it showed “down” in monitoring because someone changed the firewall rules so that my monitoring server could no longer reach it…
There’s an even more pressing fact-on-the-ground: Polling cycles are neither constant nor instant. Realistic polling intervals sit at around 1-2 minutes for “ping” type checks, and 5 minutes for data collection. If I’m only checking the status of a server every minute, and my monitoring server is dealing with more than that one machine, the reality is that I won’t be cutting a ticket for a “down” device for 3-5 minutes. That blows your 5-9s out of the water right there: a single detection lag of that size eats most, if not all, of the entire year’s allowance.
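To put numbers on that, here is a rough sketch comparing detection lag to the yearly five-nines allowance. The function names are hypothetical, the budget assumes a 365-day year, and the sketch assumes each outage lasts at least as long as it takes monitoring to notice it (repair time is ignored entirely, which is generous).

```python
import math

# Yearly five-nines allowance in seconds, assuming a 365-day year (about 315.36 s).
FIVE_NINES_YEARLY_BUDGET_S = 365 * 24 * 60 * 60 * 0.00001


def budget_consumed(detection_delay_s, budget_s=FIVE_NINES_YEARLY_BUDGET_S):
    """Fraction of the downtime budget used by one outage of this length."""
    return detection_delay_s / budget_s


def incidents_until_breach(detection_delay_s, budget_s=FIVE_NINES_YEARLY_BUDGET_S):
    """Smallest number of such outages whose total downtime exceeds the budget."""
    return math.floor(budget_s / detection_delay_s) + 1


# Detection lags of 1, 3 and 5 minutes, expressed in seconds.
for delay_s in (60, 180, 300):
    used = budget_consumed(delay_s)
    count = incidents_until_breach(delay_s)
    print(f"{delay_s:>3} s lag: {used:.0%} of yearly budget, breached after {count} incidents")
```

Under these assumptions a 3-minute lag uses about 57% of the year's allowance and a 5-minute lag about 95%, so the second incident of the year breaks the promise before anyone has even started fixing anything.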
But all of that is beside the point. First you need to let me know if your corporate server team is down with 5-9s availability guarantees. Are they promising that the monthly patch and reboot process will take less than half a minute, end to end?
I’m thinking ‘no’.