I just posted this over on my personal blog (The Myth of Five-Nines | AdatoSystems Monitoring), but I'm sharing it here too. Hopefully you can use it to help explain to colleagues, customers or management why five-nines reliability may NOT be a goal to strive for nor a promise to make.
“Five-Nines” refers to something that is available 99.999% of the time. It’s become a catch phrase in various parts of the IT industry.
It’s also complete bullshit.
Sean Hull did a great job explaining why five-nines is overrated in this post. But my point is NOT that this level of reliability is expensive. It’s that it’s nearly impossible to functionally achieve.
I’m also saying that the demands for (or the claims of) “Reliability in the five-nines” are highly overblown.
Let’s do the math.
You seriously expect any device, server or service to be available to all users for all but 5 minutes in an entire year?
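For anyone who wants the numbers spelled out, here is a quick sketch of that math in Python. The function name and the 365-day year with twelve equal months are my own simplifications, not anything official:

```python
# Downtime allowed by an availability target over a given period.
# Assumes a 365-day year and twelve equal months (a simplification).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of allowed unavailability for the given target and period."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year = downtime_budget_minutes(pct)
        per_month_s = downtime_budget_minutes(pct, MINUTES_PER_YEAR / 12) * 60
        print(f"{pct}%: {per_year:.2f} min/year, {per_month_s:.1f} s/month")
```

At 99.999% that works out to roughly 5.26 minutes a year, or about 26 seconds a month.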
This has implications for us as monitoring professionals. After all, we’re the ones tasked with watching the systems, raising the flag when things become unavailable. When something is less than 99.999% available, we’re the ones the system owners come to, asking us to “paint the roses green”. We’re the ones that will have to re-check our numbers, re-calculate our thresholds, and re-explain for the thousandth time that “availability” always carries with it observational bias.
Yes, Mr. CEO, the server was up. It was up and running in a datacenter where a gardener with a backhoe had severed the WAN circuit; it was up and running and everyone in the country could see it except for you, because wifi was turned off on your laptop; it was up and running but it showed “down” in monitoring because someone changed the firewall rules so that my monitoring server could no longer reach it…
There’s an even more pressing fact-on-the-ground: Polling cycles are neither constant nor instant. Realistic polling intervals sit at around 1-2 minutes for “ping” type checks, and 5 minutes for data collection. If I’m only checking the status of a server every minute, and my monitoring server is dealing with more than that one machine, the reality is that I won’t be cutting a ticket for a “down” device for 3-5 minutes. That blows your five-nines out of the water right there.
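To put rough numbers on that, here is a small Python sketch. It assumes the monitoring system waits for several consecutive failed polls before it alerts, which is common practice but my assumption here; the names `poll_interval_s` and `failed_polls_required` are made up for illustration, and the model ignores timeouts and poller load:

```python
# Worst-case time between a device failing and the alert firing,
# compared with a full year's five-nines downtime budget.
# Assumes a 365-day year and a simple fixed polling interval.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def yearly_budget_seconds(availability_pct):
    """Seconds of downtime per year allowed by the availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)


def worst_case_detection_seconds(poll_interval_s, failed_polls_required):
    """Device dies just after a good poll; alert fires on the Nth failed poll."""
    return poll_interval_s * failed_polls_required


if __name__ == "__main__":
    budget = yearly_budget_seconds(99.999)
    for polls in (3, 5):
        delay = worst_case_detection_seconds(60, polls)
        share = delay / budget * 100
        print(f"{polls} failed polls at 60 s: {delay} s to alert, "
              f"{share:.0f}% of the yearly budget")
```

Under these assumptions, a single outage can eat most of the year's allowance before anyone has even been notified, never mind fixed anything.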
But all of that is beside the point. First you need to let me know if your corporate server team is down with five-nines availability guarantees. Are they promising that the monthly patch and reboot process will take less than half a minute, end to end?
I’m thinking ‘no’.
Instead of thinking about the availability of a switch, router, or web server, think about the availability of the service.
If you're running multi-active data centers where user requests can be serviced out of any of them, you can do quite substantial work in one data center without affecting users. In that case you can quite easily get five-nines operationally for a service, even while components are taken out of service. The difficulty lies in the application design: ensuring it can operate in this mode, with the designer assuming that the infrastructure is built of unreliable equipment and subject to human error.
(As an aside, I've found that computer science graduates often start with "Let's assume we have infinite storage, bandwidth, and the hardware has 5-9s of reliability", whereas engineering graduates start with "let's assume that the network is down, the power supply is uncertain, and the code was written by a computer scientist...")
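As a rough illustration of the multi-active point, here is a Python sketch of the textbook formula for redundant sites. It assumes the sites fail independently and that any single surviving site can carry all users, which real deployments only approximate; the function name and the example figures are hypothetical:

```python
# Availability of a service that stays up as long as at least one of
# n sites is up. Assumes independent failures and that any single
# site can serve every user (both are idealizations).
def service_availability(site_availability, n_sites):
    """Return 1 - P(all sites down at once) for identical sites."""
    return 1 - (1 - site_availability) ** n_sites


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} site(s) at 99.9% each: {service_availability(0.999, n):.9f}")
```

On paper, two sites that each manage only three nines already clear five nines for the service. Correlated failures (shared DNS, a bad deployment pushed everywhere, human error) are what erode that figure in practice, which is why the application design matters so much.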