I just posted this over on my personal blog (The Myth of Five-Nines | AdatoSystems Monitoring), but I'm sharing it here too. Hopefully you can use it to help explain to colleagues, customers or management why five-nines reliability may NOT be a goal to strive for nor a promise to make.
“Five-nines” refers to something that is available 99.999% of the time. It’s become a catchphrase in various parts of the IT industry.
It’s also complete bullshit.
Sean Hull did a great job explaining why five-nines is overrated in this post. But my point is NOT that this level of reliability is expensive. It’s that it’s nearly impossible to functionally achieve.
I’m also saying that the demands for (or the claims of) “reliability in the five-nines” are highly overblown.
Let’s do the math. 99.999% availability over a 365-day year (525,600 minutes) leaves a budget of roughly 5.26 minutes of downtime.
You seriously expect any device, server, or service to be available to all users for all but 5 minutes in an entire year?
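The arithmetic is quick enough to sketch. Here's a small illustrative script (the function name and the 365-day-year assumption are mine, not from any standard):

```python
# Downtime budget implied by an availability target, assuming a
# 365-day year (525,600 minutes). Purely illustrative numbers.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label} ({target:.3%}): "
          f"{downtime_budget_minutes(target):.2f} minutes/year")
```

Each extra nine cuts the annual budget by a factor of ten: three nines allows about 8.8 hours a year; five nines allows about 5 minutes and 15 seconds.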
This has implications for us as monitoring professionals. After all, we’re the ones tasked with watching the systems, raising the flag when things become unavailable. When someone is less than 99.999% available, we’re the ones the system owners come to, asking us to “paint the roses green”. We’re the ones that will have to re-check our numbers, re-calculate our thresholds, and re-explain for the thousandth time that “availability” always carries with it observational bias.
Yes, Mr. CEO, the server was up. It was up and running in a datacenter where a gardener with a backhoe had severed the WAN circuit; it was up and running and everyone in the country could see it except for you, because wifi was turned off on your laptop; it was up and running but it showed “down” in monitoring because someone changed the firewall rules so that my monitoring server could no longer reach it…
There’s an even more pressing fact on the ground: polling cycles are neither constant nor instant. Realistic polling intervals sit at around 1-2 minutes for “ping”-type checks, and 5 minutes for data collection. If I’m only checking the status of a server every minute, and my monitoring server is dealing with more than that one machine, the reality is that I won’t be cutting a ticket for a “down” device for 3-5 minutes. That blows your five-nines out of the water right there.
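To put numbers on that detection lag, here's a back-of-the-envelope sketch. The poll interval and ticketing delay are illustrative assumptions, not anyone's measured figures:

```python
# Back-of-the-envelope: how detection delay alone eats the five-nines
# budget. Interval and delay values are assumed, not measured.
POLL_INTERVAL_S = 60   # "ping"-type check once a minute
TICKET_DELAY_S = 180   # assumed alerting/ticketing lag (3 minutes)

# Worst case: the device dies right after a successful poll, so we
# wait nearly a full interval before the first failed check, then the
# ticketing pipeline adds its own delay on top.
worst_case_detection_s = POLL_INTERVAL_S + TICKET_DELAY_S

# Annual five-nines downtime budget, in seconds (365-day year).
FIVE_NINES_BUDGET_S = 365 * 24 * 3600 * (1 - 0.99999)

print(f"Worst-case detection lag:  {worst_case_detection_s} s")
print(f"Five-nines annual budget:  {FIVE_NINES_BUDGET_S:.0f} s")
```

With those assumptions, a single outage detected four minutes late has consumed most of the ~315-second yearly budget before anyone has even started fixing anything.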
But all of that is beside the point. First, you need to let me know whether your corporate server team is down with five-nines availability guarantees. Are they promising that the monthly patch-and-reboot process will take less than half a minute, end to end?
I’m thinking ‘no’.
Instead of thinking about the availability of a switch, router, or web server, think about the availability of the service.
If you're running multi-active data centers where user requests can be serviced out of any of them, then you can do quite substantial work in a data center without affecting users. In that case you can quite easily get five-nines operationally for a service, even while components are taken out of service. Where there is some difficulty is in the application design: ensuring it can operate in this mode, where the designer assumes the infrastructure is built of unreliable equipment and is subject to human error.
(As an aside, I've found that computer science graduates often start with "Let's assume we have infinite storage, bandwidth, and the hardware has five-nines of reliability", whereas engineering graduates start with "Let's assume that the network is down, the power supply is uncertain, and the code was written by a computer scientist...")
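The redundancy argument above can be sketched with the textbook parallel-availability formula. The function name is mine, and the key caveat is baked into the comments: the math only holds if failures are independent.

```python
# Sketch of why service-level redundancy changes the math: with n
# independent, interchangeable instances, the service is down only
# when ALL of them are down at once. Independence is a big assumption;
# shared failure modes (power, DNS, a bad deploy pushed everywhere)
# break it in practice.
def service_availability(instance_availability: float, n: int) -> float:
    """Availability of a service backed by n independent instances."""
    return 1 - (1 - instance_availability) ** n

# Four instances that are each only 99% available:
# unavailability = 0.01 ** 4 = 1e-8, i.e. roughly eight nines combined.
print(f"{service_availability(0.99, 4):.8f}")
```

That's the whole trick: you don't need five-nines components to build a five-nines service, you need enough independent, interchangeable copies of three-nines (or even two-nines) components.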
Part of a server team here, and we make no such guarantees. We do provide very good uptime, but not 99.999%.
I seem to recall that the origin of the '5 9s' - or at least the first time I heard it - was around carriers and circuit delivery. The tier-1 carriers, with robust infrastructure and redundancy of their own, can definitely get closer to this than an application/service team could.

The carriers are being a little disingenuous when they talk about uptime, too. We're moving away from these problems, but anyone remember the days of over-committed backhauls? I worked in some school districts in a very rural area for some time, and our carrier was a regional CLEC. What I found out after a few months was that they were oversubscribing bonded-T customers like it was going out of style, but only had one pipe for backhaul - a DS3. After dealing with a couple of puzzling latency issues, it comes out that there are 40-something T1s on that DS3. Too many sites try to grab too much data, and there you go - chunk-style.

That chunk-style uptime, however, was still uptime and contributed toward a positive for the carrier's SLA. Plus, as long as no one voted with their feet (and who could - it was the middle of nowhere with a dearth of broadband choices), the carrier wouldn't have to invest in beefing up the backhaul.
This was a good point that I *didn't* get into here (but plan to in another post). Uptime of *what*, exactly?
Yes, the server went down for 3 hours. It was one in a cluster of four. Shut up and let me get back to work.
Our focus at our FI is more Reliability than Availability.
Picture if we were building vases. It doesn't matter if the machine runs nonstop if it's turning out every fifth vase broken.
While we do measure availability, the LOB does not get a pass unless their server, app, or webpage is reliable.
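The vase analogy translates directly into two different metrics. Here's a toy sketch with made-up numbers (the request counts and uptime figures are hypothetical, just to show the two calculations diverging):

```python
# Toy distinction between availability and reliability, using
# hypothetical figures. A service can be "up" the whole time and
# still fail a noticeable share of the requests it handles.
requests_total = 10_000
requests_failed = 200          # every 50th "vase" comes out broken
uptime_seconds = 86_400        # the machine never stopped running
downtime_seconds = 0

availability = (uptime_seconds - downtime_seconds) / uptime_seconds
reliability = 1 - requests_failed / requests_total

print(f"availability: {availability:.3%}")   # the machine ran all day
print(f"reliability:  {reliability:.3%}")    # but 2% of output is broken
```

A ping monitor sees the first number; only request-level (or synthetic transaction) monitoring sees the second.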
I guess the question there is how you measure reliability on an ongoing basis.
Bonus points if you identify how you use SolarWinds tools to measure reliability. ;-)
I wish we did. For this we use another company's SaaS and Application Monitoring.
We have a path from firewall to network to server, at which point the other vendor takes over and measures Java and .NET, as well as synthetic monitoring where appropriate.
To be fair, most discussions around availability acknowledge that SLAs allow for scheduled downtime. Monthly patching is a great example. Even though VMs reboot much faster than their physical counterparts, a monthly reboot will certainly put you over that 5-minute mark.
I think RichardLetts is dead-on with his comment that availability needs to be monitored at the service level, not the server level. That's the S in SLA, after all! Applications and services spread across multiple systems give you a fighting chance at five-nines.
Finally, there's a very frank discussion that needs to happen to determine if a 99.999% uptime solution is worth the cost.