
    The Myth of the 5 9's

    Leon Adato

      I just posted this over on my personal blog (The Myth of Five-Nines | AdatoSystems Monitoring), but I'm sharing it here too. Hopefully you can use it to help explain to colleagues, customers, or management why five-nines reliability may NOT be a goal to strive for, nor a promise to make.

      ******************************************


      “Five-Nines” refers to something that is available 99.999% of the time. It’s become a catchphrase in various parts of the IT industry.

      It’s also complete ********.

      Sean Hull did a great job explaining why five-nines is overrated in this post. But my point is NOT that this level of reliability is expensive. It’s that it’s nearly impossible to achieve in practice.

      I’m also saying that the demands for (or the claims of) “reliability in the five-nines” are highly overblown.

      Let’s do the math. (If you want to check these numbers yourself, there’s a quick script after the list.)

      • In a single minute, 5-9s means you could be unavailable for just .0006 of a second.
      • In an hour, you could have .036 seconds of downtime.
      • In a day, your system would get .864 seconds of breathing room.
      • In a week, you could take a 6.048-second break before dropping below 5-9s.
      • In a 30-day month, you’d only get 25.92 seconds of downtime.
      • In any given quarter, you could expect just over a minute – 78.84 seconds, to be precise – of outage.
      • Half a year? You get about two and a half minutes – 157.68 seconds – where that system is not available.
      • And in a whole year, your 5-9s system would experience just over 5 minutes (315.36 seconds) of outage. Total. Over the entire year.
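
      Here’s that back-of-the-envelope script (Python; the 30-day month and quarter-of-a-year period lengths are my own assumptions, since “month” and “quarter” vary):

          # Allowed downtime for a given availability target, per period.
          # Assumes a 365-day year and a 30-day month.
          PERIODS = {
              "minute": 60,
              "hour": 3_600,
              "day": 86_400,
              "week": 604_800,
              "month (30 days)": 2_592_000,
              "quarter (1/4 year)": 31_536_000 / 4,
              "year (365 days)": 31_536_000,
          }

          def downtime_budget(availability: float) -> dict:
              """Seconds of allowed downtime per period at the given availability."""
              unavailable = 1.0 - availability
              return {name: secs * unavailable for name, secs in PERIODS.items()}

          for name, budget in downtime_budget(0.99999).items():
              print(f"{name:>20}: {budget:10.4f} seconds")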

      You seriously expect any device, server or service to be available to all users for all but 5 minutes in an entire year?

      This has implications for us as monitoring professionals. After all, we’re the ones tasked with watching the systems and raising the flag when things become unavailable. When a system is less than 99.999% available, we’re the ones its owners come to, asking us to “paint the roses green”. We’re the ones who will have to re-check our numbers, re-calculate our thresholds, and re-explain for the thousandth time that “availability” always carries observational bias with it.

      Yes, Mr. CEO, the server was up. It was up and running in a datacenter where a gardener with a backhoe had severed the WAN circuit; it was up and running and everyone in the country could see it except for you, because Wi-Fi was turned off on your laptop; it was up and running but it showed “down” in monitoring because someone changed the firewall rules so that my monitoring server could no longer reach it…

      There’s an even more pressing fact on the ground: polling cycles are neither constant nor instant. Realistic polling intervals sit at around 1-2 minutes for “ping”-type checks, and 5 minutes for data collection. If I’m only checking the status of a server every minute, and my monitoring server is dealing with more than that one machine, the reality is that I won’t be cutting a ticket for a “down” device for 3-5 minutes. That blows your 5-9s out of the water right there.
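
      To put rough numbers on that, here’s a sketch of the worst-case ticket delay, assuming a fixed polling interval and the common practice of requiring a few consecutive failed polls before alerting (the 60-second interval and 3-poll confirmation below are illustrative, not anyone’s real config):

          ANNUAL_BUDGET_S = 31_536_000 * (1 - 0.99999)  # ~315 seconds/year at 5-9s

          def worst_case_detection_s(poll_interval_s: float, confirm_polls: int) -> float:
              """Worst case: the device dies just after a successful poll, and we
              cut a ticket only after `confirm_polls` consecutive failed polls."""
              return poll_interval_s * confirm_polls

          delay = worst_case_detection_s(60, 3)
          print(f"ticket delay alone: {delay:.0f}s of a {ANNUAL_BUDGET_S:.0f}s annual budget")

      Three minutes of detection lag against a roughly 315-second annual allowance means a single unlucky outage has burned most of the year’s budget before a human even sees the ticket.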

      But all of that is beside the point. First you need to let me know if your corporate server team is down with 5-9s availability guarantees. Are they promising that the monthly patch-and-reboot process will take less than half a minute, end to end?

      I’m thinking ‘no’.

        • Re: The Myth of the 5 9's
          laserbeemer

          One other thought on focusing just on the 5 nines... It might be up... but it might not be running worth ****.

          • Re: The Myth of the 5 9's
            RichardLetts

            Instead of thinking about the availability of a switch, router, or web server, think about the availability of the service.


            If you're running multi-active data centers where user requests can be serviced out of any of them, then you can do quite substantial work in a data center without affecting users. In that case you can quite easily get 5-9s operationally for a service, even while components are taken out of service. The difficulty is in the application design: it has to be built to operate in this mode, with the designer assuming that the infrastructure is made of unreliable equipment and operated by fallible humans.
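
            To put numbers on why multi-active helps, here's a minimal sketch, under the usual (and generous) assumption that the data centers fail independently:

                def service_availability(component_availability: float, replicas: int) -> float:
                    """Availability of a service that stays up as long as at least
                    one of `replicas` independent active-active instances is up."""
                    return 1.0 - (1.0 - component_availability) ** replicas

                # Each data center individually at a modest 99.9%:
                for n in (1, 2, 3):
                    print(f"{n} active site(s): {service_availability(0.999, n):.7%}")
                # 1 site: 99.9%; 2 sites: 99.9999% -- already past five nines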


            (As an aside, I've found that computer science graduates often start with "Let's assume we have infinite storage, bandwidth, and hardware with 5-9s of reliability", whereas engineering graduates start with "Let's assume that the network is down, the power supply is uncertain, and the code was written by a computer scientist...")

            • Re: The Myth of the 5 9's
              rharland2012

              Part of a server team here, and we make no such guarantees. We do provide very good uptime, but not 99.999%.


              I seem to recall that the origin of the '5 9s' - or at least the first time I heard it - was around carriers and circuit delivery. The tier-1 carriers, with robust infrastructure and redundancy of their own, can definitely get closer to this than an application/service team could. But the carriers are being a little disingenuous when they talk about uptime, too. We're moving away from these problems, but does anyone remember the days of over-committed backhauls?

              I worked in some school districts in a very rural area for a while, and our carrier was a regional CLEC. What I found out after a few months was that they were oversubscribing bonded-T customers like it was going out of style, but they only had one pipe for backhaul - a DS3. After dealing with a couple of puzzling latency issues, it came out that there were 40-something T1s on that DS3. Too many sites try to grab too much data, and there you go - chunk-style. That chunk-style uptime, however, was still uptime, and contributed toward a positive for the carrier's SLA. Plus, as long as no one voted with their feet (and who could? It was the middle of nowhere, with a dearth of broadband choices), the carrier never had to invest in beefing up the backhaul.
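
              For anyone who hasn't done that math: a T1 is 1.544 Mbps and a DS3 is about 44.7 Mbps, so "40-something" T1s is well past the pipe's line rate (a quick sketch, taking 40 as the count):

                  T1_MBPS = 1.544    # T1/DS1 line rate
                  DS3_MBPS = 44.736  # DS3 line rate

                  t1_count = 40      # "40-something" bonded-T customers on one DS3
                  demand_mbps = t1_count * T1_MBPS
                  print(f"sold: {demand_mbps:.1f} Mbps on a {DS3_MBPS:.1f} Mbps backhaul")
                  print(f"oversubscription: {demand_mbps / DS3_MBPS:.2f}:1")
                  # ~61.8 Mbps of sold capacity on ~44.7 Mbps of pipe -- fine until
                  # enough sites pull data at once, and then: chunk-style.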

              • Re: The Myth of the 5 9's
                bsciencefiction.tv

                Our focus at our FI is more Reliability than Availability. 


                Picture it as if we were building vases. It does not matter if the machine runs nonstop if it is turning out every fifth vase broken.


                While we do measure availability, the LOB does not get a pass unless their server, app, or webpage is reliable.
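
                One way to make that distinction concrete (a sketch with made-up numbers, not our actual metrics): availability is time-based, reliability is outcome-based.

                    def availability(uptime_s: float, total_s: float) -> float:
                        """Time-based: fraction of the period the system was up."""
                        return uptime_s / total_s

                    def reliability(successful: int, attempted: int) -> float:
                        """Outcome-based: fraction of the work that came out right."""
                        return successful / attempted

                    # The vase machine: never stops, but every fifth vase is broken.
                    print(f"availability: {availability(86_400, 86_400):.1%}")  # 100.0%
                    print(f"reliability:  {reliability(4, 5):.1%}")             # 80.0%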

                • Re: The Myth of the 5 9's
                  michael stump

                  To be fair, most discussions around availability acknowledge that SLAs allow for scheduled downtime. Monthly patching is a great example. Even though VMs reboot much faster than their physical counterparts, a monthly reboot will certainly get you over that 5-minute mark.


                  I think RichardLetts is dead-on with his comment that availability needs to be monitored at the service level, not the server level. That's the S in SLA, after all! Applications and services spread across multiple systems give you a fighting chance at 5 9's.
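
                  The flip side of measuring at the service level: a request that has to traverse several components in series multiplies their availabilities together, which is why a single path through individual boxes can't get there (a sketch with illustrative numbers, assuming independent failures):

                      from math import prod

                      def path_availability(components: list[float]) -> float:
                          """A request that must cross every component in series
                          succeeds only if all of them are up (independence assumed)."""
                          return prod(components)

                      # Load balancer, app tier, database -- each a respectable 99.95%:
                      print(f"end-to-end: {path_availability([0.9995, 0.9995, 0.9995]):.4%}")
                      # ~99.85% -- below three nines, even though each box looks great.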


                  Finally, there's a very frank discussion that needs to happen to determine if a 99.999% uptime solution is worth the cost.

                  • Re: The Myth of the 5 9's
                    rharland2012

                    Does anyone in this conversation know of a specific five-nines offering that their business currently pays for? I don't see it much out in the wild.

                    • Re: The Myth of the 5 9's
                      Leon Adato

                      Not paid for, exactly - but it's a common demand/expectation made by "requesters" (read: "management") of providers. Or it's a claim made by design teams as they propose and build solutions.


                      That was the original source of frustration that led to me writing the article.

                      • Re: The Myth of the 5 9's
                        cahunt

                        We aim for 3 and 4 nines when it comes to service, but any major outage blows that out of the water. Beyond that, a large institution has so many points of failure that even a small office being offline can cut your numbers if your thresholds are offset. In the never-ending battle between management and the number crunchers, you are always faced with expectations outweighing capabilities.

                        Blame drive-thrus for this instant, always-on expectation that gluttonous users cannot get over.

                        • Re: The Myth of the 5 9's
                          sql_sasquatch

                          5 9's of availability takes planning, effort, and cost commitment, but it's not a myth. Tandem provided fault-tolerant servers and a hot-patchable OS. Now Stratus (and similar systems through NEC) are the main players in fault-tolerant servers: http://www.stratus.com/Products/ftServerSystems

                          Oracle is making a play in the hot-patchable OS space by applying Ksplice. The technology is out there, though it still requires that entire systems, including personnel, be structured around it. In some cases the cost is matched by value. In many cases it's not.

                          • Re: The Myth of the 5 9's
                            dsanders@path.org

                            I don't dispute your basic position - you're right, and you make a perfectly valid point - but I'd point out that less than 5 minutes of downtime a year is quite possible, given the right conditions.


                            I spent ten years working for Tandem Computers (now HP NonStop) and I can attest to more than one production system that had less than 5 minutes of downtime in a year. Consider core, critical systems like 911 and ATM networks, where downtime is not permitted...


                            That is where you find Tandem systems, with no single point of failure in either the hardware or the software; barring human error (usually the cause of those lost 5 minutes), they mostly don't go down.