Converting Business Requirements for Availability May Require Some Reality Checks

Level 9

Management loves uptime, but they rarely want to pay for it. It seems like that line pretty much explains a third of the meetings IT professionals have to sit through.

When we have conversations about uptime, they tend to go something like this:

     IT Worker: What are the uptime requirements for this application?

     Manager: 100%.

     IT Worker: OK, we can do that, but it's going to cost you about $1,000,000,000,000. What's the budget code you want me to bill that expense to? (OK, I made up the number, but you get the idea.)

     Manager: I'm not paying that much money. You have $35 in annual budget. That's all we can afford. Make it happen.

     IT Worker: We can’t get you 100% uptime for $35. For that we can get 9.9% uptime.

At this point, there’s a long discussion about corporate priorities, company spending, the realities of hardware purchasing costs, physics (the speed of light is important for disaster recovery configurations), and, depending on your corporate environment and how personally people take the conversation, something about someone’s parenting skills may come up.

No matter how the discussion goes, this conversation always comes down to the company's need for uptime versus the company's willingness to pay for it. When it comes to uptime, there has to be a discussion of cost, because uptime doesn't happen for free. Some systems are easier to design for uptime than others. With a web tier, for example, we can scale the solution wider and spread the workload across servers with a load balancer.

But what about the hardware running the VMs running your web tier? What if our VM farm is a two-node farm with each node running at 65% capacity? For day-to-day operations, that's a decent number. But what happens when one of those nodes fails? Now the surviving node is being asked to run at 130% capacity. That's going to be a problem, because roughly a quarter of the company's servers (or more) aren't going to be running: you don't have the capacity to run them. And depending on the support agreement for your hardware, they could be down for hours or days.
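The failover arithmetic for a small VM farm can be sketched in a few lines of Python. This is a rough illustration, not a sizing tool: it assumes identical nodes, an evenly spread workload, and no memory or CPU overcommit, and the function names are made up for this example.

```python
def load_on_survivors(nodes: int, utilization: float, failed: int = 1) -> float:
    """Load the surviving nodes must carry, as a fraction of their combined capacity.

    `utilization` is the per-node utilization before the failure (0.65 = 65%).
    """
    surviving = nodes - failed
    if surviving <= 0:
        raise ValueError("at least one node must survive")
    return nodes * utilization / surviving


def stranded_fraction(nodes: int, utilization: float, failed: int = 1) -> float:
    """Fraction of the original workload that has nowhere to run after the failure."""
    load = load_on_survivors(nodes, utilization, failed)
    if load <= 1.0:
        return 0.0  # everything still fits on the survivors
    return 1.0 - 1.0 / load


def max_safe_utilization(nodes: int, failed: int = 1) -> float:
    """Highest per-node utilization at which the farm can still absorb `failed` node losses."""
    return (nodes - failed) / nodes


# The two-node farm from the example, with each node at 65%.
print(f"Load on the surviving node: {load_on_survivors(2, 0.65):.0%}")
print(f"Workload with nowhere to run: {stranded_fraction(2, 0.65):.0%}")
print(f"Safe per-node utilization for two nodes: {max_safe_utilization(2):.0%}")
print(f"Same load spread over three nodes, one lost: {load_on_survivors(3, 0.65):.1%}")
```

Under these assumptions, a two-node farm can only tolerate a node failure if each node stays at or below half of its capacity, which is exactly the kind of number that belongs in the budget conversation.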

Buying another server may be an expensive operation for a company, but how much is that failed server going to cost the company? We may have planned for availability within the application, but if we don’t think about availability at the infrastructure layer, availability at the application layer may not matter.
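One way to see why the infrastructure layer can cancel out work done at the application layer is to multiply the availabilities of the layers a request depends on. The sketch below assumes the layers fail independently and that every layer must be up for the application to be up; the availability figures are hypothetical, not measurements.

```python
from math import prod


def serial_availability(layers: list[float]) -> float:
    """Availability of a stack in which every layer must be up (independent failures)."""
    return prod(layers)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent, interchangeable components where one is enough."""
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figures: a well-designed application tier on a single physical host.
app_tier = 0.9999
single_host = 0.99

print(f"App on one host:  {serial_availability([app_tier, single_host]):.4%}")

# The same application with a second host it can fail over to.
two_hosts = redundant_availability(single_host, 2)
print(f"App on two hosts: {serial_availability([app_tier, two_hosts]):.4%}")
```

With these made-up numbers, the single host drags the whole stack down to roughly its own availability, no matter how good the application tier is.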

The converse goes along with this. If we have a new application critical to our business, and the business doesn’t want to pay for availability, will they be happy with the availability of the application if it goes down because a physical server failed? Will they be OK with the application being down for hours or days because there’s nowhere to run the application? Odds are, they won’t be OK with this sort of outage, but the time to address this is before the outage occurs.

Designing availability for some applications is a lot harder than putting some web servers behind a load balancer. How should high availability (HA) be handled for file servers, profile servers, database servers, or network links? These quickly become very complex design decisions, but they're necessary discussions for the systems that need availability. If you build, manage, or own systems that the business cannot afford to have down for a few seconds, much less a few hours, then a discussion about availability, cost, and options needs to happen.
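It can help to bring a concrete downtime budget to that discussion, because "how many nines" is easier to negotiate once it is expressed in minutes. The following is a minimal sketch that assumes a 365-day year and counts all downtime, planned or not, against the target; the function name is invented for this example.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period for a given availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1.0 - availability_pct / 100.0) * period_days * 24 * 60


# Print a small table of common targets and their yearly downtime budgets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:>7}% uptime allows {minutes:9.2f} minutes ({minutes / 60:6.2f} hours) per year")
```

Putting a table like this next to a quote for the hardware needed to hit each row tends to make the trade-off between uptime and cost much easier to discuss.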

11 Comments
Level 14

Thanks for the article. FWIW - I would say NEVER claim 100% uptime is possible for anything. 

MVP

Thanks for the article.

Level 13

OMG this is so true.  I've got a doodle on my whiteboard that's been there for maybe 10 years.  I know one of the new peeps is starting to grok things when they quote it or let me know in some other way that they've seen the light.

  • Cost
  • Reliability
  • Performance
  • <Pick 2>
Level 16

I worked at a place where they did expect 99.999 percent uptime and they were willing to pay for it. Their systems ran like clockwork. Servers could be failed over for maintenance, SQL clusters could be moved, Disaster Recovery tests ran fairly smoothly, etc. Overall they did a pretty good job. Their testing lab was larger than most companies' actual data centers. All this came at a price and wasn't built overnight, but in the end it was worth every cent spent.

Level 13

Thanks for the article

Level 12

Having conversations with outside service providers is also needed in many cases, depending on the industry you are in. If they can't meet your uptime requirement, then you need to come up with another solution or live with what they can provide. Thanks for the article.

Level 15

Such a thought-provoking article. The engineer side of me thinks constantly about HA and when and where to deploy load balancers. It has taken a number of years for our environment to start to get there, though. It seems that everyone wants the uptime, but truly getting the $$ is the challenge.

Thanks for the posting.

Level 14

Oh, this is so true and the bane of my life. I've just been landed a new project. Circa 30 people on a new site around the corner. They need two server racks full of stuff to support the 30 users, three radio studios and one TV studio. They want a minimum of two hours on battery in the event of a power failure. I worked out all the technical bits and gave them a price for the UPS. Nope, can't afford that. What can you give us for 5K? About 30 minutes. Not good enough, we need two hours. They couldn't get the bit where two hours was 4 times the battery and I was offering it for just over twice the cost of the 30 minutes. I thought I had performed a miracle, but they wanted more. I think I'll offer 5 loaves and two fishes and see what response I get. I can't do the water into wine as I did the reverse at the weekend and need to recharge.

You've re-identified the age-old problem of mice deciding the cat is too dangerous, and that a bell must be put on it so the mice will hear it coming. No one is willing to commit the resources to accomplish the task.

Belling the Cat - Wikipedia

MVP

Worked at a company that wanted 99.999% uptime.  Well in the 1st quarter we had 2 vendor outages that killed the 5 nines early on.  Then a series of "unfortunate" mistakes further put a nail in that coffin in that it would have been over the limit even if the vendors hadn't killed the SLA.  I do not miss that sweat shop.

Level 11

Thanks for the article. I've worked at places as well that wanted 99.999% uptime until they were presented with the reality of the $$$. Usually a happy medium is reached, except for the few critical business systems.