Management loves uptime, but they rarely want to pay for it. It seems like that line pretty much explains a third of the meetings IT professionals have to sit through.
When we have conversations about uptime, they tend to go something like this:
IT Worker: What are the uptime requirements for this application?
IT Worker: OK, we can do that, but it’s going to cost you about $1,000,000,000,000. What’s the budget code you want me to bill that expense to? (OK, I made up the number, but you get the idea).
Manager: I’m not paying that much money. You have $35 in annual budget. That’s all we can afford from the budget. Make it happen.
IT Worker: We can’t get you 100% uptime for $35. For that we can get 9.9% uptime.
At this point, there’s a long discussion about corporate priorities, company spending, the realities of hardware purchasing costs, physics (the speed of light is important for disaster recovery configurations), and, depending on your corporate environment and how personally people take the conversation, something about someone’s parenting skills may come up.
No matter how the discussion goes, this conversation always comes down to the company's need for uptime versus the company’s willingness to pay for the uptime. When it comes to uptime, there has to be a discussion of cost, because uptime doesn’t happen for free. Some systems are more natural to design uptime for than others. With a web tier, for example, we can scale the solution wider and handle the workload through a load balancer.
But what about the hardware running the VMs running your web tier? What if our VM farm is a two-node farm running at 65% capacity? For day-to-day operations, that’s a decent number. But what happens when one of those nodes fails? Now instead of running at 65% capacity, you’re running at 115% capacity. That’s going to be a problem because 15% of the company’s servers (or more) aren’t going to be running because you don’t have the availability to run them. And depending on the support agreement for your hardware, they could be down for hours or days.
Buying another server may be an expensive operation for a company, but how much is that failed server going to cost the company? We may have planned for availability within the application, but if we don’t think about availability at the infrastructure layer, availability at the application layer may not matter.
The converse goes along with this. If we have a new application critical to our business, and the business doesn’t want to pay for availability, will they be happy with the availability of the application if it goes down because a physical server failed? Will they be OK with the application being down for hours or days because there’s nowhere to run the application? Odds are, they won’t be OK with this sort of outage, but the time to address this is before the outage occurs.
Designing availability for some applications is a lot harder than putting some web servers behind a load balancer. How should HA be handled for file servers, profile servers, database servers, or network links? These quickly become very complex design decisions, but they’re necessary discussions for the systems that need availability. If you build, manage, or own systems that the business cannot afford to go down for a few seconds, much less a few hours, then a discussion about availability, cost, and options needs to happen.