How Available Are Your Production Systems?

You can't usefully determine the necessary availability of a system until you know what level of service you expect from it. The meaning of "high availability" depends on what you need a system to do for you, how often, and how quickly, and on the consequences of your expectations being unmet.

Business continuity planning is that part of operations engineering that implements safeguards within a system based on an understanding of acceptable risk.

During trading hours, for example, the New York Stock Exchange requires very low transaction latency and 100% uptime. Visa requires 99.999% uptime from its core systems and network at all times. With so little acceptable risk, these systems warrant almost any effort that mitigates risk through fault tolerance: redundancy of components, real-time replication of critical data, and especially automatic failover strategies.
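To make a figure like 99.999% concrete, it helps to convert an availability percentage into the downtime it permits. The following is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions chosen for illustration, not part of any vendor's SLA definition.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime (in minutes) that a given availability target permits.

    availability_percent: target such as 99.999 (hypothetical input for illustration).
    period_days: length of the measurement window; 365 days is an assumption.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The permitted downtime is the fraction of the window NOT covered by the target.
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under these assumptions, "five nines" leaves roughly 5.26 minutes of downtime per year, which is why such targets justify heavy investment in redundancy and automatic failover.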

In contrast, during a recent outage, customers of GoDaddy web hosting services were probably surprised to discover the costs of outsourcing their IT needs: they had accepted their host's marketing boasts about infrastructure in place of an actual service level agreement (SLA) with terms for reimbursement.

Defining a Recovery Time Objective (RTO)

Justifying the costs of uptime means first knowing the point at which downtime begins to hurt the business the system supports. While NYSE and Visa are impacted immediately, your business may be able to go anywhere from 30 minutes to several hours before the inaccessibility of customer-facing systems results in one or more of the three biggest kinds of loss: reputation, data, and money.

If your impact analysis determines that your production system can comfortably withstand 30 minutes of downtime, then 30 minutes becomes your RTO. Everything you do by way of keeping your systems operational implicitly occurs against that RTO.

Since very few systems are capable of an entirely automatic response to an operational issue, recovery strategy most likely involves IT engineers performing triage based on alerts.

Ideally, your monitoring and alerting system gives you a nearly immediate indication of a system issue. Think of the alert as the T-Zero for meeting your RTO: the moment the recovery clock starts.
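The relationship between the alert (T-Zero) and the RTO can be sketched as a simple clock calculation. This is an illustrative sketch only: the 30-minute RTO comes from the example above, while the function names and timestamps are hypothetical and do not correspond to any monitoring product's API.

```python
from datetime import datetime, timedelta, timezone

# Assumed RTO, taken from the 30-minute example in the text.
RTO = timedelta(minutes=30)


def recovery_deadline(alert_time: datetime, rto: timedelta = RTO) -> datetime:
    """Latest moment service can be restored while still meeting the RTO."""
    return alert_time + rto


def remaining_budget(alert_time: datetime, now: datetime, rto: timedelta = RTO) -> timedelta:
    """Time left on the recovery clock; negative once the RTO has been missed."""
    return recovery_deadline(alert_time, rto) - now


def rto_met(alert_time: datetime, restored_time: datetime, rto: timedelta = RTO) -> bool:
    """True if service was restored on or before the deadline."""
    return restored_time <= recovery_deadline(alert_time, rto)


if __name__ == "__main__":
    # Hypothetical incident: the alert fires at 09:00 UTC (T-Zero).
    t_zero = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    twelve_minutes_in = t_zero + timedelta(minutes=12)
    print("Deadline:", recovery_deadline(t_zero).isoformat())
    print("Budget left after 12 minutes:", remaining_budget(t_zero, twelve_minutes_in))
```

Note that this clock starts at the alert, not at the actual failure. Any delay between the failure and the alert silently eats into the real recovery window, which is why detection speed matters as much as repair speed.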

Have a look at SolarWinds networking, storage, and systems monitoring products as providers of the critical T-Zero for meeting your different recovery time objectives.


1 Comment

Interesting article.