Scaling your Recovery Preparation

In a previous post I discussed the idea that making your production system optimally available requires knowing the recovery time objective (RTO) of each component. Triage of any operational issue then becomes an exercise in meeting the RTO for each impacted component and for the system overall.
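
As a rough sketch of that idea, a triage queue can simply be ordered by how soon each impacted component's RTO deadline expires. The component names and RTO values below are purely illustrative:

```python
from datetime import datetime, timedelta

# Hypothetical RTOs per component, in minutes (illustrative values only).
RTO_MINUTES = {"smtp-gateway": 60, "web-frontend": 15, "reporting-db": 240}

def triage_order(impacted, outage_start):
    """Order impacted components by how soon their RTO deadline expires."""
    deadlines = {c: outage_start + timedelta(minutes=RTO_MINUTES[c])
                 for c in impacted}
    return sorted(deadlines, key=deadlines.get)

print(triage_order(["reporting-db", "web-frontend", "smtp-gateway"],
                   datetime(2015, 6, 1, 9, 0)))
# -> ['web-frontend', 'smtp-gateway', 'reporting-db']
```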

In this post I cover the recovery challenges for outages of different scales. I’m assuming that a significant part of your business is web-based, meaning that company employees, consultants, and contractors, as well as customers for the company’s products and services, all access networked resources through web interfaces.

Multiple Network Devices Impacted

Let’s take the typical example of a bad configuration file pushed to multiple network devices. The misconfiguration potentially impacts any computer on the network whose packets pass through those devices.

A combination of SNMP-polled status and forwarded syslog alerts shows up in your monitoring console. If users can report their specific problems, you get clues for troubleshooting the larger issue and an evolving picture of its impact. For example, users who cannot send email but can call you on their IP phones steer you toward a review of the gateway devices in front of the SMTP server. Where possible, correlating alerts with user complaints helps in building a triage queue.
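
A minimal sketch of that correlation, assuming alerts and complaints have already been normalized to name the device they implicate (all field names, devices, and users below are hypothetical):

```python
from collections import Counter

# Hypothetical normalized feeds: each alert or complaint is tagged with
# the device (or gateway) it implicates.
alerts = [
    {"source": "snmp",   "device": "gw-smtp-01", "status": "ifDown"},
    {"source": "syslog", "device": "gw-smtp-01", "status": "BGP neighbor down"},
    {"source": "syslog", "device": "core-sw-02", "status": "config reload"},
]
complaints = [
    {"user": "alice", "symptom": "email", "device": "gw-smtp-01"},
    {"user": "bob",   "symptom": "email", "device": "gw-smtp-01"},
]

def triage_queue(alerts, complaints):
    """Rank devices by combined alert volume and user-reported impact."""
    score = Counter(a["device"] for a in alerts)
    score.update(c["device"] for c in complaints)
    return [device for device, _ in score.most_common()]

print(triage_queue(alerts, complaints))  # gw-smtp-01 surfaces first
```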

Assuming your monitoring system generates and escalates alerts, and your network configuration management system includes a trustworthy repository of configs and a history of config changes, you are well on your way to solving a misconfiguration issue with a team of three within an hour.
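
For instance, if the repository holds a trusted “golden” config for each device, a simple diff against the running config points straight at the bad push. Here is a sketch using Python’s standard difflib; the config snippets and file names are purely illustrative:

```python
import difflib

# Hypothetical configs: "golden" from the trusted repository, "running"
# pulled from the device after the bad push.
golden = """interface Gi0/1
 ip address 10.0.1.1 255.255.255.0
 no shutdown
""".splitlines()

running = """interface Gi0/1
 ip address 10.0.11.1 255.255.255.0
 no shutdown
""".splitlines()

# A unified diff highlights exactly which line the push changed.
for line in difflib.unified_diff(golden, running,
                                 fromfile="repo/gw-smtp-01.cfg",
                                 tofile="running-config",
                                 lineterm=""):
    print(line)
```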

An Entire Datacenter Impacted

If the datacenter that hosts your equipment goes down, the challenges in meeting your RTOs increase dramatically. You need more than just coordination within your network operations team. System administrators, network engineers, database administrators, and storage engineers are all scrambling simultaneously.

Ideally, your operations center serves as the point of integration for all concurrent troubleshooting, so that the layers of the production platform—network devices, database and storage, application servers, web servers—are brought up in the right order. Coordinating triage on this scale requires an operations console that rolls up status and alerts from all tools at once.
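
One way to encode that bring-up order is as a dependency graph that gets topologically sorted. A minimal sketch, assuming an acyclic set of dependencies; the layer names below are illustrative, not a prescription for your platform:

```python
# Each layer lists the layers it depends on (assumed acyclic).
DEPENDS_ON = {
    "network":     [],
    "storage":     ["network"],
    "database":    ["network", "storage"],
    "app-servers": ["database"],
    "web-servers": ["app-servers"],
}

def bringup_order(deps):
    """Topologically sort layers so each comes up after its dependencies."""
    order, seen = [], set()
    def visit(layer):
        if layer in seen:
            return
        seen.add(layer)
        for dep in deps[layer]:
            visit(dep)
        order.append(layer)
    for layer in deps:
        visit(layer)
    return order

print(bringup_order(DEPENDS_ON))
# -> ['network', 'storage', 'database', 'app-servers', 'web-servers']
```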

Of course, before solving problems in the down datacenter, you have already switched production services to systems in your backup datacenter, right? In a follow-up post I’ll discuss some aspects of that switchover process.
