In the first post of this series I discussed the importance of establishing recovery time objectives (RTOs) for each component in your production system. In the second post I covered the comparative demands of performing triage during outages of different magnitudes: an outage that impacts a set of network devices versus one that takes down systems in an entire datacenter.
In this post I want to discuss what was only implied in the last one: when a disaster hits a datacenter, meeting your RTOs may require you to switch service over from the primary datacenter to a secondary one before coordinating any recoveries within the downed datacenter.
Hot/Hot or Hot/Warm?
If you have a production system in which customers merely read static data when they access front-end interfaces, and do not write new data into the database, then you could potentially use a load-balancing technology (for example, F5) that routes traffic into web clusters located at two geographically distinct datacenters. In this case, you already have a highly available production system. Similarly, you could have database instances running in different datacenters as part of fully redundant service stacks. And as for feeding in new data, you may have jobs set up to update the database in each datacenter during staggered maintenance windows. In this case, your overall deployment is hot/hot.
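The hot/hot routing described above can be sketched as a health-aware selector that spreads traffic across whichever sites are up. This is a minimal illustration, not a real load balancer; the datacenter names and health flags are hypothetical, and an appliance such as an F5 would perform live health checks rather than read a static table.

```python
# Minimal sketch of hot/hot traffic routing across two datacenters.
# Datacenter names and health flags are hypothetical stand-ins for
# what a real load balancer learns from its health checks.

from itertools import cycle

def build_router(datacenters):
    """Round-robin over the datacenters currently marked healthy."""
    healthy = [dc for dc, up in datacenters.items() if up]
    if not healthy:
        raise RuntimeError("no healthy datacenter available")
    return cycle(healthy)

datacenters = {"dc-east": True, "dc-west": True}
router = build_router(datacenters)
targets = [next(router) for _ in range(4)]
# Alternates between the two sites: ['dc-east', 'dc-west', 'dc-east', 'dc-west']
```

If one site's flag is set to False, the router simply stops sending it traffic, which is why a read-only hot/hot deployment rides out a single-datacenter outage with no manual switch-over.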
However, I’m assuming a system in which customers do create new data during their online sessions. In that case, only one database at a time can be in an operational state, replicating changes to its warm backup. DNS directs all traffic to the datacenter where that database is located. A virtual IP device may sit in front of the cluster of web servers in this datacenter to provide redundancy and balance load, making that part of the system quite fault tolerant, but we still consider the deployment hot/warm.
In a hot/warm architecture, when the primary datacenter goes down, you first need to bring up the database instance in the secondary datacenter. Whatever database you use will have its own tools for draining the queues of replicated data and opening the standby database for reading and writing.
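The key gate in that step is not promoting the standby until the replication queue has drained, or you lose the last writes the primary managed to ship. A hedged sketch of that gate, with a hypothetical lag measurement (PostgreSQL, for instance, exposes replication lag through its system views and performs the actual promotion with its own tooling):

```python
# Hedged sketch of the promotion gate: only open the standby for
# writes once every replicated change has been applied. The inputs
# are hypothetical; a real system would read lag from the database's
# own replication status views.

def ready_to_promote(pending_replication_bytes: int, standby_reachable: bool) -> bool:
    """The standby may be promoted only when it is reachable and has
    applied every change replicated from the failed primary."""
    return standby_reachable and pending_replication_bytes == 0

assert ready_to_promote(0, True) is True
assert ready_to_promote(4096, True) is False   # queue not yet drained
assert ready_to_promote(0, False) is False     # standby unreachable
```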
When all systems are ready in the secondary datacenter, you can edit the DNS zone file(s) so that the A records map the service domains to the IP addresses of the appropriate web servers. As you make that change, monitor the domains: during the cut-over window, when requests cannot yet reach the new site, users should see a web page informing them that the services are in the process of being redirected. You can expect users to get the redirection notice for up to the time-to-live (TTL) associated with the last DNS answers they received. Five minutes is a common TTL; recursive resolvers and client browsers use cached answers until the TTL expires.
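The TTL arithmetic above is simple but worth pinning down: a resolver that cached an answer at time t keeps serving it until t + TTL, so the worst-case window in which clients still reach the old datacenter equals the TTL on the last answers handed out before the zone change. A small sketch (the cut-over timestamp is hypothetical):

```python
# Sketch of the TTL window: a cached DNS answer remains valid until
# the moment it was answered plus the TTL, so clients may keep hitting
# the old site for at most one full TTL after the zone change.

from datetime import datetime, timedelta

TTL = timedelta(minutes=5)  # a common TTL, per the text

def cache_expires_at(answered_at: datetime) -> datetime:
    """When a resolver that cached an answer at `answered_at` must re-query."""
    return answered_at + TTL

def still_cached(answered_at: datetime, now: datetime) -> bool:
    return now < cache_expires_at(answered_at)

cutover = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical zone-change time
assert still_cached(cutover, cutover + timedelta(minutes=4)) is True
assert still_cached(cutover, cutover + timedelta(minutes=5)) is False
```

This is also the argument for lowering the TTL well ahead of a planned switch-over: the shorter the TTL on the answers already in the wild, the shorter the redirection window.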
As new traffic comes into the secondary datacenter, a DNS lookup tool such as dig helps you verify that your DNS servers are handing out the revised answers for the front-end domains, and a user-event simulation tool confirms that DNS is steering visitors accordingly. If the primary datacenter is not entirely down, you can monitor the access logs of the web servers at that site for traffic that is still seeking your services through cached DNS answers. Concurrently, each operations team monitors its own piece of the system (network, web and application, database and storage) for status and alerts.
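That access-log check can be automated: any request logged at the old site after the zone change is a client still using a cached answer, and the count should fall to zero within roughly one TTL. A sketch, assuming logs in the common log format (the log lines and timestamps below are fabricated examples):

```python
# Sketch of monitoring the primary site's access logs after cut-over:
# requests logged there after the DNS change come from clients still
# holding cached answers. Log lines are fabricated examples in the
# common log format.

import re
from datetime import datetime

LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\]')
TS_FMT = "%d/%b/%Y:%H:%M:%S %z"

def stale_hits(log_lines, cutover):
    """Count requests that arrived at the old site after the DNS change."""
    count = 0
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and datetime.strptime(m.group("ts"), TS_FMT) >= cutover:
            count += 1
    return count

logs = [
    '10.0.0.1 - - [01/Jan/2024:11:59:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:12:03:10 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2024:12:04:55 +0000] "GET / HTTP/1.1" 200 512',
]
cutover = datetime.strptime("01/Jan/2024:12:00:00 +0000", TS_FMT)
assert stale_hits(logs, cutover) == 2  # two clients still on cached DNS
```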
When all systems are up and all alerts are resolved, you have successfully switched traffic to the secondary datacenter. Now the operations teams can turn to fixing problems back at the primary site.