Switching Online Services to a Secondary Datacenter

Level 13

In the first post of this series I discussed the importance of establishing recovery time objectives (RTOs) for each component in your production system. In the second post I covered the comparative demands of performing triage during outages of different magnitudes: one that impacts a set of network devices versus one that impacts systems in an entire datacenter.

In this post I want to discuss what was merely implied in the last one: when disaster hits a datacenter, meeting your RTOs may require you to first switch service over from the primary to a secondary datacenter, before coordinating any recoveries within the downed datacenter.

Hot/Hot or Hot/Warm?

If you have a production system in which customers merely read static data when they access front-end interfaces, and do not write new data into the database, then you could potentially use a load-balancing technology (for example, F5) that routes traffic into web clusters located at two geographically distinct datacenters. In this case, you already have a highly available production system. Similarly, you could have database instances running in different datacenters as part of fully redundant service stacks. As for feeding in new data, you might have jobs set up to update the database in each datacenter during staggered maintenance windows. In this case, we consider your overall deployment hot/hot.

However, I’m assuming a system in which customers do create new data through their online sessions. In this case, you can have only one database at a time in an operational state, replicating changes to its warm backup. DNS directs all traffic to the datacenter where that database is located. A virtual IP device might sit in front of the cluster of web servers in this datacenter to provide redundancy and balance load, making this part of the system very fault tolerant, but we still consider the deployment hot/warm.


In a hot/warm architecture, when the primary datacenter goes down, you first need to bring up your database instance in the secondary datacenter. Whatever database you use will have its own specific tools for draining the queues of replicated data and opening the standby database for reading and writing.
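The exact commands are database-specific (on PostgreSQL, for example, promotion is done with `pg_ctl promote`), but the control flow is generic: wait until the standby has applied everything in the replication queue, then promote it. A minimal sketch, where `replication_lag` and `promote` are hypothetical stand-ins for your database's own tooling:

```python
import time

def failover(replication_lag, promote, timeout_s=300, poll_s=5):
    """Wait until the standby has drained its replication queue,
    then open it for reads and writes.

    replication_lag: callable returning the amount of replicated
                     data still queued (0 means fully drained).
    promote: callable that opens the standby for writing
             (e.g. a wrapper around `pg_ctl promote` on PostgreSQL).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if replication_lag() == 0:   # queue fully drained
            promote()
            return True
        time.sleep(poll_s)           # poll again until drained or timeout
    # Timed out: an operator must decide whether to promote anyway
    # and accept the loss of undelivered changes.
    return False
```

The timeout matters: if the primary died before the queue drained, the lag may never reach zero, and accepting data loss becomes an explicit operator decision rather than an accident.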

When all systems are ready in the secondary datacenter, you can edit your DNS zone file(s) so that the A records map service domains to the appropriate web servers. As you make that change, you need to monitor the domains; during the black-out period, while requests cannot yet be answered from the secondary site, users should see a web page informing them that the services are being redirected. You can expect users to get the redirection notice for the duration of the time-to-live (TTL) associated with the last DNS answers. Five minutes is a common TTL; authoritative DNS servers and client browsers use cached answers until the TTL expires.
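In BIND-style zone file terms, the switch-over amounts to changing one address. A hedged illustration (the names, addresses, and 5-minute TTL below are made up for this example):

```
; Hypothetical zone fragment -- names and addresses are illustrative only
$TTL 300                                   ; 5-minute TTL on answers
www.example.com.  IN  A  192.0.2.10        ; before: primary datacenter VIP

; After the switch-over, the same record points at the secondary:
www.example.com.  IN  A  198.51.100.10
```

Until that 300-second TTL runs out, resolvers that cached the old answer will keep sending users to the primary address.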

As new traffic comes into the secondary datacenter, a good DNS tool helps you verify that the DNS servers are handing out the revised resolution for the front-end domains, and a user-event simulation tool confirms that DNS is steering visitors accordingly. If the primary datacenter is not entirely down, you can monitor the access logs of the web servers at that site for a report on traffic that is still seeking access to your services through cached DNS answers. Concurrently, operations teams monitor their own pieces of the system (network, web and application, database and storage) for status and alerts.

When all systems are up, and all alerts are resolved, you have successfully switched traffic to the secondary datacenter. It’s now time for the operations teams to fix problems back in the primary site.


Level 15

This provided some useful information and links. Thanks!

Level 12

Hot/hot can be achieved for a number of applications, but the system and workflows need to be designed to handle the fact that there may be a small delay as data is replicated, and to define what to do if no replication is possible.

Cloud load balancing across geographically separated reverse proxies with web optimization, WAF, and DDoS protection can also help make a site seem hot/hot, even if it is really hot/warm. One such solution is Imperva Incapsula. It merely requires a DNS record change to get things rolling, after which you can set how you want to fail over all your sites.

Even with the mentioned solutions, there will always be unforeseen outages outside our control that impact some or all of our target market. All we can do is mitigate the 80-90% we can and monitor as best we can.

About the Author
If I were a HAL 9000 series computing machine, I might be in an operational state on a space vessel somewhere in our little solar system, closer to Jupiter than Earth, with some probability of lethal malfunction; and, to understate the obvious, I would not be helping anyone here. But I do, or try to, help people watch their bits better. Therefore, I am probably not a HAL 9000 series computing machine. I alternate between feeling ambiguously clear (state='0' if you like) and clearly ambiguous (state='1' as it were). I enjoy verbing nouns.