Recovering from a Database Disaster: Planning Makes a Big Difference

By Joe Kim, SolarWinds EVP, Engineering and Global CTO

Because databases are so important to federal IT, I wanted to share a blog post written earlier this year by my SolarWinds colleague, Thomas LaRock.

Federal IT pros must be prepared for every situation. For most, a database disaster is not a matter of if, but rather when. The solution is simple: have backups. The first step to surviving any database disaster is having a good backup available and ready to restore.

But agencies shouldn’t assume that having a database backup is enough; while it’s the first step in any successful database disaster recovery plan, it’s certainly not the only step. Agencies should have in place a robust, comprehensive plan that starts with the assumption that a database disaster is inevitable and builds in layers of contingencies to help ensure quick data recovery and continuity of operations.

Let’s look at the building blocks of that plan.

Defining the database disaster recovery plan

There can be a lot of confusion surrounding the terminology used when creating a backup and recovery plan. So much, in fact, that it’s well worth the space to review definitions.

High Availability (HA): This essentially means “uptime.” If your servers have a high uptime percentage, then they are highly available. This uptime is usually the result of building redundancy into the critical components of the system.

Disaster Recovery (DR): This essentially means “recovery.” If you are able to recover your data, then you have the makings of a DR plan. Recovery is usually the result of having backups that are current and available.

Recovery Point Objective (RPO): This is the point in time to which you can recover data if there is a disaster and data is lost, as part of an overall continuity of operations plan. It defines an acceptable amount of data loss, expressed as a time period. The key here is to establish a number based on actual data and potential data loss.

Recovery Time Objective (RTO): This is the amount of time allowable for you to recover data.

It is important to note that the continuity of operations and recovery plan should include both a recovery point objective (RPO) and a recovery time objective (RTO).
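As a rough illustration of how an RPO can be checked against a backup schedule, the sketch below measures how much data would be lost if a disaster struck at a given moment. The function name, the timestamps, and the 15-minute RPO are hypothetical values chosen for illustration, not figures from any particular agency or product.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, disaster_time):
    """Return the time between the last backup taken before the disaster and the disaster itself."""
    earlier = [t for t in backup_times if t <= disaster_time]
    if not earlier:
        raise ValueError("no backup exists before the disaster")
    return disaster_time - max(earlier)


# Hypothetical schedule: a backup every 15 minutes, starting at 09:00.
backups = [datetime(2024, 1, 1, 9, 0) + timedelta(minutes=15 * i) for i in range(4)]
disaster = datetime(2024, 1, 1, 9, 52)  # hypothetical disaster time
rpo = timedelta(minutes=15)             # hypothetical RPO

loss = data_loss_window(backups, disaster)
print(loss, "within RPO" if loss <= rpo else "RPO missed")
```

If a scheduled backup fails or is skipped, the window grows, which is why the RPO should be checked against the backups that actually exist rather than the schedule on paper.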

Estimated Time to Restore (ETR): ETR estimates how long it will take to restore your data. This estimate will change as your data grows in size and complexity. Therefore, ETR is the reality upon which the RTO should be based.

Remember to check often to verify that the ETR for your data is less than the RTO before disaster strikes. Otherwise, you will not be as prepared as you should be.
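One way to make that check routine is to estimate the ETR from the size of the backup and a measured restore rate, then compare it with the RTO. The sketch below is a simplified model under stated assumptions: the function names, the restore rate of 10 GB per minute, the fixed 5-minute overhead, and the one-hour RTO are all hypothetical. A real ETR should come from timed test restores.

```python
def estimated_time_to_restore(backup_size_gb, restore_rate_gb_per_min, overhead_min=0.0):
    """Estimate restore time in minutes from backup size and restore throughput."""
    if restore_rate_gb_per_min <= 0:
        raise ValueError("restore rate must be positive")
    return backup_size_gb / restore_rate_gb_per_min + overhead_min


def meets_rto(etr_min, rto_min):
    """True when the estimated restore time is less than the recovery time objective."""
    return etr_min < rto_min


rto_min = 60  # hypothetical RTO of one hour

# Hypothetical data growth: the same database at 200, 400, and 800 GB.
for size_gb in (200, 400, 800):
    etr = estimated_time_to_restore(size_gb, restore_rate_gb_per_min=10, overhead_min=5)
    print(f"{size_gb} GB: ETR {etr:.0f} min, meets RTO: {meets_rto(etr, rto_min)}")
```

With these made-up numbers the plan holds at 200 GB and 400 GB but fails at 800 GB, which shows how data growth alone can push the ETR past an RTO that was once comfortable.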

Knowing your IT DR plan

Now that we know the lingo, let’s think about the plan itself. Some may consider replication all that’s necessary for successful data recovery. It’s not that simple. As I stated earlier, agencies should have a robust, comprehensive plan that includes layers of contingencies.

Remember that HA is not the same as DR. To see why, let’s look at replication, a common piece of technology that is often used as both an HA and a DR solution.

Take this scenario: You have a corruption at one site. The corruption is immediately replicated to all the other sites. That’s your HA in action. How you recover from this corruption is your DR. The reality is, you are only as good as your last backup.

Another reason for due diligence in creating a layered plan: keep your RPO and RTO in agreement.

For example, perhaps your RPO states that your agency database must be put back to a point in time no more than 15 minutes prior to the disaster, and your RTO is also 15 minutes. That means, if it takes 15 minutes in recovery time to be at a point 15 minutes prior to the disaster, you are going to have up to 30 minutes (and maybe more) of total downtime.

The reality of being down for 30 minutes and not the expected 15 minutes can make a dramatic difference in operations. Research, layers, and coordination between those layers are critical for a successful backup and recovery plan.
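The arithmetic behind that 30-minute figure can be laid out on a timeline. The sketch below assumes the worst case for both objectives, that is, a full 15 minutes of lost data and a full 15 minutes of restore time. The function name and the disaster time are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_timeline(disaster, data_loss, restore_time):
    """Return the recovery point, the time service resumes, and the gap between them."""
    recovery_point = disaster - data_loss  # state the database is restored to
    back_online = disaster + restore_time  # moment the restore completes
    return recovery_point, back_online, back_online - recovery_point


disaster = datetime(2024, 1, 1, 12, 0)  # hypothetical disaster at noon
point, online, gap = recovery_timeline(
    disaster,
    data_loss=timedelta(minutes=15),     # worst case allowed by the RPO
    restore_time=timedelta(minutes=15),  # worst case allowed by the RTO
)
print(point.time(), online.time(), gap)
```

The gap between the state the database returns to and the moment it is usable again is the sum of the two objectives, so planning around either number alone understates the impact.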

A final point to consider in your disaster recovery plan is cost. Prices rise considerably as you try to narrow the RPO and RTO gaps. As you approach zero downtime, and no data loss, your costs skyrocket depending upon the volume of data involved. Uptime is expensive.
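To put “approaching zero downtime” in concrete terms, the sketch below converts an uptime percentage into the downtime it allows per year. The function name and the sample percentages are illustrative assumptions, and the calculation uses a 365-day year. It says nothing about price, only about how quickly the downtime budget shrinks.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct):
    """Return the minutes of downtime per year permitted by an uptime percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Hypothetical targets: each extra "nine" cuts the downtime budget tenfold.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime allows about {allowed_downtime_minutes(pct):.1f} minutes of downtime per year")
```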

Cost is the primary reason some agencies settle for a less-than-robust data recovery plan, and why others sometimes decide that (some) downtime is acceptable. Sure, downtime may be tolerable to some degree, but not having backups or a DR plan in place for when your database will fail? That is never acceptable.

Find the full article on Federal Technology Insider.

Comment
    It would be good to hear about your DR solution with VMware SRM. I'm in the process of implementing SW, and the customer uses VMware SRM for their DR strategy.

    We have two VMware sites with WAN links to each site and no stretched VLANs. As part of the recovery plans, we change the IPs to sit in the new DC. However, SW doesn't seem to recognise the new IP addresses on the nodes. What's your approach/plan?
