
Recovering from a Database Disaster: Planning Makes a Big Difference


By Joe Kim, SolarWinds EVP, Engineering and Global CTO

Because databases are so important to federal IT, I wanted to share a blog written earlier this year by my SolarWinds colleague, Thomas LaRock.

Federal IT pros must be prepared for every situation. For most, a database disaster is not a matter of if, but rather when. The solution is simple: have backups. The first step to surviving any database disaster is having a good backup available and ready to restore.

But agencies shouldn’t assume that having a database backup is enough; while it’s the first step in any successful database disaster recovery plan, it’s certainly not the only step. Agencies should have in place a robust, comprehensive plan that starts with the assumption that a database disaster is inevitable and builds in layers of contingencies to help ensure quick data recovery and continuity of operations.

Let’s look at the building blocks of that plan.

Defining the database disaster recovery plan

There can be a lot of confusion surrounding the terminology used when creating a backup and recovery plan. So much, in fact, that it’s well worth the space to review definitions.

High Availability (HA): This essentially means “uptime.” If your servers have a high uptime percentage, then they are highly available. This uptime is usually the result of building out a series of redundancies regarding critical components of the system.

Disaster Recovery (DR): This essentially means “recovery.” If you are able to recover your data, then you have the makings of a DR plan. Recovery is usually the result of having backups that are current and available.

Recovery Point Objective (RPO): This is the point in time to which you can recover data if there is a disaster and data is lost, as part of an overall continuity of operations plan. It defines an acceptable amount of data loss, expressed as a time period. The key here is to establish a number based on actual data and potential data loss.

Recovery Time Objective (RTO): This is the amount of time allowable for you to recover data.

It is important to note that the continuity of operations and recovery plan should include both a recovery point objective (RPO) and a recovery time objective (RTO).

Estimated Time to Restore (ETR): ETR estimates how long it will take to restore your data. This estimate will change as your data grows in size and complexity. Therefore, ETR is the reality upon which the RTO should be based.

Remember to check often to verify that the ETR for your data is less than the RTO before disaster strikes. Otherwise, you will not be as prepared as you should be.
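As a rough illustration of that check, here is a minimal Python sketch. It assumes ETR can be approximated as data size divided by a restore rate you have measured in a real test restore; the function names and the figures (500 GB, 10 GB per minute, a 60-minute RTO) are hypothetical and not taken from any product.

```python
from datetime import timedelta


def estimate_time_to_restore(data_size_gb: float, restore_rate_gb_per_min: float) -> timedelta:
    """Approximate ETR as data size divided by a measured restore rate."""
    if restore_rate_gb_per_min <= 0:
        raise ValueError("restore rate must be positive")
    return timedelta(minutes=data_size_gb / restore_rate_gb_per_min)


def etr_within_rto(etr: timedelta, rto: timedelta) -> bool:
    """True when the estimated restore time is less than the RTO."""
    return etr < rto


# Hypothetical figures: the same database today and after it has grown.
rto = timedelta(minutes=60)
etr_today = estimate_time_to_restore(500, 10)         # 50 minutes
etr_after_growth = estimate_time_to_restore(750, 10)  # 75 minutes

print("today:", etr_within_rto(etr_today, rto))
print("after growth:", etr_within_rto(etr_after_growth, rto))
```

The second call shows why the check has to be repeated: the RTO stays at 60 minutes, but the data has grown enough that the estimate no longer fits inside it.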

Knowing your IT DR plan

Now that we know the lingo, let’s think about the plan itself. Some may consider replication to be all that’s necessary for successful data recovery. It’s not that simple. As I stated earlier, agencies should have a robust, comprehensive plan that includes layers of contingencies.

Remember that HA is not the same as DR. For an example of why that is, let’s look at a common piece of technology that is often used as both an HA and DR solution.

Take this scenario: You have a corruption at one site. The corruption is immediately replicated to all the other sites. That’s your HA in action. How you recover from this corruption is your DR. The reality is, you are only as good as your last backup.
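One way to put a number on "only as good as your last backup" is to look at the gaps in your backup history. The sketch below assumes you can only restore to the moment a backup was taken (no point-in-time log recovery), so the longest gap between backups is the most data you could lose. The timestamps and the one-hour RPO are made up for illustration.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times: list[datetime], now: datetime) -> timedelta:
    """Longest stretch with no backup, including the time since the latest one."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


# Hypothetical history: hourly backups with one missed run.
backups = [
    datetime(2017, 6, 1, 0, 0),
    datetime(2017, 6, 1, 1, 0),
    datetime(2017, 6, 1, 3, 0),  # the 02:00 backup never happened
]
now = datetime(2017, 6, 1, 3, 30)
rpo = timedelta(hours=1)

gap = worst_case_data_loss(backups, now)
print("worst-case data loss:", gap)
print("meets RPO:", gap <= rpo)
```

A single missed run is enough to push the exposure past the RPO here, and replication would not have helped.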

Another reason for due diligence in creating a layered plan: keep your RPO and RTO in agreement.

For example, perhaps your RPO states that your agency database must be put back to a point in time no more than 15 minutes prior to the disaster, and your RTO is also 15 minutes. That means, if it takes 15 minutes in recovery time to be at a point 15 minutes prior to the disaster, you are going to have up to 30 minutes (and maybe more) of total downtime.

The reality of being down for 30 minutes and not the expected 15 minutes can make a dramatic difference in operations. Research, layers, and coordination between those layers are critical for a successful backup and recovery plan.
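The arithmetic behind that example can be laid out on a timeline. This is a small sketch of one reading of it, in which the 30 minutes is the span between the point the data is rolled back to and the moment service resumes; the function name and the noon disaster time are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_timeline(disaster_at: datetime, rpo: timedelta, rto: timedelta):
    """Return (point data is restored to, time service is back, span between them)."""
    restored_to = disaster_at - rpo   # data rolls back this far in the worst case
    back_online = disaster_at + rto   # recovery finishes this late in the worst case
    return restored_to, back_online, back_online - restored_to


restored_to, back_online, span = recovery_timeline(
    disaster_at=datetime(2017, 6, 1, 12, 0),
    rpo=timedelta(minutes=15),
    rto=timedelta(minutes=15),
)
print("data restored to:", restored_to.time())
print("service back at:", back_online.time())
print("total impact:", span)
```

With both objectives at 15 minutes, the data goes back to 11:45 and service returns at 12:15, which is where the 30-minute figure comes from.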

A final point to consider in your disaster recovery plan is cost. Prices rise considerably as you try to narrow the RPO and RTO gaps. As you approach zero downtime, and no data loss, your costs skyrocket depending upon the volume of data involved. Uptime is expensive.

Cost is the primary reason some agencies settle for a less-than-robust data recovery plan, and why others sometimes decide that (some) downtime is acceptable. Sure, downtime may be tolerable to some degree, but not having backups or a DR plan in place for when your database will fail? That is never acceptable.

Find the full article on Federal Technology Insider.

13 Comments

This topic is timely--just a couple of hours ago I was talking with a DBA who admitted his company couldn't confirm that his databases were successfully and completely backed up each night.  They go through the motions, run it to their tape robot, rotate tapes off site.

But he was embarrassed that he couldn't say for certain the backups were good ones.  As we all know, "Your backups are only as good as your restores."

I'm no DBA, and my organization doesn't own SRM or DPA or SAM.  What SolarWinds product should I recommend to my DBA friend, so he can be certain his databases are all successfully backed up nightly?

Can any SW module even report on the success of a proper database backup?

What do others use, if not some SolarWinds module?

Level 20

We've struggled with scripts to lock the database before things are backed up for things like ClearQuest Oracle databases.  I've found that the backups work fine without locking the database but it could leave some hanging pointers out there I'm guessing... I've never had a problem with it yet.  We switched over to Commvault a few years ago.

We are fortunate here being completely virtualized and using VMware Site Recovery Manager. We have tested DR several times and we are able to restore with >2hr RTO. The only manual intervention we need to make is creating TEMPDB. We have not yet mastered the automation when the server comes online and our TEMPDB is way too volatile to put it into SRM.

This is a really useful article. Let's all say it together. HA is not DR.

shidoshi1000, do you have any comments on my questions?

Level 14

I worked at one company that thought disaster recovery meant restoring the data from the untested backup tapes piled on top of the servers.  We ended up with data replication to disks on another site, which were then backed up to tape for remote storage, and a contract to lease servers and desktops at short notice should the need arise.  I wasn't the most popular person with the finance department but the head of IT could sleep at night.

MVP

Thanks for sharing

Level 21

A disaster recovery plan is exactly that: a plan. But it also needs to include the technology bit.  It's also something you should be regularly testing to make sure it works and that changes haven't taken place over time that would negate it or compromise it in any way.

MVP

Having been through numerous audits in my career - NO, I'm not talking IRS comes and takes me away in cuffs - I mean of the HIPAA and PCI varieties. I've come to appreciate their mantra "If it's not documented you aren't doing it." While it hurts to hear that the first 1, 2 or 100 times, eventually it begins to sink in. Writing it down has the additional benefit of engaging a part of the brain that doesn't come into play when you are just doing your job, or as many of us have had to do "winging it."

I remember building a church management system (long ago, on a floppy disk based system) and running it for months myself. When I went to move on to another church I needed to build the "manual" for using the system. It's hard to imagine the things you do on a day to day basis, just out of habit and how long it takes to write down the steps. You do a step, write it down, do another step, write it down, move on to step three and forget where you were and have to go through the steps again. It also reveals "shortcuts" that you may have developed or steps you have begun leaving out. As painful as it may be documentation before, during and after is vital.

Level 14

So depending on whether we are talking SQL Server or Oracle, you could run a script to alert you if backups are not successful.

Here is a post for SQL Server backups. Custom Alert - SQL Server Backup Monitoring

Here is one for Oracle. Custom Alert - Oracle RMAN Backup Failed

Couple of thoughts:

  • These may work for your environment or be able to act as a starting place for monitoring and can be adjusted.
  • These will let you know if backups were not successful, but any backups should periodically be recovered for complete testing.

These scripts could potentially be integrated with SAM as well (SQL User Experience Monitor), but I have not tested them.
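As a rough idea of the logic such an alert boils down to, here is a small Python sketch. It is not one of the linked scripts and does not talk to SQL Server, Oracle, or SAM; it assumes you have already collected the finish time of the last successful backup per database, however you gather that. The database names and the 24-hour threshold are made up.

```python
from datetime import datetime, timedelta
from typing import Optional


def stale_backups(
    last_backup: dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta,
) -> list[str]:
    """Return databases with no successful backup inside max_age."""
    stale = []
    for name, finished_at in last_backup.items():
        # None means no successful backup has ever been recorded.
        if finished_at is None or now - finished_at > max_age:
            stale.append(name)
    return sorted(stale)


# Hypothetical inventory of last successful backup times.
now = datetime(2017, 6, 2, 6, 0)
last_backup = {
    "orders": now - timedelta(hours=2),
    "hr": now - timedelta(hours=30),
    "archive": None,
}

alerts = stale_backups(last_backup, now, max_age=timedelta(hours=24))
print("needs attention:", alerts)
```

This only tells you that a backup reported success recently. It says nothing about whether the backup can be restored, which is why the periodic test restore still matters.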

HTH.

Thanks!

Level 7

@tinmann0715 

Would be good to hear about your DR solution with VMware SRM. I'm just in the process of doing an implementation of SW and the customer uses VMware SRM for their DR strategy.

So we have 2 VMware sites, WAN links to each site, no stretched VLANs. As part of the recovery plans we change the IPs to sit in the new DC. However SW doesn't seem to recognise the new IP address on the nodes. What's your approach/plan?

@SWMonITor

We have nothing but good things to say about SRM. We have 2 DR sites, a slew of hosts, and up to 80 servers being replicated. Like you we do not extend VLANs. Servers receive different IPs when failed over. You can preset the DR IPs in SRM ahead of time. This will save oodles of time when failing over. As always... test! test! test!

To handle the SolarWinds monitoring issue, the only tried-and-true method I have found is to set up separate monitoring for the DR IPs when the server is failed over, complete with any components, services, or templates that you have for those servers while running in PRD. When those DR servers are turned off, unmanage the node in SW. The other route you can take, depending on how many nodes you'll be failing over, is to just update the IP being monitored for each node. It's manual and time-consuming, but most effective. You won't get accurate availability/SLA reporting either, as time is lost updating the IPs.