Before choosing a technical disaster recovery solution

Getting right into the technical nitty gritty of the Disaster Recovery (DR) plan is probably my favorite part of the whole process. I mean, as an IT Professional this is our specialty – developing requirements, evaluating solutions, and implementing products. And while this basic process of deploying software and solutions may work great for single task-oriented, department type applications, we will find that in terms of DR there are many more road blocks and challenges that seem to pop up along the way. And if we don’t properly analyze and dissect our existing production environments, or we fail to involve many of the key stakeholders at play, our DR plan will inevitably fail – and failure during a disaster event could be catastrophic to our organizations and, quite possibly, our careers.

So how do we get started?

Before even looking at software and solutions we really should have a solid handle on the requirements and expectations of our key stakeholders. If your organization already has Service Level Agreements (SLA’s) then you are well on your way to completing this first step. However, if you don’t, then you have a lot of work and conversations ahead of you. In terms of disaster recovery, SLA will drive both the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). An RPO essentially dictates the maximum amount of time in which an organization can incur data loss. For instance, if a service has an RPO of 4 hours we would need to ensure that no matter what we can always restore our service with no more than 4 hours of data loss, meaning we would have to ensure that restore points are created on a 4-hour (or smaller) interval. An RTO dictates the amount of time it takes to get our service restored and running after a failure. Thus, an RTO of 4 hours would essentially mean we have 4 hours to get the service up and running after the notification of a failure before we would begin to massively impact our business objectives.

Determining both RTO and RPO can become a very challenging process and really needs to involve all key stakeholders within the business. Our application owners and users will certainly always demand lower RPO and RTO values, however IT departments may inject a bit of realization into the process when a dollar value is placed on meeting those low RPO/RTOs. The point of the exercise though is to really define what’s right for the organization, what can be afforded, and create formal expectations for the organization.

Once our SLA’s, RTOs, and RPOs have been determined then IT can really get started on determining a technical solution to ensure that these requirements can be met. Hopefully we can begin to see the importance of having the expectations set beforehand. For instance, if we had a mission-critical business service with RTO of 10 minutes then we would most likely not rely on a tape backup to protect that service as it would take much longer than that restore from tape, instead, we would most likely implement some form of replication. On the flip side, a file server, or more specifically the data on the file server, may have an RTO of say 10 hours, at which point it could be cost effective to rely on backup to protect this service. My point is, having RTO and RPO set before beginning any technical discovery is key to getting a proper, cost-effective solution.

What else is there to consider?

Ten years ago, we would be pretty much done our preliminary work for a DR plan by simply determining RTO and RPO and could begin investigating solutions – but in today’s modern datacenters that’s simply not the case. We have a lot more at play. What about cloud?  What about SaaS? What about remote workers? Today’s IT deployments don’t just operate within the 4 walls of our datacenters and are most often stretched into all corners of the world – and we need to protect and adhere to our SLA policies no matter where the workload runs. What if Office 365 suddenly had an outage for 3 hours? Is this acceptable to your organization? Do we need to archive the mail somewhere else so at the very least the CEO can get that important message he needed? Same goes with our workloads that may be running in public clouds like Amazon or Azure – we need to ensure that we are doing all we can to protect and restore these workloads.

The upfront work of looking at our environments holistically, determining our SLAs, and developing RTO and RPO’s really do set IT up for success when it comes time to evaluate a technical solution. Quite often we won’t find just one solution that fits our needs – and in most deployments, we will see many different solutions deployed to satisfy a well-built DR plan. We may have one solution that handles backup of cloud and another that handles on-premises workloads. We could also have one solution that replicates to could, and another that moves workloads to our designated DR site. The point being that by focusing most of our time on the development of RPO, RTO, and business practices really lets the organization, and not IT, drive the disaster recovery processes – which in turn lets IT focus on the technical deployment and solutions built around it.

Thus far we have had two posts regarding developing our DR plan which dictate taking a step back and having discussions with our organizations before even beginning to evaluate and implement anything technical. I’d love to hear feedback on this. How do you begin your DR plans? Have you had those conversations with your organization around developing SLA’s? If so, what challenges present themselves? Quite often organization will look to IT for answers that should really be dictated by the business requirements and processes – what are your feelings on this? Leave me a comment below with your thoughts. Thanks for reading!

  • Another aspect of BC/DR that is often overlooked is the people.  If a disaster strikes in the area where your primary datacenter is (say a big earthquake) and that is also where your people are, who is going to do the work to restore stuff as most BC/DR plans are not fully automated.

  • Thanks for the kind words and congrats on getting the site off the ground!  I hear you on the "discussion" piece.  It's the hardest part trying to get non IT related people to commit to times and numbers - especially when dollars are involved!  Service Level Agreements turn into something called Service Level Expectations emoticons_happy.png

  • I am 80% done with a 2+ year complete top to bottom network redesign and DR capabilities. Each step carefully documented for safe keeping and the obligatory 3 AM " oh ****" call. The primary reason for the uber documentation, it was lacking or inadequate and did not survive the most destructive of disasters, the untimely death of the person who knew it best. It hasnt always been pretty but we are better for having gone through the process. Reality can be a hard task master.

  • We use those and I have posted a custom poller for them so that I can monitor them via SolarWinds. It seems redundant to pay for their services when we are already using SolarWinds and can monitor and alert from there.

  • Year there was the one time at a prior job when they were working on the chillers....  Everything was shut down except for the clustered system that handled POS transactions.  This was during the summer in Texas.  We had the doors open and fans running...we got to within 2 degrees of the point where we would have been forced shut those systems down.  It wa pretty much a miserable shift as I was the operator on duty that weekend day.  The only other person on the floor was the security guard outside.  Most all the lights were out to reduce heat so you only had one set of safety lights in the general corner of the room to utilize. 

Thwack - Symbolize TM, R, and C