The Devil is in the Details

I've been responsible for Disaster Recovery and Business Continuity Preparedness for my company for nearly eight years. As a Certified Business Continuity Professional, I have conducted well over a dozen DR exercises of varying scope and scale. Years ago, I inherited a complete debacle. Like almost all other Disaster Recovery professionals, I am always on the lookout for better means and methods to strengthen and mature our DR strategy and processes. So, I roll my eyes and chuckle when I hear about all of these DRaaS solutions or DR software packages.

My friends, I am here to tell you that to oversee a mature and reliable DR program, the devil is in the details. The bad news is that there is no quick-fix, one-size-fits-all, magic-wand solution that will let you put that proverbial check in the "DR" box. Much more is needed. Just about all the DR checklists and white papers I've ever downloaded, at the risk of being harassed by the sponsoring vendor, give pretty much the same recommendations. What they neglect to mention are the specifics, the intangibles, the details that will make or break a DR program.

First, test. Testing is great, important, and required. But before you schedule that test and ask the IT department, as well as many members of your business units, to participate and give up a portion of their weekend, you darn well better be ready. Remember, it is your name that is on this exercise. You don't want to have to go back however many months later and ask your team to give up another weekend to participate. Having to test processes only to fail hard after the first click will quickly call your expertise into question.

Second, trust but verify. If you are not in direct control of a mission-critical service, then audit and interview those who are responsible, and do not take their word for it when they say, "It'll work." Ask questions, request a demonstration, look at screens, walk through scenarios, and always ask, "What if...?"

Third, work under the assumption that the SMEs aren't always available. Almost every interview reveals a Single Point of Failure (SPoF) by the third "What if?" question.

"Where are the passwords for the online banking interfaces stored?"

"Oh! Robert knows all of them." answered the Director of Accounts Payable.

"What if Robert is on vacation on an African safari?"

"Oh!" said the director. "That would be a problem."

"What if we didn't fulfill our financial obligations for one day? Two or three days. A week?" I asked.

"Oh! That would be bad. Real bad!"

Then comes the obligatory silence as I wait for it: "I need to do something about that." Make sure you scribble that down in your notes and document it in your final summary.

Fourth, ensure proper configuration for connectivity, IP/DNS, and parallel installations. This is where you will earn your keep. While the DRaaS software vendors will boast of simplicity, the reality is that they can only simplify connectivity so much. Are your applications configured to use IP addresses rather than FQDNs? Do your B2B connections use FTP via public IP or DNS, and do they have redundant entries for each data center? The same question applies to VPNs. And don't forget parallel installations, including devices such as load balancers and firewalls. Most companies must manually update the rules for both PRD and DR, and I've yet to meet a disciplined IT department that maintains both instances accurately.
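
Rule drift between the two sites is the sort of gap that only shows up on exercise day, so I like to compare the rule sets mechanically ahead of time instead of by eyeball. Below is a minimal sketch, assuming each appliance can export its rules to a plain-text file with one rule per line; the file names are placeholders for whatever your gear actually produces.

```python
# Minimal sketch: diff firewall (or load balancer) rule exports from
# production and DR. File names and formats here are hypothetical;
# the only assumption is one rule per line in each export.

from pathlib import Path

def load_rules(path):
    """Read a rule export, ignoring blank lines and comment lines."""
    lines = Path(path).read_text().splitlines()
    return {ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")}

prd = load_rules("prd_firewall_rules.txt")  # hypothetical export from production
dr = load_rules("dr_firewall_rules.txt")    # hypothetical export from DR

missing_in_dr = sorted(prd - dr)
extra_in_dr = sorted(dr - prd)

print(f"{len(missing_in_dr)} rule(s) in PRD but not in DR:")
for rule in missing_in_dr:
    print("  -", rule)

print(f"{len(extra_in_dr)} rule(s) in DR but not in PRD:")
for rule in extra_in_dr:
    print("  +", rule)
```

Even a crude diff like this gives you a punch list to hand to the service owner before the exercise, instead of a surprise during it.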

Fifth, no one cares about DR as much as you do. This statement isn't always true, but always work under the assumption that it is. Some will care a whole lot. Others will care a little. Most will hardly care at all. It is your job to sell the importance of testing your company's DR readiness. I consistently promote our company's DR readiness, even when the next exercise isn't scheduled. My sales pitch is to remind IT that our 5,000+ employees are counting on us. People's mortgage payments, health insurance, and children's tuition all rely on their paychecks. It is our duty to make sure our mission-critical applications are always running so that revenue can be earned and associates can receive those paychecks. This speech only works somewhat because, let's face it, these exercises are usually a nuisance. While many IT projects push the business ahead, DR exercises are basically an insurance policy.

Sixth, manage expectations. This is pretty straightforward, but keep in mind that each participant has his or her own set of expectations, whether it be the executives, the infrastructure teams and service owners, or the functional testers. For example, whenever an executive utters the words "pass" or "fail," immediately correct them by saying "productive," reminding them that there is no pass/fail. Three years ago I conducted a DR exercise that came to a dead stop the moment we disconnected the data center's network connectivity. Our replication software was incorrectly configured: the replicators in DR needed to be able to talk to the Master in our production data center. All the participants were saying that the exercise was a failure, which triggered a certain level of panic. I corrected them and said, "I believe this was our finest hour!" Throughout your career, you should be prepared to correct people and help manage their expectations.
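
A dependency like that is easy to catch before the plug gets pulled if you go looking for it. Here is a rough sketch of the kind of pre-exercise check I mean; the config directory, file pattern, and production hostnames are all assumptions you would replace with whatever your replication product actually uses.

```python
# Rough sketch: flag DR replication configs that still reference hosts in
# the production data center. Everything here is an assumption: the config
# directory, the *.conf pattern, and the production identifiers themselves.

from pathlib import Path

PRD_MASTERS = {"repl-master.prd.example.com", "10.10.0.25"}  # assumed production hosts
CONFIG_DIR = Path("/etc/replication/conf.d")                 # assumed config location

for conf in sorted(CONFIG_DIR.glob("*.conf")):
    text = conf.read_text(errors="ignore")
    hits = [h for h in PRD_MASTERS if h in text]
    if hits:
        print(f"{conf.name}: still points at production ({', '.join(hits)})")
```

If the DR copies can't come up without phoning home to production, you want to learn that from a quick scan, not from a room full of testers.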

Seventh, delegate and drive accountability. Honestly, this isn't my strong suit. With every exercise I have conducted, the lead-up and prep finds dozens of gaps and showstoppers. What I need to get better at is holding the service owners accountable and delegating the responsibility for remediation when a gap or showstopper is identified. Instead, I often fall back on my 20+ years of IT background and try to fix it myself. This consumes my time AND lets the service owners off the hook. For example, while prepping for my most recent exercise, I learned that a 2TB disk drive containing critical network shares had stopped replicating months ago. The infrastructure manager told me that the drive was too big and volatile, and that it was consuming bandwidth and causing other servers to miss their RPO. Once I got over my urge to scream, I asked what space threshold needed to be reached before replication could be turned back on. I then asked him what he could do to reduce disk usage. He shrugged and said, "I don't know what is important and what isn't." So, I took the lead, identified junk data, and reduced disk usage by 60 percent. I should have made him own the task, but instead I took the path of least resistance.
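
For what it's worth, "identifying junk data" didn't require anything exotic. A short script that reports the largest files nobody has touched in years gives the service owner a concrete cleanup list. The sketch below is one way to do it; the share path and the three-year cutoff are assumptions, it only reports and never deletes, and nothing should be removed without the data owners' sign-off.

```python
# Minimal sketch: list the largest files on a share that haven't been modified
# in roughly three years, as cleanup candidates. The share path and cutoff are
# assumptions; this only reports, it never deletes.

import os
import time
from pathlib import Path

SHARE = Path(r"\\fileserver\critical_share")  # hypothetical share path
CUTOFF = time.time() - 3 * 365 * 24 * 3600    # ~3 years without modification

candidates = []
for root, _dirs, files in os.walk(SHARE):
    for name in files:
        path = Path(root) / name
        try:
            st = path.stat()
        except OSError:
            continue  # skip files we can't stat
        if st.st_mtime < CUTOFF:
            candidates.append((st.st_size, path))

candidates.sort(reverse=True)
total_gib = sum(size for size, _ in candidates) / 1024**3
print(f"{len(candidates)} stale files, about {total_gib:.1f} GiB reclaimable")
for size, path in candidates[:50]:            # top 50 offenders
    print(f"{size / 1024**2:10.1f} MiB  {path}")
```

Handing a report like that to the service owner keeps the remediation on their plate, which is exactly the accountability I keep telling myself to enforce.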

Eighth, documentation. Very few organizations have it, and those that do usually have only what is obsolete. The moment it is written down, some detail has changed. I have also learned that very few people refer to documentation after it is created.

So, there you have it. I have oodles more, but this article is long enough already. I hope you find what I shared useful in some capacity. And remember, when it comes to DR exercises, the devil is in the details.

Anonymous
  • My biggest thing with DR is documentation. I apply my documentation methods to any documentation I create, and I push everyone else to do the same, and I feel this especially applies to Disaster Recovery. DR may literally involve a disaster, and your head person who knows the processes and procedures inside and out may be unable to perform their duty because of whatever the disaster is. At that point, you still have to move forward with the DR or the business may completely go under, and this is where the documentation and my mentality toward it come into play. Your documentation should be thorough and detailed, to the point where someone with absolutely no IT experience whatsoever can follow your instructions and complete the task. Some people have told me this is overkill and a waste of time and resources, but it has come in very handy at times. Even when I refer to my own documentation, I appreciate the fact that I wrote it in a way pretty much anyone can understand, because there are times I have been woken up at horrible hours of the morning that should not exist to fix issues and had to resort to documentation. In those cases it was easy, because my brain was not exactly awake yet and I didn't have to put much thought into it.

    Take the recent weather event that blanketed the Midwest this past weekend. In my area alone we got 25 inches of heavy, wet, blowing, drifting snow in a 55-hour period. There were multiple road closures, and several businesses had ceiling collapses. Imagine if the ceiling caved in with 25 inches of snow on your main datacenter and you had to spin up your DR scenario, and your main guy lives on a road that has been shut down due to 8-foot snow drifts. Unless he owns a snowmobile, he won't be getting there any time soon. I couldn't get out of my alley until late Monday morning and was 4 hours late to work; if there had been an emergency where I needed to be on site, I likely could not have gotten in without massive effort, and probably would have had to walk the whole way.

    That ended up being longer-winded than I planned. Basically, do everyone a favor and make your documentation thorough and complete, and make it so someone with no IT experience can follow it from start to finish.
