The Devil is in the Details

I've been responsible for Disaster Recovery and Business Continuity Preparedness for my company for nearly eight years. In my role as a Certified Business Continuity Professional, I have conducted well over a dozen DR exercises of various scopes and scales. Years ago, I inherited a complete debacle. Like almost all other Disaster Recovery Professionals, I am always on the lookout for better means and methods to further strengthen and mature our DR strategy and processes. So, I roll my eyes and chuckle when I hear about all of these DRaaS solutions or DR software packages.

My friends, I am here to tell you that when it comes to overseeing a mature and reliable DR program, the devil is in the details. The bad news is that there is no quick-fix, one-size-fits-all, magic-wand solution that will allow you to put that proverbial check in the "DR" box. Much more is needed. Just about all the DR checklists and white papers I've ever downloaded, at the risk of being harassed by the sponsoring vendor, pretty much give the same recommendations. What they neglect to mention are the specifics, the intangibles, the details that will make or break a DR program.

First, test. Testing is great, important, and required. But before you schedule that test and ask the IT department, as well as many members of your business units, to participate and give up a portion of their weekend, you darn well better be ready. Remember, it is your name that is on this exercise. You don’t want to have to go back however many months later and ask your team to give up another weekend to participate. Asking people to test processes, only to fail hard after the first click, will quickly call your expertise into question.

Second, trust but verify. If you are not in direct control of a mission-critical service, then audit and interview those who are responsible, and do not take their word for it when they say, "It'll work." Ask questions, request a demonstration, look at screens, walk through scenarios, and always ask, "What if...?"

Third, work under the assumption that the SMEs aren't always available. Almost every interview uncovers a Single Point of Failure (SPoF) by the third "What if?" question.

"Where are the passwords for the online banking interfaces stored?"

"Oh! Robert knows all of them," answered the Director of Accounts Payable.

"What if Robert is on vacation on an African safari?"

"Oh!" said the director. "That would be a problem."

"What if we didn't fulfill our financial obligations for one day? Two or three days? A week?" I asked.

"Oh! That would be bad. Real bad!"

Then comes the obligatory silence as I wait for it. "I need to do something about that." Make sure you scribble that down in your notes and document it in your final summary.

Fourth, ensure the proper configuration for connectivity, IP/DNS, and parallel installations. This is where you will earn your keep. While the DRaaS software vendors will boast of simplicity, the reality is that they can only simplify connectivity so much. Are your applications programmed to use IP addresses and not FQDNs? Does your B2B use FTP via public IP or DNS? And do your partners have redundant entries for each data center? The same questions can be applied to VPNs. And don't forget parallel installations, including such devices as load balancers and firewalls. Most companies must manually update the rules for both PRD and DR. I've yet to meet an IT department disciplined enough to maintain both instances accurately.
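To make those two checks concrete, here is a minimal Python sketch of how you might start. It is not taken from any DR product. The function names (`find_hardcoded_ips`, `rule_drift`), the sample config text, and the rule strings are all hypothetical, and it assumes you can export application configs as text and firewall or load balancer rules as one normalized string per rule.

```python
import ipaddress
import re

# Candidate dotted-quad pattern; ipaddress validates each match afterwards.
_IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_hardcoded_ips(config_text):
    """Return (line_number, address) pairs for each literal IPv4 address."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for candidate in _IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # not a valid address, e.g. 999.1.1.1
            hits.append((number, candidate))
    return hits


def rule_drift(prd_rules, dr_rules):
    """Return the rules that exist in only one of the two environments."""
    prd, dr = set(prd_rules), set(dr_rules)
    return {
        "missing_in_dr": sorted(prd - dr),
        "missing_in_prd": sorted(dr - prd),
    }


# Hypothetical sample data, for illustration only.
SAMPLE_CONFIG = "db_host = 10.20.30.40\nftp_host = ftp.partner.example\n"
SAMPLE_PRD_RULES = ["allow tcp 443 web", "allow tcp 22 admin"]
SAMPLE_DR_RULES = ["allow tcp 443 web"]

if __name__ == "__main__":
    print(find_hardcoded_ips(SAMPLE_CONFIG))
    print(rule_drift(SAMPLE_PRD_RULES, SAMPLE_DR_RULES))
```

Treat the output as a list of questions for the service owner, not as a verdict. A four-part version string will look like an IP address to this scan, and PRD and DR rules that legitimately differ (site-specific addresses, for instance) need to be normalized before the comparison means anything.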

Fifth, no one cares about DR as much as you do. This statement isn't always true, but always work under the assumption that it is. Some will care a whole lot. Others will care a little. Most will hardly care at all. It is your job to sell the importance of testing your company's DR readiness. I consistently promote our company's DR readiness, even when the next exercise isn't scheduled. My sales pitch is to remind IT that our 5,000+ employees are counting on us. People's mortgage payments, health insurance, children’s tuition all rely on paychecks. It is our duty to make sure our mission-critical applications are always running so that revenue can be earned and associates can receive those paychecks. This speech works somewhat because, let's face it, these exercises are usually a nuisance. While many IT projects push the business ahead, DR exercises are basically an insurance policy.

Sixth, manage expectations. This is pretty straightforward, but keep in mind that each participant has his or her own set of expectations, whether it be the executives, the infrastructure teams and service owners, or the functional testers. For example, whenever an executive utters the words "pass" or "fail," immediately correct them by saying, "productive," reminding them that there is no pass/fail. Three years ago I conducted a DR exercise that came to a dead stop the moment we disconnected the data center's network connectivity. Our replication software was incorrectly configured: the replicators in DR needed to be able to talk to the Master in our production data center. All the participants were saying that the exercise was a failure, which triggered a certain level of panic. I corrected them and said, "I believe this was our finest hour!" We had found a showstopper during an exercise rather than during a real disaster. Throughout your career, you should be prepared to correct people and help manage their expectations.

Seventh, delegate and drive accountability. Honestly, this isn't my strong suit. With every exercise that I have conducted, the lead-up and prep often uncover dozens of gaps and showstoppers. What I need to be better at is holding the service owners accountable and delegating the responsibility for remediation when a gap or showstopper is identified. Instead, I often fall back on my 20+ year IT background and try to fix it myself. This consumes my time AND lets the service owners off the hook. For example, while prepping for my most recent exercise, I learned that a 2TB disk drive containing critical network shares had stopped replicating months ago. The infrastructure manager told me that the drive was too big and volatile, and that it was consuming bandwidth and causing other servers to miss their RPO. Once I got over my urge to scream, I asked how far disk usage needed to drop before replication could be turned back on. I then asked him what he could do to reduce disk usage. He shrugged and said, "I don't know what is important and what isn't." So, I took the lead, identified junk data, and reduced disk usage by 60 percent. I should have made him own the task, but instead took the path of least resistance.

Eighth, documentation. Very few organizations have it. And those that do usually have documentation that is already obsolete. The moment it is written down, some detail has changed. Also, what I have learned is that very few people refer to documentation after it is created.

So, there you have it. I have oodles more, but this article is long enough already. I hope you find what I shared useful in some capacity. And remember, when it comes to DR exercises, the devil is in the details.

  • You are correct. See my above response as it applies. Another point to add is that you have to get everyone involved. When new technology is being added or existing tech is being modified, the same question is repeatedly asked: "Will this work in DR?" This signifies adoption and ownership, which makes me eternally grateful.

  • Hence my blog title...

       Your DR strategy should not exist inside a vacuum. It should be a consideration for Change Mgmt. Every time there is a change, that change should be scrutinized from a DR perspective. Regular exercises keep service owners on their toes. No one likes to fail. They will do what they can to make sure their services meet RPO/RTO.

  • Tabletop exercises are great for exploring these scenarios. What do you do when you have job abandonment? What are the expectations? Do you have SPoFs at key positions?

      Walk it through and then talk it out. Then document your findings and submit to leadership. Schedule a follow-up meeting to review. They will provide direction and determine risk.

  • Several years back I worked at a company that was 300 feet off a major highway. We had a very bad wind storm with a lot of damage. I got called in to work on the generator for the data center, but the police had blocked the road and wouldn't let in anyone who wasn't a resident of that city. I wasn't a resident, so at first they wouldn't let me through. Once I explained I was there to work on the generator, they let me through. Most of the IT staff was turned away.

  • DR and BC are difficult to pin down, and the worst part is that the work is never complete. The plan has to be a "living" document to stay current with the day-to-day changes.
