
The Devil is in the Details

I've been responsible for Disaster Recovery and Business Continuity Preparedness for my company for nearly eight years. In my role as Certified Business Continuity Professional, I have conducted well over a dozen DR exercises of varying scope and scale. Years ago, I inherited a complete debacle. Like almost all other Disaster Recovery Professionals, I am always on the lookout for better means and methods to further strengthen and mature our DR strategy and processes. So, I roll my eyes and chuckle when I hear about all of these DRaaS solutions or DR software packages.

My friends, I am here to tell you that to oversee a mature and reliable DR program the devil is in the details. The bad news is that there is no real quick fix, one-size-fits-all, magic wand solution available that will allow you to put that proverbial check in the "DR" box. Much more is needed. Just about all the DR checklists and white papers I've ever downloaded, at the risk of being harassed by the sponsoring vendor, pretty much give the same recommendations. What they neglect to mention are the specifics, the intangibles, the details that will make or break a DR program.

First, test. Testing is great, important, and required. But before you schedule that test and ask the IT department, as well as many members from your business units, to participate and waste a portion of their weekend, you darn well better be ready. Remember, it is your name that is on this exercise. You don’t want to have to go back however many months later and ask your team to give up another weekend to participate. Testing processes only to fail hard after the first click will quickly call your expertise into question.

Second, trust but verify. If you are not in direct control of the mission critical service, then you audit and interview those who are responsible, and do not take their word for it when they say, "It'll work." Ask questions, request a demonstration, look at screens, walk through scenarios and always ask, "What if...?"

Third, work under the assumption that the SMEs aren't always available. Almost every interview has a Single Point of Failure (SPoF) by the third "What if?" question.

"Where are the passwords for the online banking interfaces stored?"

"Oh! Robert knows all of them." answered the Director of Accounts Payable.

"What if Robert is on vacation on an African safari?"

"Oh!" said the director. "That would be a problem."

"What if we didn't fulfill our financial obligations for one day? Two or three days. A week?" I asked.

"Oh! That would be bad. Real bad!"

Then comes the obligatory silence as I wait for it. "I need to do something about that." Make sure you scribble that down in your notes and document it in your final summary.
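The "What if Robert is on safari?" interview can also be run against a simple inventory of who knows what. Below is a minimal Python sketch, assuming you keep a hand-maintained mapping from each critical item to the people able to handle it. The function name, the inventory entries, and every name other than Robert are hypothetical and only illustrate the idea.

```python
def find_single_points_of_failure(knowledge_map, min_people=2):
    """Return the items that fewer than min_people distinct people can handle."""
    return {
        item: sorted(set(people))
        for item, people in knowledge_map.items()
        if len(set(people)) < min_people
    }


# Hypothetical inventory gathered from interviews (illustrative entries only).
inventory = {
    "online banking passwords": ["Robert"],
    "payroll run procedure": ["Alice", "Dinesh"],
    "firewall rule changes": [],
}

spofs = find_single_points_of_failure(inventory)
for item, people in spofs.items():
    holder = people[0] if people else "nobody"
    print(f"SPoF: {item} (known only by {holder})")
```

Anything this flags belongs in the final summary alongside the director's "I need to do something about that."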

Fourth, ensure the proper programming for connectivity, IP/DNS, and parallel installations. This is where you will earn your keep. While the DRaaS software vendors will boast of simplicity, the reality is that they can only simplify connectivity so much. Are your applications programmed to use IP addresses rather than FQDNs? Does your B2B use FTP via public IP or DNS? And do they have redundant entries for each data center? The same question can be applied to VPNs. And don't forget parallel installations, including such devices as load balancers and firewalls. Most companies must manually update the rules for both PRD and DR. I've yet to meet a disciplined IT department that maintains both instances accurately.
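Both checks in this point lend themselves to a quick script. The following is a minimal Python sketch, not any vendor's tooling. It assumes application configs are available as plain text and that firewall or load balancer rules can be exported as normalized strings, with site-specific addresses already replaced by logical names. The function names, sample config, and sample rules are all hypothetical.

```python
import ipaddress
import re

# Anything that looks like a dotted quad; validated below with ipaddress.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_ips(config_text):
    """Return (line_number, address) pairs for IPv4 literals found in config text."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for candidate in IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # looks like an address but is not one, e.g. 10.300.1.2
            hits.append((lineno, candidate))
    return hits


def rule_drift(prd_rules, dr_rules):
    """Compare two rule exports and report what exists on only one side."""
    prd, dr = set(prd_rules), set(dr_rules)
    return {
        "missing_in_dr": sorted(prd - dr),
        "missing_in_prd": sorted(dr - prd),
    }


# Hypothetical application config: two hard-coded addresses, one FQDN.
sample_config = """\
ftp_host = 203.0.113.10
api_host = orders.example.com
db_host = 10.20.30.40
"""
ip_hits = find_hardcoded_ips(sample_config)

# Hypothetical normalized rule exports for each site.
prd_rules = ["allow tcp 443 web", "allow tcp 22 admin", "allow tcp 1521 db"]
dr_rules = ["allow tcp 443 web", "allow tcp 22 admin", "allow tcp 8080 test"]
drift = rule_drift(prd_rules, dr_rules)

print(ip_hits)
print(drift)
```

Run on a schedule, a comparison like this turns "we maintain both instances" from a claim into something you can audit.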

Fifth, no one cares about DR as much as you do. This statement isn't always true, but always work under the assumption that it is. Some will care a whole lot. Others will care a little. Most will hardly care at all. It is your job to sell the importance of testing your company's DR readiness. I consistently promote our company's DR readiness, even when the next exercise isn't scheduled. My sales pitch is to remind IT that our 5,000+ employees are counting on us. People's mortgage payments, health insurance, children’s tuition all rely on paychecks. It is our duty to make sure our mission-critical applications are always running so that revenue can be earned and associates can receive those paychecks. This speech works somewhat because, let's face it, these exercises are usually a nuisance. While many IT projects push the business ahead, DR exercises are basically an insurance policy.

Sixth, manage expectations. This is pretty straightforward, but keep in mind that each participant has their own set of expectations, whether it be the executives, the infrastructure teams and service owners, or the functional testers. For example, whenever an executive utters the words "pass" or "fail," immediately correct them by saying, "productive," reminding them that there is no pass/fail. Three years ago I conducted a DR exercise that came to a dead stop the moment we disconnected the data center's network connectivity. Our replication software was incorrectly configured. The replicators in DR needed to be able to talk to the Master in our production data center. All the participants were saying that the exercise was a failure, which triggered a certain level of panic. I corrected them and said, "I believe this was our finest hour!" Throughout your career, you should be prepared to correct people and help manage their expectations.

Seventh, delegate and drive accountability. Honestly, this isn't my strong suit. With every exercise that I have conducted, the lead-up and prep often uncover dozens of gaps and showstoppers. What I need to be better at is holding the service owners accountable and delegating the responsibility of remediation when a gap or showstopper is identified. Instead, I often fall back on my 20+ year IT background and try to fix it myself. This consumes my time AND lets the service owners off the hook. For example, while prepping for my most recent exercise, I learned that a 2TB disk drive that contains critical network shares had stopped replicating months ago. The infrastructure manager told me that the drive was too big and volatile and that it was consuming bandwidth and causing other servers to fail their RPO. Once I got over my urge to scream, I asked what the space threshold was that needed to be reached to be able to turn the replication back on. I then asked him what he could do to reduce disk space. He shrugged and said, "I don't know what is important and what isn't." So, I took the lead and identified junk data and reduced disk space by 60 percent. I should have made him own the task, but instead took the path of least resistance.
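A volume that quietly stopped replicating months ago is the kind of gap a scheduled check can surface long before exercise prep does. Here is a minimal Python sketch, assuming you can obtain a last-successful-replication timestamp per volume from whatever replication tool is in use. The volume names, the four-hour RPO, and the fixed reference date are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def rpo_violations(last_replicated, rpo, now=None):
    """Return {volume: lag} for every volume whose replication lag exceeds the RPO."""
    now = now or datetime.now(timezone.utc)
    return {
        name: now - stamp
        for name, stamp in last_replicated.items()
        if now - stamp > rpo
    }


# Hypothetical status snapshot; the reference time is fixed so the run is repeatable.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
status = {
    "erp-db": now - timedelta(minutes=10),
    "file-shares": now - timedelta(days=90),  # stopped replicating months ago
}

violations = rpo_violations(status, rpo=timedelta(hours=4), now=now)
for name, lag in violations.items():
    print(f"{name}: last replicated {lag.days} days ago, RPO exceeded")
```

Sending the resulting list to the service owner, rather than fixing it yourself, keeps the accountability where it belongs.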

Eighth, documentation. Very few organizations have it. And those that do usually have only what is obsolete. The moment it is written down, some detail has changed. Also, what I have learned is that very few people refer to documentation after it is created.

So, there you have it. I have oodles more, but this article is long enough already. I hope you find what I shared useful in some capacity. And remember, when it comes to DR exercises, the devil is in the details.

33 Comments

Yes, it's a daunting task, one that we hope is never called on.  Exactly like a catastrophic accident health insurance policy, you don't dare move forward without it if you're a responsible person.  However, some folks can't afford the same insurance that others can, and this has its parallel here, too.  Smaller businesses (three people?) may not have a very complicated environment; therefore they may not need as much DR savvy and expertise and equipment as Microsoft does.  But those same three entrepreneurs MAY have a huge online business, with customers in dozens of countries.  Their DR plan may rely on many outside vendors' services being backed up and restored, being automatically rerouted, doing all manner of different things to keep the lights on should one or two key people be gone--perhaps eaten by Bob's lion!

Level 20

I feel for you Peter!  I can't imagine doing BC/DR for a full time job!

MVP

Good article, thanks for sharing

Level 12

My biggest thing with DR is documentation. I apply my documentation methods to any documentation I create, and I urge everyone else to do the same, and I feel this especially applies to Disaster Recovery. DR may literally involve a disaster, and your head person who knows the processes and procedures inside and out may be unable to perform their duty due to whatever the disaster is. At this point, you still have to move forward with the DR or the business may completely go under, and this is where the documentation and my mentality towards it come into play. Your documentation should be very thorough and detailed, to the point where someone with absolutely no IT experience whatsoever can follow your instructions and complete the task. Some people have told me this is overkill and a waste of time and resources, but I have had this come in very handy at times. Even when I refer to my own documentation I appreciate the fact that I wrote it in a way pretty much anyone can understand, because there are times I have been woken up at horrible hours of the morning that should not exist to fix issues and had to resort to documentation. In these cases it was easy because my brain was not exactly awake yet, so I didn't have to put much thought into it.

As an example, I will use the recent weather event that blanketed the Midwest this last weekend. In my area alone we got 25 inches of heavy, wet, blowing, drifting snow in a 55-hour period. There were multiple road closures and several businesses had ceiling collapses. Imagine if the ceiling caved in with 25 inches of snow on your main datacenter and you had to spin up your DR scenario, and your main guy lives on a road that has been shut down due to 8-foot snow drifts on it. Unless your guy owned a snowmobile he won't be getting there any time soon. I couldn't get out of my alley until late Monday morning and was 4 hours late to work. If there had been an emergency where I needed to be on site, I likely could not have managed to get in without massive effort, and likely walking the whole way.

That ended up being longer winded than I planned it to be. Basically do everyone a favor and make your documentation thorough and complete, and make it so someone with no IT experience can follow it from start to finish.

Level 11

I hear you on the importance.  Kudos for you if you can document at such a level that anyone can follow it.  My documentation is typically targeted for peer level.  The few times I've tried to "dumb it down" so to speak, I end up missing steps that I don't even usually have to think about, but someone non-technical would get hung up on.  This generally takes a lot of testing/review of the documentation and in most cases is such a level that I just don't have the time/resources to do so.  I guess that's a reason why Tech Writers exist right?  

Level 12

It is very funny that you mention Tech Writers. I actually picked up this habit and mentality in a Technical Report Writing class I had to take in college once. The class was literally 18 weeks of being given assignments to write reports and instructions on technical stuff. I absolutely hated the class while I was taking it, but now that I have been in IT for a while, I can finally appreciate everything I truly learned from it. I feel everyone in IT should probably take a class like it to learn how to do proper documentation and instructions.

Another thing I do is when I do something that I have instructions written up for, I will from time to time take the instructions and follow it from start to finish exactly as written, and try to find any changes or errors and correct them afterwards. This has helped me find a lot of small things like missed steps, out of order steps, wrong steps, and many other things too. Instructions should be a living document, just like the process that they are written for is, constantly changing and being updated.

Full-time job? HA!  I also manage the SAP Basis, Authorizations Security teams, NetSec, and corporate monitoring. I wish for a full-time job but I love all the other things I do.

Level 13

Good Article

Technical Writing impresses me mightily!  I took a Tech Writing class, and a Business Writing class, in college; I learned that you might have to know something really well to teach it to others, but you must know EVERY bit and permutation of it if you're going to do Technical or Business documentation of the product.

I think of Technical Writing as documenting fully and accurately all there is that must be known about a process or product, whereas I look to Business Writing to extend that Technical Writing detail out to a much broader scope.  Keeping all the technical info, then adding in more to cover the possible tangents and parallel implementations of the original technical document, with the goal of covering all possible legal business use of the product.

That's what Patent Attorneys and Copyright Authors may be expected to do--claim all possible uses and unexpected offshoots of a product as part of the intended and planned purpose of the original idea, for patent use.  In some ways it's unfair, in my view, since someone may see a product or idea and have a better idea, or immediately see an alternate or parallel use for it that was completely not in the mind of the inventor.  Our Patent laws allow that inventor to sometimes own those unimagined offshoots and parallel uses, for profit, despite them not being part of the original idea.  It's why Copyrighters and Patent Attorneys are so ridiculously detail oriented on the one side, and so incredibly broad and vague on the other--to claim everything is the property of the original inventor.  It's also why they get paid well and why they pay attention to every comma and proper spelling.   Case in point:  A single missing comma cost a company $5 Million when it got involved in a lawsuit:  Oxford Comma Dispute Is Settled as Maine Drivers Get $5 Million - The New York Times

For an exercise, try writing a Technical description of a thumbtack and make it so complete that anyone could describe and use and build one without ever having seen one.  Then try writing a Business description to patent the creation, manufacture, and use of that tack, writing it broad enough that the patent would prevent anyone from creating and patenting and using a pushpin in place of the thumbtack.

Correct. But these policies should scale in size to fit the need. An SMB is very much at risk and could be wiped out without any plans in place. The risk for smaller companies scares me more. The SPoFs are concentrated.

With regard to documentation, how would you rate the pace of change in your IT landscape? At my company I would rate it as volatile. In the past 3 years nearly 100% of our infrastructure has been refreshed and we've had considerable turnover. The ripple effect is that our documentation is practically obsolete by the time it is put to paper. This is why I am exploring tools like NTA, SAM, NCM, etc. as dynamic documentation alternatives.

Level 20

Let's hope the new network atlas for NPM can draw a little better this next version.

Level 10

Very insightful, thanks. Lots of comments about documentation and I'll add my two cents. Documentation is truly a living, breathing, and ever-changing thing. Once it's written down and posted, it is usually already obsolete. The hard part is to maintain focus on it and keep it reasonably up to date. As for DR, I can't stress enough its importance. If you want to stay out of the headlines, you're going to want a solid DR program.

Change here is almost impossible to keep up with, but there are DR plans, which are high level, and then there are details that change frequently. For the details, we try to rely on machine-generated data that gets audited.

Level 13

In my technical writing class long ago, one assignment was to write a procedure to sharpen a pencil in one of those old wall mounted pencil sharpeners that used to be in classrooms.  The one with the rotary dial for pencil diameter.  The instructor went through every procedure the people in the class wrote and was able to fail every one.  Some didn't set the rotary dial, some didn't say which end of the pencil to insert, some didn't say which way to turn the crank.  It was enlightening.  The gist of the exercise was to show that in technical writing, you may need to assume the audience of the writing has no clue about the technology or details.

Then as I got more into computers, I realized how tough it was to explain things or write steps for my parents to do certain computer tasks.  Sometimes you can write to a more technical audience, but sometimes you need to write as if to a beginner/non-tech person.

Good points about DR. And if I was in a less volatile environment I would push for updated documentation during the CR phase to stress its importance.

Level 14

Details are really important.  I was at a client (and hadn't been there long) when they decided to do a DR failover to the backup site.  I waited for a few days whilst they planned, waiting for someone to point out what I thought was obvious.  Eventually, the day before the planned failover I had to point out that they were already running on the failover site and their plans needed to be revised as they should be doing a failback.  A small point, but it caused a delay of a week whilst they revised their plans.  And yes, it really was a big difference.

We are in a similar boat. For that I have stressed standardization, consistency, and consideration for DR beginning at the early stages. I often ask the challenge question, "How will that work in DR?" during project kickoffs and such. I am not claiming this to be 100% ironclad... but it keeps the DR conversation going, which has helped me a lot.

Level 16

I don't envy the person who has to go in and do this as their job every day. It's hard and stressful work. And kind of like working with an auditor, a lot of people don't buy into the fact that everything needs documentation, so it's an uphill battle most of the time.

MVP

Great article with lots of good questions. I've never been an auditor, but I've worked with a lot of them. (In a previous life I worked for a company that provided the fiscal agent side of things for state Medicaid.) During those 12+ years I was part of at least 2 and sometimes as many as 4 audits per year. Documentation is stressed in every conversation - "If it's not documented you are not doing it." And so the DR/CP has to be very well documented and cover every scenario that you can think of. It has to be a living document that changes as the landscape changes. That hotel you were planning to use changes hands, moves, closes - update the document. Employees leave or hire on - change the document. You didn't plan for hurricanes (living in Vermont, who'd have thunk) - change the plan.

The medical world is fraught with "additional details". The wine & spirits business is not. So I am free of those constraints. But I think we can all appreciate the necessity of those "additional details" as they can save lives. Audits are an absolute necessity in the medical world, as they are in the financial world.

MVP

I echo the sentiments of others in that keeping the documentation up to date is a full-time job, and without constant data from the responsible parties, including future plans, patching that may cause changes to the DR world, etc., the job is daunting at best.

You bring up a very good, and often overlooked, point. So many DR plans are entirely focused on failing over to DR. So very few give any consideration to failing back. When data is written to the systems while they are running in DR, that data needs to be retained. Here is our example:

Our PRD SAP database is several TB. It takes several days to perform a complete replication to DR if we start fresh. So if/when we fail over to DR and write data to it, we then have to reverse replication from DR to the PRD datacenter. That is a several-day process before we can consider running out of the PRD datacenter. And like most companies, our DR datacenter is perhaps 60-70% of the capacity of our PRD datacenter. That is a serious impact to our business.
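The "several days" figure is easy to sanity-check with back-of-the-envelope arithmetic before an exercise. This is a minimal Python sketch, assuming replication time is dominated by link throughput. The 5 TB size, 200 Mbps link, and 70% efficiency factor are hypothetical placeholders, not figures from the environment described above.

```python
def replication_hours(size_tb, throughput_mbps, efficiency=0.7):
    """Estimate hours to copy size_tb terabytes over a throughput_mbps megabit/s link.

    efficiency is an assumed fraction of the link that replication really gets
    after protocol overhead and competing traffic.
    """
    if size_tb < 0 or throughput_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, throughput > 0, efficiency in (0, 1]")
    size_bits = size_tb * 1e12 * 8  # decimal terabytes to bits
    effective_bps = throughput_mbps * 1e6 * efficiency
    return size_bits / effective_bps / 3600


# Hypothetical inputs: a 5 TB database over a 200 Mbps inter-site link.
hours = replication_hours(5, 200)
print(f"Full resync: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

With these placeholder numbers the estimate comes to roughly 79 hours, a little over three days, and the same window applies again in the reverse direction when failing back.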

It depends. I have a lengthy IT background and I am in the IT department with IT responsibilities. So I am very much in tune with the goings-on and so forth. I can be involved with the day-to-day and still be an auditor. I believe this helps me be effective. I can see how difficult it would be if the BCP were on the outside trying to implement a DR strategy.

I've done similar exercises for tying a shoe and boiling an egg. What these have in common is that these processes are hardened and tried & true. IT is the exact opposite. Technologies do not stand still. This requires SOPs to be constantly updated.

Every DR exercise I am starting at square 2. Usually because at least one IT service has been added or significantly upgraded. And rarely do these services exist in a bubble. There are dependencies aplenty. These need to be rooted out and audited.

Level 13

Don't know how I missed this the first time around.  Great post.  DR is difficult to do, especially if you are the only one really invested.  In fact, without buy-in all the way to the top I question whether you can really pull it off.  Too often the planners assume that everyone will be available (would you really come in to work if your home had just been flattened by a tornado/hurricane? - don't think so), that there will be no turnover, everything will proceed as planned, all the backups will be up to date and readable, etc etc.  You're also never really done with this process in a technical environment, because things keep changing.

MVP

As a note, every disaster is different and recovery will vary...

Planning for DR needs to incorporate a variety of solutions with documentation on how to recover or at least get back up and running until the real problem can be fixed.

Change in the environment will hamper this as the DR world is behind the curve and not in sync with the real production side so they always have gotchas or unknowns to deal with.

MVP

DR and BC are difficult to pin down and the worst part is that it never is complete. It has to be a "living" document to stay current with the changes day to day.

Level 16

Several years back I worked at a company that was 300 feet off a major highway. We had a very bad wind storm with a lot of damage. Got called in to work on the generator for the data center and the police had blocked the road and wouldn't let anyone in that wasn't a resident of that city. I wasn't a resident so they wouldn't let me through. Explained I was there to work on the generator and they let me through. Most of the IT staff was turned away.

Tabletop exercises are great for exploring these scenarios. What do you do when you have job abandonment? What are the expectations? Do you have SPoFs at key positions?

Walk it through and then talk it out. Then document your findings and submit to leadership. Schedule a follow-up meeting to review. They will provide direction and determine risk.

Hence my blog title...

Your DR strategy should not exist inside a vacuum. It should be a consideration for Change Mgmt. Every time there is a change, that change is scrutinized from a DR perspective. Regular exercises keep service owners on their toes. No one likes to fail. They will do what they can to make sure their services meet RPO/RTO.

You are correct. See my above response as it applies. Another point to add is that you have to get everyone involved. When new technology is being added or existing tech is being modified, the same question is repeatedly asked: "Will this work in DR?" This signifies adoption and ownership, which makes me eternally grateful.