
Before choosing a technical disaster recovery solution

Level 11

Getting right into the technical nitty-gritty of the Disaster Recovery (DR) plan is probably my favorite part of the whole process. I mean, as IT professionals this is our specialty – developing requirements, evaluating solutions, and implementing products. And while this basic process of deploying software and solutions may work great for single-task, department-type applications, we will find that in terms of DR there are many more roadblocks and challenges that pop up along the way. If we don’t properly analyze and dissect our existing production environments, or we fail to involve the key stakeholders, our DR plan will inevitably fail – and failure during a disaster event could be catastrophic to our organizations and, quite possibly, our careers.

So how do we get started?

Before even looking at software and solutions we really should have a solid handle on the requirements and expectations of our key stakeholders. If your organization already has Service Level Agreements (SLAs) then you are well on your way to completing this first step. However, if you don’t, then you have a lot of work and conversations ahead of you. In terms of disaster recovery, the SLA will drive both the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). An RPO dictates the maximum amount of data, measured in time, that an organization can afford to lose. For instance, if a service has an RPO of 4 hours, we need to ensure that no matter what happens we can always restore the service with no more than 4 hours of data loss, meaning restore points must be created on a 4-hour (or shorter) interval. An RTO dictates the maximum amount of time we have to get our service restored and running after a failure. Thus, an RTO of 4 hours means we have 4 hours to get the service up and running after the notification of a failure before we begin to seriously impact our business objectives.
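To make the two definitions concrete, here is a minimal Python sketch of how the 4-hour examples above could be checked. The function names are my own for illustration and do not come from any product. It assumes you can export restore-point timestamps from your backup tool, and it only looks at gaps between restore points, not at how old the newest one is.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(restore_points):
    """Return the largest gap between consecutive restore points.

    A failure just before the next restore point loses almost the
    whole interval, so the largest gap is the worst-case data loss.
    """
    ordered = sorted(restore_points)
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return max(gaps, default=timedelta(0))


def meets_rpo(restore_points, rpo):
    """True if no gap between restore points exceeds the RPO."""
    return worst_case_data_loss(restore_points) <= rpo


def meets_rto(failure_notified, service_restored, rto):
    """True if the service was back within the RTO after notification."""
    return service_restored - failure_notified <= rto


# Hypothetical schedule: restore points every 4 hours on one day.
points = [datetime(2024, 1, 1, hour) for hour in (0, 4, 8, 12)]
rpo_ok = meets_rpo(points, timedelta(hours=4))

# Hypothetical incident: notified at 09:00, restored at 12:30.
rto_ok = meets_rto(
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 12, 30),
    timedelta(hours=4),
)
```

Running the same check with one restore point missing from the schedule shows how a single skipped job can break a 4-hour RPO.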

Determining both RTO and RPO can become a very challenging process and really needs to involve all key stakeholders within the business. Our application owners and users will certainly always demand lower RPO and RTO values; however, IT departments can inject a bit of realism into the process once a dollar value is placed on meeting those low RPOs and RTOs. The point of the exercise, though, is to define what’s right for the organization and what can be afforded, and to create formal expectations for the organization.

Once our SLAs, RTOs, and RPOs have been determined, IT can really get started on determining a technical solution to ensure that these requirements can be met. Hopefully we can begin to see the importance of having the expectations set beforehand. For instance, if we had a mission-critical business service with an RTO of 10 minutes, we would most likely not rely on a tape backup to protect that service, as it would take much longer than that to restore from tape. Instead, we would most likely implement some form of replication. On the flip side, a file server, or more specifically the data on the file server, may have an RTO of, say, 10 hours, at which point it could be cost-effective to rely on backup to protect this service. My point is, having RTO and RPO set before beginning any technical discovery is key to getting a proper, cost-effective solution.
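The tape-versus-replication reasoning above can be sketched as a simple "cheapest option that still meets the RTO" lookup. Everything in this Python sketch is hypothetical: the option names, the estimated restore times, and the relative cost scores are placeholders you would replace with figures measured in your own environment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProtectionOption:
    name: str
    estimated_restore: timedelta  # assumption: measured in your own DR tests
    relative_cost: int            # assumption: higher means more expensive


# Hypothetical figures for illustration only.
OPTIONS = [
    ProtectionOption("tape backup", timedelta(hours=8), 1),
    ProtectionOption("disk backup", timedelta(hours=2), 2),
    ProtectionOption("replication", timedelta(minutes=5), 5),
]


def cheapest_option(
    rto: timedelta, options: Sequence[ProtectionOption] = OPTIONS
) -> Optional[ProtectionOption]:
    """Return the lowest-cost option that can restore within the RTO.

    Returns None when no option is fast enough, which is a signal to
    revisit either the RTO or the budget with the stakeholders.
    """
    candidates = [o for o in options if o.estimated_restore <= rto]
    return min(candidates, key=lambda o: o.relative_cost, default=None)


critical_service = cheapest_option(timedelta(minutes=10))
file_server = cheapest_option(timedelta(hours=10))
```

With these placeholder numbers the 10-minute RTO lands on replication and the 10-hour RTO lands on tape, which mirrors the two examples in the paragraph.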

What else is there to consider?

Ten years ago, we would be pretty much done with our preliminary work for a DR plan by simply determining RTO and RPO and could begin investigating solutions – but in today’s modern datacenters that’s simply not the case. We have a lot more at play. What about cloud? What about SaaS? What about remote workers? Today’s IT deployments don’t just operate within the four walls of our datacenters and are most often stretched into all corners of the world – and we need to protect them and adhere to our SLA policies no matter where the workload runs. What if Office 365 suddenly had an outage for 3 hours? Is this acceptable to your organization? Do we need to archive the mail somewhere else so at the very least the CEO can get that important message he needed? The same goes for workloads that may be running in public clouds like Amazon or Azure – we need to ensure that we are doing all we can to protect and restore these workloads.

The upfront work of looking at our environments holistically, determining our SLAs, and developing RTOs and RPOs really does set IT up for success when it comes time to evaluate a technical solution. Quite often we won’t find just one solution that fits our needs – and in most deployments, we will see many different solutions deployed to satisfy a well-built DR plan. We may have one solution that handles backup of cloud and another that handles on-premises workloads. We could also have one solution that replicates to the cloud, and another that moves workloads to our designated DR site. The point is that focusing most of our time on the development of RPO, RTO, and business practices lets the organization, and not IT, drive the disaster recovery process – which in turn lets IT focus on the technical deployment and the solutions built around it.

Thus far we have had two posts on developing our DR plan, both of which call for taking a step back and having discussions with our organizations before even beginning to evaluate and implement anything technical. I’d love to hear feedback on this. How do you begin your DR plans? Have you had those conversations with your organization around developing SLAs? If so, what challenges present themselves? Quite often organizations will look to IT for answers that should really be dictated by the business requirements and processes – what are your feelings on this? Leave me a comment below with your thoughts. Thanks for reading!

31 Comments

I just can't imagine retail businesses--especially small ones--having to deal with the degree of complexity to achieve these RPO's and RTO's affordably.

I also can't imagine the complexity of big business and the stock markets having to do the same.

I work in health care, and while we too cannot lose data, somehow it all seems more manageable than imagining being responsible for up time and fast recovery for Wall Street.

Level 13

Disasters come in all shapes and sizes and the DR plan needs to cater for as many eventualities as you can think of. But there is also the need to manage that in small chunks, otherwise you will never be ready for the disaster when it strikes.

MVP

Nice article

Level 20

DR/BC is hard and not very easy to really test either.  Good backups are the best place to start, I think.

Level 14

A lot of people focus on a complete disaster and the need to recover everything.  I prefer to look at each individual system and plan to recover that.  Take care of the small problems and the big problems will take care of themselves.

We have SLAs, we just ignore them as there are no repercussions.

MVP

Very good article!!! I am very proud of myself!!   I am finishing a project; I built a new site, and turned our old site into a true DR site!  

You all will laugh, when I brought up the RPO/RTO discussion with management, I could not keep a focused audience!  I don't know if it was because they did not understand, but the subject turned into, "why can't we go to the cloud?"  A fair question, but what do you want to do in the cloud?  There is a cost to doing business.  I had to convince management to go to a "private" cloud solution (my hardware) to start with, and then we could entertain a true "cloud" discussion!

I have a 15 minute RPO/RTO; unfortunately, you have to start with the "recover in minutes" price tag, so that you can all agree that 15 minutes is most acceptable!!

I am loving my job!  I have been mentored by good people guiding me in the right direction!  I am a success because of the people and resources that I have utilized!  A big thank you to SolarWinds too!  Without visibility into my environment, none of this work would have been possible.  I used SolarWinds to get the environment shipshape, getting it ready for the conversion.  I ended up with a totally virtualized environment (we were supposed to finish the virtualization this year, but the opportunity presented itself shortly before we re-located our data).  What an awesome adventure!  I have gauged my success by comments from the users.  The users are complaining about the television, about the carpeting.... not a peep about the technology!!!   That is the greatest success story that I can share!!! The second component... we did not have any events during the conversion that caused the media to show up at our doorsteps.  I am a 911 Call Center that serves up a geographical area called El Paso Texas with a single application!   It has been an educating experience and I am a much better technician and engineer because of the experience. 

Thanks for sharing the information!  I am passing the info to others, as you provided a great starting point for anyone considering DR.

MVP

Congrats zennifer​ ! 

Of course I have a soft spot for 911 call centers and fire/ems dispatch.

MVP

Thank you!  Thank you very much!!!  We cut over the end of November, I am just now able to look back and really appreciate the new facility and what all the hard work accomplished.  It was a nightmare with stress and the unknown variables that others would throw in my court!  I am my worst enemy... but I am going to give myself some credit... I hit this one out of the park!!!  (Not by myself), but I put more than my fair share of blood, sweat, and tears into these facilities!  I wish I could publish pictures!

I do greatly appreciate the kind words Jfrazier

MVP


Absolutely. We have plans per system, and those systems get ranked for importance. From there we review which have built in HA by application design, and what needs some level of assistance (automation) to meet the business needs for recovery.

Level 12

Totally agree, as this is the same thing we did in my previous company. Individual system recovery is the key, and makes it all easier. You just need a good plan to line all those systems up according to priority, so an annual/bi-annual DR simulation is always helpful. And fun!

Level 14

Yes, we have every system ranked and all the dependencies within systems also ranked so we know which servers need to be started first.  We also know which ones are vital to the organisation and which ones are important but can be done without for a day or two.  With this information we can also arrange patching cycles and backup schedules so it all ties together really well.  If they approve the £2M spend we will also soon have our on site data centres acting as active - active effectively giving us mirrored data centres and high availability on everything.  After that we will look at our backup system which is no longer fit for purpose.  The Solarwinds Backup solution does look very interesting.

https://thwack.solarwinds.com/community/solarwinds-community/product-blog/blog/2018/02/06/solarwinds...
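For anyone who wants to turn that kind of dependency ranking into a start order, here is a small Python sketch using the standard library's `graphlib` (Python 3.9+). The system names and dependencies are made up for illustration and are not the environment described in the comment above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be up first.
DEPENDS_ON = {
    "domain-controller": set(),
    "database": {"domain-controller"},
    "app-server": {"database", "domain-controller"},
    "web-frontend": {"app-server"},
}


def start_order(depends_on):
    """Return systems ordered so every dependency starts before its dependents.

    TopologicalSorter raises graphlib.CycleError if two systems depend
    on each other, which is worth finding out before a real recovery.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = start_order(DEPENDS_ON)
```

The same ordering, reversed, gives a sensible shutdown sequence for patching windows.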

I am so envious! Our datacenter is the equivalent of that house in the neighborhood that has the plaid couch with the broken leg on the front porch.

MVP

From the beginning I've been a big proponent of good documentation. To me it's much like insurance - you put a lot of time and money into it and hope that you never need it. A good DR plan and facilities are critical to most businesses - if the need arises. But alas, just like insurance, many businesses will "go without" until a disaster strikes and then they will prepare for the next time. The problem with that is often after a disaster, if there is no plan in advance, the company may not be there afterwards due to lost resources, lost customers, lost reputation, etc.

MVP

you have a front porch ?

MVP

I am 90% done with my run book.  I absolutely had to do this, because no one worked with me while I executed the technology.  I am getting ready to take a 23 day vacation in April; I WILL GO WITH PEACE OF MIND.  I have good documentation, backed up with vendor support numbers and serial numbers, so that my co-workers can phone a friend.   Every detail that you can add to your documentation will only benefit others! 

I have written so much documentation and done so many diagrams, I have often considered entering the world of technical writing.  I actually enjoy documenting, it is weird, I think I do it because I am going to forget something ... I think I use my own documentation more than others; they will have to fumble thru it in April!!!!

MVP

What are you using to build your runbook? Do you have a good way of extracting your SolarWinds data for such?

MVP

Old fashioned, and yes, exporting a generous amount of information from solarwinds to add to the document.

Word and Visio - I will put everything into a PDF format when I am done.

As I built the site, I used Excel (and still have the massive book) to break down the different subnets and services.

I have Brocade MLX, and ICX Switching, FiberChannel, and VDX switch for the data center.

Dell M1000e chassis with 13 M630 blades

NetApp FAS8040 with 60 Tb storage - Tier I, II, and III; installed first all flash array (hybrid- as it is a NetApp - used for the desktops)

2 Physical Domain Controllers

Access Control System

Camera System

TXDoT

P25 Radio System

Avaya VoIP System

Vesta 911 Phone System

UPS

Generator

Migrating my Zetron paging system to American Messaging (hardware that I will be happy to see go)

I am leveraging VEEAM and the NetApp for replication between my sites via a 10 Gbps pipe.

Absolutely every piece of hardware that I can currently get to (I still need to install a couple more firewalls) is being monitored (and maintained) by SolarWinds.  I use NPM, IPAM, SAM, VMAN, Config MAN, Netflow, DPM, Help Desk, and DameWare!!! How cool is that!!!

I spend a lot of time with the tools and applications to ensure that we can be proactive rather than reactive, and if something happens, we can recover.

As you can see, there are so many components to manage.  Each has its own section in the run book.  I have it documented down to the "red button" push... you know why... cause early on, before we had any people working here... I pushed the RED button!  It was definitely not the easy button... at that time, I got training on the electrical transfer switch that we have on site from the electric company!

If you can think of the situation, document the heck out of it!

Each one of the services above can have 1000 potential failures.  The testing of infrastructure is another time to document - every move you make, every breath you take - ... as my friend Jay from Modern Family says to Manny ... "write this down"

You can never have enough documentation.

Level 11

Absolutely.  Breaking things down into small achievable tasks always seems to make things easier!  Thanks for the comment!

Level 11

Thanks for reading!!

Level 11

Absolutely!  At the very least, having the peace of mind that you can restore your data is important!

Level 11

These are all great comments with great advice.  I for one find that breaking down your org by service also helps you to discover those little scripts and tidbits that need to be recovered alongside the services!

MVP

Don't forget to have a 3 AM friendly set of top ten issues (most common) and the solutions/steps available in an easily transportable format.

Use simple text files where possible since they can be read on any platform.
I usually have all my cheat sheets that way so I can pull them up anywhere.

You mean your datacenter doesn't?? How do you allow the fresh air in?

MVP

that only happens when one of the kids leaves the front door open.

MVP

Who has never had to open the door and vent the room!!!???  

Why just last week, in this brand new facility, we experienced an environmental issue called HEAT!

Unfortunately, I have a portable robot (a B-9 Class M-3 General Utility Non-Theorizing Environmental Control Robot)


I had learned from my past experience... (let me recommend Avtech Room Alert environmental monitors to you all)

I purchased and installed the unit in the data center well before we commissioned any equipment; we have experienced several failures since then........  I am alerted via e-mail when the system hits the threshold I set, which is 80 degrees.    Unfortunately, we have ... nowhere ever seen before, the latest and greatest in environmental chiller systems.... it was great for about 10 days this winter when the temperature outside totally cooled the water, which cooled my data center... I would rather have a system that is reliable!!! (do I sound negative....???)

MVP

Yeah, there was the one time at a prior job when they were working on the chillers....  Everything was shut down except for the clustered system that handled POS transactions.  This was during the summer in Texas.  We had the doors open and fans running...we got to within 2 degrees of the point where we would have been forced to shut those systems down.  It was pretty much a miserable shift as I was the operator on duty that weekend day.  The only other person on the floor was the security guard outside.  Most all the lights were out to reduce heat so you only had one set of safety lights in the general corner of the room to utilize. 

MVP

We use those and I have posted a custom poller for them so that I can monitor them via SolarWinds. It seems redundant to pay for their services when we are already using SolarWinds and can monitor and alert from there.

Level 14

I am 80% done with a 2+ year complete top to bottom network redesign and DR capabilities. Each step is carefully documented for safe keeping and the obligatory 3 AM "oh ****" call. The primary reason for the uber documentation: what we had was lacking or inadequate and did not survive the most destructive of disasters, the untimely death of the person who knew it best. It hasn't always been pretty but we are better for having gone through the process. Reality can be a hard taskmaster.

Level 11

Thanks for the kind words and congrats on getting the site off the ground!  I hear you on the "discussion" piece.  The hardest part is trying to get non-IT people to commit to times and numbers - especially when dollars are involved!  Service Level Agreements turn into something called Service Level Expectations.

Level 21

Another aspect of BC/DR that is often overlooked is the people.  If a disaster strikes in the area where your primary datacenter is (say a big earthquake) and that is also where your people are, who is going to do the work to restore stuff, given that most BC/DR plans are not fully automated?