Welcome & Let's talk a bit about disaster recovery...

First of all, thanks for stopping by to read this blog. I'll try to keep it interesting and provide at least a couple of new posts per week.
 
Today's topic is disaster recovery/failover. As you may have noticed, we added this blog to the community site today at noon, but it's now several hours later that I'm providing the first entry. That's because at around 12:05 today the location I was working from experienced a network outage, and by the time they'd resolved the problem I was already locked into my next set of meetings. Now, I don't know exactly why this happened other than that it was carrier-related. If I had to guess, I'd say that some newbie in the provisioning department accidentally provisioned part of my circuit to someone else. Either way, it got me thinking about disaster recovery, protection, and failover solutions.

As I see it, there are three important trends that amplify the need for disaster recovery planning. First, companies are relying more and more on the network as we continue to distribute systems across our WANs and even across public internet connections. Second, we are continually replacing traditional dedicated circuits with MPLS-based networks and internet-based VPN solutions. Third, because of the continuing demand for network engineering expertise (i.e., you just lost the best guy on your team to a new company) and the growing complexity of today's networks (there's just plain more to learn nowadays), many of the people we rely on to provision and maintain these networks are less qualified than the people who were doing these jobs 10 years ago.

The most important part of any disaster recovery methodology is simply to have a plan. Not just in your mind - an actual, written-down plan. Print it off, put it in a white binder, and label the spine "Disaster Recovery Plan" in big black letters. Put a copy on your desk, give your boss a copy, and if you have a NOC, be sure they have a copy as well. I can tell you that even if what's in the binder isn't all that impressive, simply having a few copies of it around and making sure that it weighs at least as much as a nice notebook will help your career...

Now that you have decided to have a plan, the next step is to get a budget. Probably the biggest roadblock to acquiring this budget is the word "disaster". When people think about disasters, they envision hurricanes, earthquakes, and fires. Because of this, many people think that the chances of them ever needing a disaster recovery plan are pretty slim. Let's be clear about something - disasters are much more common than that. If you've only got one circuit leading to a site and that circuit goes down, you have a disaster. If your company has only one Exchange server and that server goes down - disaster. If your core router/switch ever goes down, guess what - disaster. When you start arguing for budget for disaster planning and mitigation, be sure to document what can happen if you don't get the budget. Point out some of the outages that have occurred over the last few years and the frustration and/or lost revenue that they caused; this should help ease things along.

In the next entry we'll talk about some technical best practices for planning for and around these disasters. Be sure to send me your thoughts on this topic if you have some, and I'll try to get them included.

Now to pay the bills...
Since this blog is on the SolarWinds community site, I'll try to add a little something to most topics to help keep the lights on. Disaster recovery planning is also very important to think about when architecting your NMS. There are several ways to make your Orion server fault tolerant and to plan for disaster recovery. One of the easiest is to purchase a copy of the Orion Hot Standby Server. This application sits on a physically separate box from your main Orion system and monitors the Orion database to ensure that your data collectors are healthy. If one of them goes down, it automatically picks up the load - impersonating the failed polling engine. If you don't have a copy of the Orion Hot Standby Server, ping your sales dude and have them send you a quote.
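For readers who like to see the moving parts, here's a minimal sketch of the general hot-standby pattern described above: a watchdog on a separate box that watches heartbeats and takes over when one goes stale. This is illustrative Python only, not the actual Hot Standby Server - the heartbeat table, column names, and take_over() behavior are all assumptions.

```python
# Hypothetical hot-standby watchdog sketch. NOT the Orion Hot Standby Server;
# just an illustration of the general pattern. Assumes each polling engine
# writes an ISO-format timestamp into a shared "engine_heartbeats" table.

import sqlite3          # stand-in for whatever database you actually monitor
import time
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=5)   # how stale a heartbeat may get
POLL_INTERVAL = 60                         # seconds between checks

def take_over(engine_name: str) -> None:
    """Placeholder: start polling the nodes owned by the failed engine."""
    print(f"{datetime.now()}: taking over workload of '{engine_name}'")

def watch(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    while True:
        rows = conn.execute(
            "SELECT engine_name, last_heartbeat FROM engine_heartbeats"
        ).fetchall()
        for engine_name, last_heartbeat in rows:
            last_seen = datetime.fromisoformat(last_heartbeat)
            if datetime.now() - last_seen > HEARTBEAT_TIMEOUT:
                # The primary engine looks dead; impersonate it.
                take_over(engine_name)
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    watch("heartbeats.db")
```

In a real deployment the takeover step is the hard part - assuming the failed engine's identity, node assignments, and credentials - which is exactly what a purpose-built product handles for you.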

That's it for tonight.

Flame on...

Josh 

  • Great things to think about when it comes to recovery!

  • I have to admit, DR and business continuity aren't my favorite topics when I hear them, and I'm a CISSP.

  • Funny to see how DR has changed in Orion since then; looking forward to future improvements.

  • In my environment part of DR includes implementing diverse LAN and WAN points of passage through my campus, due to the very real possibility of backhoe fade AND helicopter crashes.  Yes, multiple helicopters pass over my campus sky walks every day, sometimes passing directly overhead only a hundred feet above them.

    So, the somewhat lower-cost and easier path for Distribution and Access connections happens to be my sky walks.  But I have to have a separate path under the sidewalks for the second leg to all my switches, connected via port-channels to Distribution VSS blocks, which in turn go out via 40 Gb links to multiple data center Cores in geographically diverse locations, each with diverse entry points.

    It's easy to test these links and the LACP failover solutions--it's hitless to the end users.

    But testing DR at the server and SAN and spinning disk levels becomes an interesting challenge.  And it doesn't come cheap.

    It's been said many times, but it bears repeating:  your backups are only as good as your last restore.  Or to paraphrase, your DR may be as good as your resilience budget allows, but until you can schedule regular tests (monthly?  quarterly?), you've got no assurance your solution isn't missing one or more key pieces.  See the restore-test sketch after this comment for one way to make that a routine check.

    Design, build, and Test! 
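To make the "your backups are only as good as your last restore" point concrete, here is a minimal sketch of what a scheduled restore test might automate. Everything in it is an assumption for illustration - the restore_tool command, the backup image and manifest paths, and the manifest format are hypothetical, not any particular backup product's tooling.

```python
# Hypothetical scheduled restore-test sketch. The CLI command, paths, and
# manifest format are illustrative assumptions, not a specific vendor's API.

import hashlib
import subprocess
from pathlib import Path

BACKUP_IMAGE = Path("/backups/latest/exchange.bak")   # assumed backup location
RESTORE_TARGET = Path("/restore-test/exchange")       # scratch restore area
MANIFEST = Path("/backups/latest/manifest.sha256")    # lines of "<hash> <file>"

def run_restore() -> None:
    # Stand-in for whatever your backup product's restore CLI actually is.
    subprocess.run(
        ["restore_tool", "--image", str(BACKUP_IMAGE), "--to", str(RESTORE_TARGET)],
        check=True,
    )

def verify_restore() -> bool:
    """Compare restored files against checksums recorded at backup time."""
    ok = True
    for line in MANIFEST.read_text().splitlines():
        expected_hash, rel_path = line.split(maxsplit=1)
        restored_file = RESTORE_TARGET / rel_path
        actual_hash = hashlib.sha256(restored_file.read_bytes()).hexdigest()
        if actual_hash != expected_hash:
            print(f"MISMATCH: {rel_path}")
            ok = False
    return ok

if __name__ == "__main__":
    run_restore()
    print("restore verified" if verify_restore() else "restore FAILED verification")
```

Run something like this monthly or quarterly from a scheduler and alert on a non-zero exit, and "we think the backups are good" becomes "we restored and checked them on this date."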
