First of all, thanks for stopping by to read this blog. I'll try to keep it interesting and provide at least a couple of new posts per week.
 
Today's topic is disaster recovery/failover. As you may have noticed, we added this blog to the community site today at noon, but I'm only now posting the first entry several hours later. That's because at around 12:05 today the location I was working from experienced a network outage, and by the time they'd resolved the problem I was already locked into my next set of meetings. Now, I don't know exactly why this happened other than that it was carrier-related. If I had to guess, I'd say that some newbie in the provisioning department accidentally provisioned part of my circuit to someone else. Either way, it got me thinking about disaster recovery, protection, and failover solutions.

As I see it, there are three important trends that amplify the need for disaster recovery planning. First, companies are relying more and more on the network as we continue to distribute systems across our WANs and even across public internet connections. Second, we are continually replacing traditional dedicated circuits with MPLS-based networks and internet-based VPN solutions. Third, because of the continuing demand for network engineering expertise (e.g. you just lost the best guy on your team to a new company) and the growing complexity of today's networks (there's just plain more to learn nowadays), many of the people we rely on to provision and maintain these networks are less qualified than the people doing these jobs 10 years ago.

The most important part of any disaster recovery methodology is simply to have a plan. Not just in your mind - have an actual written plan. Print the plan, put it in a white binder, and label the spine "Disaster Recovery Plan" in big black letters. Put a copy on your desk, give your boss a copy, and be sure that if you have a NOC they have a copy as well. I can tell you that even if what's in the binder isn't all that impressive, simply having a few copies of it around and making sure that it weighs at least as much as a nice notebook will help your career...

Now that you have decided to have a plan, the next step is to get a budget. Probably the biggest roadblock to acquiring this budget is the word "disaster". When people think about disasters, they envision hurricanes, earthquakes, and fires. Because of this, many people think that the chances of them needing a disaster recovery plan are pretty slim. Let's be clear about something - disasters are much more common than this. If you've only got one circuit leading to a site and that circuit goes down, you have a disaster. If your company has only one Exchange server and that server goes down - disaster. If your core router/switch ever goes down, guess what - disaster. When you start arguing for budget for disaster planning and mitigation, be sure to document what can happen if you don't get the budget. Point out some of the outages that have occurred over the last few years and the frustration and/or lost revenue that they caused, and this should help to ease things along.
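If it helps to put this in front of whoever holds the purse strings, here is a minimal Python sketch of the two ideas above: listing the services that depend on exactly one component, and putting a rough dollar figure on an outage. Everything in it is an assumption made up for illustration, including the inventory entries, the function names, and the cost formula (hours down times hourly revenue plus idle staff cost). Swap in your own numbers.

```python
from collections import Counter

# Hypothetical inventory: (service, component that provides it).
# A service listed once has no redundancy.
INVENTORY = [
    ("wan-link:branch-office", "circuit-a"),
    ("wan-link:headquarters", "circuit-b"),
    ("wan-link:headquarters", "circuit-c"),
    ("email", "exchange-01"),
    ("core-switching", "core-sw-01"),
]


def single_points_of_failure(inventory):
    """Return {service: component} for services backed by exactly one component."""
    counts = Counter(service for service, _ in inventory)
    return {
        service: component
        for service, component in inventory
        if counts[service] == 1
    }


def outage_cost(hours_down, revenue_per_hour, staff_affected, staff_hourly_cost):
    """Rough outage cost: lost revenue plus the cost of staff who cannot work.

    This is an assumed, deliberately simple model, not an accounting standard.
    """
    return hours_down * (revenue_per_hour + staff_affected * staff_hourly_cost)


if __name__ == "__main__":
    for service, component in single_points_of_failure(INVENTORY).items():
        print(f"{service} depends solely on {component}")
    # Made-up figures: 4 hours down, $2,500/hour revenue, 40 staff at $50/hour.
    print(f"Estimated cost of one outage: ${outage_cost(4, 2500, 40, 50):,}")
```

A list like this, paired with the real outages you've had over the last few years, tends to make the budget conversation a lot shorter.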

In the next entry we'll talk about some technical best practices with regard to planning for and around these disasters. Be sure to send me your thoughts on this topic if you have some and I'll try to get them included.

Now to pay the bills...
Since this blog is on the SolarWinds community site I'll try to add a little something to most topics to help keep the lights on. Disaster recovery planning is also very important to think about when architecting your NMS. There are several ways to make your Orion server fault tolerant and to plan for disaster recovery. One of the easiest ways is to purchase a copy of the Orion Hot Standby Server. This application sits on a physically separate box from your main Orion system and monitors the Orion database to ensure that your data collectors are healthy. If one of them goes down, it automatically picks up the load - impersonating the failed polling engine. If you don't have a copy of the Orion Hot Standby Server, ping your sales rep and ask for a quote.
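For the curious, the general hot-standby pattern described above can be sketched in a few lines of Python. To be clear, this is not how the Orion Hot Standby Server is implemented. It is a generic heartbeat check, and it assumes each polling engine records a "last heartbeat" timestamp that the standby can read. The class names, the 120-second timeout, and the engine names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Engine:
    """A polling engine and the time (in seconds) it last reported in."""
    name: str
    last_heartbeat: float


def find_failed_engines(engines: List[Engine], now: float, timeout: float) -> List[str]:
    """Return the names of engines whose last heartbeat is older than the timeout."""
    return [e.name for e in engines if now - e.last_heartbeat > timeout]


class Standby:
    """Watches engine heartbeats and takes over for the first one that goes stale."""

    def __init__(self, timeout: float = 120.0):
        self.timeout = timeout  # assumed threshold, tune to your polling interval
        self.impersonating: Optional[str] = None

    def check(self, engines: List[Engine], now: float) -> Optional[str]:
        """Run one health check; return the engine being impersonated, if any."""
        if self.impersonating is None:
            failed = find_failed_engines(engines, now, self.timeout)
            if failed:
                # Pick up the load under the failed engine's identity.
                self.impersonating = failed[0]
        return self.impersonating


if __name__ == "__main__":
    engines = [Engine("primary", 1000.0), Engine("additional-poller", 1090.0)]
    standby = Standby(timeout=120.0)
    print(standby.check(engines, now=1100.0))  # both heartbeats are fresh
    print(standby.check(engines, now=1150.0))  # primary's heartbeat is now stale
```

In practice the timestamps would come from the monitoring database and the check would run on a timer. Failing back once the original engine recovers is left out here on purpose, since that policy varies by product.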

That's it for tonight.

Flame on...

Josh