Orion System Fault Tolerance and Disaster Recovery

A week or so ago I wrote about the need for disaster recovery/preparedness within the enterprise. Seems like there is a lot of interest in more information on this subject - especially with regard to Orion - so I wanted to add a little more content before completely moving on to something else..

First, with regards to disaster planning, I'm going to broadly generalize this into two main strategy types: multi-site and same-site. In a multi-site solution, a company will usually establish a secondary or backup data center which hosts hosts the redundant systems. Usually in these cases, you are preparing for a site outage vs. a system or application outage.

In a same-site scenario, your primary and secondary systems physically reside under the same roof and your goal is to provide application or system level disaster protection. Tonight we're going to discuss the same-site scenario and within the next few days we'll discuss options for multi-site scenarios.

In a same-site scenario, a few things can generally be assumed. First, because you're planning for a system outage, it is assumed that this system being down is not necessarily indicative of a larger problem. Therefore, the secondary system needs to be able to scale to the same level as the primary system. Second, there is a much higher likelihood of this sort of system actually being used (possibly often) since something as simple as a bad hard drive or Windows update may cause your primary system to be temporarily unavailable. Third, the switch over from primary to secondary and back to primary needs to be automated or at least quick.

When planning for Orion system redundancy, it helps to think of Orion in terms of the major components that make-up the system. The main components would be:

  • The website
  • The database
  • The polling based services (main poller, custom MIB poller, Application Monitor, etc)
  • The receiver based services (SNMP traps, Syslog, NetFlow, etc)

Website
The Orion website operates as a standard, IIS based website. So, when thinking about how to provide redundancy for this component think about it like you would any other website. If you want to do it on the cheap you can provide some basic redundancy through creative DNS creativity or the best way is probably to front-end the website with an appliance built for this type of role. Also, in terms of what you need from SolarWinds, in addition to the licensing you need for your primary server you'll want to buy a copy of the "Orion Additional Website" to use on the redundant web server.

Database
Orion uses a standard, Microsoft SQL Server database. So, SQL clustering is the way to go in terms of providing database server redundancy. Lots of our customers are utilizing this strategy today with great results.

Polling Services
This is where it gets a little complicated. First, let me admit that our solution in this area is not as comprehensive as we'd like. Let me assure you that we're working on this and you should expect to read more about this in the future as Joel Dolisy our Chief Architect will be doing some guest-blogging for us on this subject.

With regards to the main Orion polling engine (which includes the basic alert engine), the Orion Hot Standby application meets the need for application redundancy/fault tolerance in this area. The Orion Hot Standby can actively "monitor" any number of Orion systems/pollers and take over if one of them goes down. However, remember that it can only actively impersonate one polling engine at a time, so if you have 2 polling engines down and only one Hot Standby you'll lose some data.

With regards to the Advanced Alert Engine, Custom Poller, Application Monitor, and VoIP module - for most customers, this isn't an issue. The main concern for most customers is that if the Orion server goes down they still need to be notified about critical system outages and errors so the Hot Standby solution provides all the functionality that is required during a system outage situation.

However, if it is critical to have complete system redundancy, there is a solution. There are a few ways to implement this solution, so I'll stick with my favorite. In this scenario, what you want to do is build the Orion server just the way that you want it (we're just talking about application server - not necessarily the database server or web server). Then, virtualize the server and host the VM on a separate physical machine. In the event that the primary Orion server is down - simply start-up the VM and away you go. As long as the VM operates under the same machine name as the original polling server you should have no issues. You can either do this manually or setup an automated script to do this for you. You will need a second copy of Orion and any application modules that you use for this scenario, but if you talk to you salesperson about the way that it's being used they'll probably cut you a deal.

Receiver Based Services
For receiver based services, you're basically left with the same solution as above where you have a second server (physical or virtual) standing by to be started-up if the primary system is unavailable. One key difference though - you'll have to add the IP address of the second server as a destination for traps, Syslog messages, and NetFlow exports.

If this thread gets too much longer nobody will read it all so I'll end here. If you want more information or have specific questions post a comment or drop me an e-mail and we'll get you the information. We'll talk more about multi-site solutions in the next couple of days.

 

Flame on...
Josh

Thwack - Symbolize TM, R, and C