
Orion System Fault Tolerance and Disaster Recovery

Level 12

A week or so ago I wrote about the need for disaster recovery/preparedness within the enterprise. There seems to be a lot of interest in more information on this subject - especially with regard to Orion - so I wanted to add a little more content before moving on to something else.

First, with regard to disaster planning, I'm going to broadly generalize this into two main strategy types: multi-site and same-site. In a multi-site solution, a company will usually establish a secondary or backup data center which hosts the redundant systems. Usually in these cases, you are preparing for a site outage rather than a system or application outage.

In a same-site scenario, your primary and secondary systems physically reside under the same roof and your goal is to provide application or system level disaster protection. Tonight we're going to discuss the same-site scenario and within the next few days we'll discuss options for multi-site scenarios.

In a same-site scenario, a few things can generally be assumed. First, because you're planning for a system outage, it is assumed that this system being down is not necessarily indicative of a larger problem. The rest of the environment is still up and still needs to be monitored, so the secondary system needs to be able to scale to the same level as the primary system. Second, there is a much higher likelihood of this sort of system actually being used (possibly often), since something as simple as a bad hard drive or a Windows update may cause your primary system to be temporarily unavailable. Third, the switchover from primary to secondary and back to primary needs to be automated or at least quick.

When planning for Orion system redundancy, it helps to think of Orion in terms of the major components that make up the system. The main components would be:

  • The website
  • The database
  • The polling based services (main poller, custom MIB poller, Application Monitor, etc)
  • The receiver based services (SNMP traps, Syslog, NetFlow, etc)

Website
The Orion website operates as a standard, IIS-based website. So, when thinking about how to provide redundancy for this component, think about it like you would any other website. If you want to do it on the cheap, you can provide some basic redundancy through creative use of DNS. The best way is probably to front-end the website with an appliance built for this type of role. Also, in terms of what you need from SolarWinds, in addition to the licensing you need for your primary server you'll want to buy a copy of the "Orion Additional Website" to use on the redundant web server.
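To make the front-end idea concrete, here is a minimal Python sketch of the decision such a front end makes: probe each web server in priority order and send users to the first one that answers. This is not how any particular appliance works; the function names `http_ok` and `pick_web_server` are made up for illustration, and the URLs you pass in would be your own primary and additional web servers.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a successful HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a bad URL.
        return False


def pick_web_server(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool] = http_ok,
) -> Optional[str]:
    """Return the first healthy server in priority order, or None."""
    for url in servers:
        if is_healthy(url):
            return url
    return None
```

Listing the primary website first means it is preferred whenever it is up, and the additional website is only chosen when the primary fails its check.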

Database
Orion uses a standard, Microsoft SQL Server database. So, SQL clustering is the way to go in terms of providing database server redundancy. Lots of our customers are utilizing this strategy today with great results.

Polling Services
This is where it gets a little complicated. First, let me admit that our solution in this area is not as comprehensive as we'd like. Let me assure you that we're working on this, and you should expect to read more about it in the future, as Joel Dolisy, our Chief Architect, will be doing some guest-blogging for us on this subject.

With regard to the main Orion polling engine (which includes the basic alert engine), the Orion Hot Standby application meets the need for application redundancy/fault tolerance in this area. The Orion Hot Standby can actively "monitor" any number of Orion systems/pollers and take over if one of them goes down. However, remember that it can only actively impersonate one polling engine at a time, so if you have two polling engines down and only one Hot Standby, you'll lose data from one of them.
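The one-engine-at-a-time limit is easy to reason about with a few lines of Python. This is only a model of the capacity arithmetic, not of the Hot Standby product itself; `plan_takeover` and the poller names are hypothetical.

```python
from typing import List, Sequence, Tuple


def plan_takeover(
    down_pollers: Sequence[str], standby_count: int
) -> Tuple[List[str], List[str]]:
    """Split failed pollers into those a standby can cover and those it cannot.

    Each standby can impersonate exactly one polling engine at a time,
    so at most `standby_count` failed pollers are covered.
    """
    usable = max(0, standby_count)
    covered = list(down_pollers[:usable])
    uncovered = list(down_pollers[usable:])
    return covered, uncovered


covered, uncovered = plan_takeover(["poller-a", "poller-b"], standby_count=1)
print("covered:", covered)      # pollers a standby takes over
print("uncovered:", uncovered)  # pollers whose data is lost until they return
```

If you need to survive N polling engines being down at the same time, this model says you need N standbys.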

With regard to the Advanced Alert Engine, Custom Poller, Application Monitor, and VoIP module, the Hot Standby does not cover these, but for most customers this isn't an issue. The main concern for most customers is that if the Orion server goes down, they still need to be notified about critical system outages and errors, so the Hot Standby solution provides all the functionality that is required during a system outage.

However, if it is critical to have complete system redundancy, there is a solution. There are a few ways to implement this solution, so I'll stick with my favorite. In this scenario, what you want to do is build the Orion server just the way that you want it (we're just talking about the application server - not necessarily the database server or web server). Then, virtualize the server and host the VM on a separate physical machine. In the event that the primary Orion server is down, simply start up the VM and away you go. As long as the VM operates under the same machine name as the original polling server, you should have no issues. You can either do this manually or set up an automated script to do this for you. You will need a second copy of Orion and any application modules that you use for this scenario, but if you talk to your salesperson about the way that it's being used, they'll probably cut you a deal.
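For the automated script, a rough Python sketch might look like the following. Everything here is an assumption for illustration: `tcp_reachable`, `FailoverWatchdog`, the threshold of three failed checks, and the `start_standby` callback, which stands in for whatever command your virtualization platform uses to power on a VM. The point of the threshold is that one missed check (a reboot, a brief network blip) should not bring up a second machine with the same name while the primary is still alive.

```python
import socket
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverWatchdog:
    """Start the standby VM once after `threshold` consecutive failed checks."""

    def __init__(self, start_standby: Callable[[], None], threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.start_standby = start_standby  # hypothetical hook: powers on the VM
        self.threshold = threshold
        self.failures = 0
        self.standby_started = False

    def observe(self, primary_up: bool) -> bool:
        """Record one health check; return True if the standby was just started."""
        if primary_up:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.standby_started:
            self.start_standby()
            self.standby_started = True
            return True
        return False
```

In use, you would call `watchdog.observe(tcp_reachable(primary_host, port))` from a loop with a sleep between checks. Switching back to the primary is deliberately left manual in this sketch, since the VM and the primary share a machine name and should not run at the same time.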

Receiver Based Services
For receiver based services, you're basically left with the same solution as above, where you have a second server (physical or virtual) standing by to be started up if the primary system is unavailable. One key difference, though: you'll have to add the IP address of the second server as a destination for traps, Syslog messages, and NetFlow exports on the devices that send them.
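In practice the second destination is configured on each network device, but the idea is simple enough to show from the sender's side in Python: every message goes to both receivers, so whichever server is up gets a copy. The function name `send_to_all` is made up for this sketch, and the destination addresses are placeholders you would replace with your own primary and secondary servers.

```python
import socket
from typing import Iterable, Tuple


def send_to_all(payload: bytes, destinations: Iterable[Tuple[str, int]]) -> int:
    """Send one UDP datagram to every destination; return how many sends succeeded.

    UDP is fire-and-forget, so a successful send does not prove the
    receiver was listening. That is exactly why both servers are targeted.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in destinations:
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # Skip an unreachable destination and keep going.
                continue
    return sent
```

A call such as `send_to_all(b"<134>link down on Gi0/1", [(primary_ip, 514), (secondary_ip, 514)])` would deliver the same Syslog-style message to both servers, where `primary_ip` and `secondary_ip` are your own addresses.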

If this thread gets too much longer nobody will read it all so I'll end here. If you want more information or have specific questions post a comment or drop me an e-mail and we'll get you the information. We'll talk more about multi-site solutions in the next couple of days.

 

Flame on...
Josh

10 Comments
Level 9


Does anyone know what needs to be backed up on the polling nodes and the NPM/NCM node? Do you need to back them up at all? In terms of the SQL database, I know I want my database using the full recovery model. I can't seem to find any documentation on what needs to be backed up on the other nodes, though.

Level 15

Something I have not directly thought about.  What would I do if the SW failed?  We run on a physical server to ensure we can maintain monitoring in the event of a virtual environment failure.  Maybe I need to think about the hardware and how to keep it backed up.  Thanks!

It's a good article.  My environment (25 hospitals, 75 clinics, on-call 7x24) we have VM versions of NPM primary and polling, and resilient SQL.  NPM isn't yet mission-critical, but it's critical to my team being able to efficiently triage an outage, know its scope, and understand where to focus our attention to recover the most systems first.

But a hot standby and a full resilient/redundant NPM solution isn't on the books here yet.  I bet it is in larger organizations, though.

MVP

Since this was posted, SolarWinds has changed their failover architecture quite a bit. Instead of the Hot Standby, they now have the Failover Engine, which is a tailored version of Neverfail.

Level 15

I like this article. In my mind, the better approach is to have system replication, so that when a disaster happens the system doesn't stop.

MVP

Same here. I've never thought about the SolarWinds server failing. It still runs on a physical box......

MVP

I currently use a physical server with Orion loaded on the C: drive. It has a remote storage D: drive which has the NCM database on it. And the whole thing connects to a SQL cluster.

What's the best way to go to make this a virtual server? I'm happy for no redundancy if the server is virtual. But I'm not sure what's the best way to go about it.

Thanks.

MVP

Ok I just realised that this was posted all the way back in 2007

Level 20

Virtualizing it all has been a saving grace for me...

Level 12

thanks for the post