
SolarWinds HA Active/Active Solution

So I'm fully aware of the Orion Failover Engine, but that doesn't suit our needs in this scenario.  What I'm wondering is whether there are any large enterprise customers out there running an Active/Active HA solution for SolarWinds.  I need a setup where we can do maintenance and upgrades without outages (this is being dictated by upper management).  The only way I can see to do this is to set up two separate pairs of systems.

My question is: is anyone currently doing this?

And if so, how are you maintaining the data (replication, or independently configuring each system)?

I'm well aware this issue won't be felt by many customers, but in our organization SolarWinds is the Manager of Managers: in addition to our own polling we take in alerts from several other systems, and quite frankly we have to have a fully functional system at all times.

Thanks in advance.

  • I TOTALLY understand you on this....  We have to test everything before we do it, AND they prefer (and almost require) zero downtime for Orion - but with little money.  So we came up with a cheaper solution (still not funded....).

    We would have to buy licenses in duplicate (the cost becomes excessive, especially maintenance on a dev/backup system).  We would then have the production DB backed up every 12 hours and copied over the "Dev" DB, giving us an almost exact copy of the system.

    That was my idea....  Kinda kludgy...  but I can't afford much at the moment.
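
    Roughly, the 12-hour copy could be scripted something like this - just a sketch, assuming SQL Server and pyodbc, with the Dev Orion services stopped during the restore, and every server, database, path, and logical file name below is a placeholder:

        import pyodbc

        # Placeholder connection - the server name and trusted auth are assumptions.
        conn = pyodbc.connect(
            "DRIVER={SQL Server};SERVER=SQLPROD01;DATABASE=master;Trusted_Connection=yes"
        )
        conn.autocommit = True  # BACKUP/RESTORE cannot run inside a transaction
        cur = conn.cursor()

        # Copy-only backup of the production Orion DB so the normal backup
        # chain is not disturbed.
        cur.execute(
            "BACKUP DATABASE [SolarWindsOrion] "
            "TO DISK = N'\\\\backupshare\\orion\\SolarWindsOrion.bak' "
            "WITH COPY_ONLY, INIT"
        )
        while cur.nextset():
            pass  # drain the progress messages so the backup completes

        # Restore it over the Dev copy.  The logical file names in the MOVE
        # clauses are assumptions - check them with RESTORE FILELISTONLY first.
        cur.execute(
            "RESTORE DATABASE [SolarWindsOrion_Dev] "
            "FROM DISK = N'\\\\backupshare\\orion\\SolarWindsOrion.bak' "
            "WITH REPLACE, "
            "MOVE N'SolarWindsOrion' TO N'D:\\Data\\SolarWindsOrion_Dev.mdf', "
            "MOVE N'SolarWindsOrion_log' TO N'D:\\Data\\SolarWindsOrion_Dev.ldf'"
        )
        while cur.nextset():
            pass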

    What we have in place today is:

    Production: NPM, SAM, NTA, IPAM, NCM, UDT, VNQM, all unlimited licenses.

    Development: NPM 2000, SAM 700....  that's it....

    So we can use that system when we do upgrades, and we have just the critical items on the Dev system....

    Kinda lame.....

    We own the Failover Engine...  but we aren't even using it...  not enough upside when all additional pollers are VMs....

    So I agree....  an HA solution that will allow upgrades with zero downtime needs to be on the roadmap....

  • So does the Dev system have the same naming convention as your prod system?  I assume not, so how do you handle the changes after replication?

  • wow....   thanks for ruining my whole idea....  :-)

    But you're right, that will break the whole thing....  Back to the drawing board....

    That's what happens when you're too busy to think.....  I come up with bad ideas....   (Can I delete that and act like I never thought of it?  :-) )

  • No, you can't delete it, because an active discussion to try and come up with a solution is exactly what I need.  I wonder if it would be possible to have some kind of stored procedure that updates the engine IDs before inserting the replicated data, to "trick" the system into thinking the nodes are its own (a rough sketch of that follows at the end of this reply).

    Also, in your scenario the other issue you will have is how to handle the fact that you are importing more nodes than you are licensed for.
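
    Something like this is what I have in mind for the remapping - just a rough sketch, assuming the Nodes table still carries an EngineID column and that nothing else in the schema cares (which is the part I wouldn't trust without SolarWinds confirming it); the engine ID mapping and connection details are made up:

        import pyodbc

        # Map production polling-engine IDs to the engine IDs that exist on
        # the Dev side.  Both sides of this dict are made up for the example.
        ENGINE_ID_MAP = {1: 1, 2: 3, 3: 4}

        conn = pyodbc.connect(
            "DRIVER={SQL Server};SERVER=SQLDEV01;DATABASE=SolarWindsOrion_Dev;"
            "Trusted_Connection=yes"
        )
        cur = conn.cursor()

        # Remap in a single UPDATE so a chained remap (2 -> 3, then 3 -> 4)
        # cannot accidentally move the same rows twice.
        case_sql = " ".join(f"WHEN {p} THEN {d}" for p, d in ENGINE_ID_MAP.items())
        cur.execute(
            f"UPDATE dbo.Nodes SET EngineID = CASE EngineID {case_sql} "
            "ELSE EngineID END"
        )

        conn.commit()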

  • This is my idea of how it could be accomplished - not that this is supported today, but it could be how they move toward it.

    Things that need to be supported:

    A way to promote a secondary poller to a main poller.

    The ability to keep functioning on mismatched versions during the upgrade process.  If we upgrade in order (10.0 -> 10.1 -> 10.2 -> 10.3 -> 10.4) we should be able to keep running.  Whatever order they want to support - main then pollers, or pollers then main - backward compatibility for at least one version needs to be there.

    Now, other than the database, you would have an environment with close to no downtime.

    If you need to do maintenance on:

      one poller, you can just move all the nodes to another poller, do the maintenance, and re-balance afterwards (see the sketch after this list).

      the web page, you can have the users hit a CNAME that points to the server running the page and use DNS to change where users are pointed.

      the main poller, you would promote a secondary to be the main.

    It seems to me the biggest issue is the database... there is only one, so how do you do maintenance on that?
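
    For the node-move step, something along these lines could script it through the SolarWinds Information Service - a rough sketch only, assuming the Orion SDK's Python client (orionsdk) is available in your environment; the hostnames, credentials, and engine IDs are placeholders, and whether a plain EngineID update is enough for every installed module is a question for support:

        import requests
        from orionsdk import SwisClient

        requests.packages.urllib3.disable_warnings()  # Orion often runs a self-signed cert

        # Placeholder connection details for the main Orion server.
        swis = SwisClient("orion.example.local", "admin", "password")

        SOURCE_ENGINE_ID = 2   # poller going down for maintenance (assumed ID)
        TARGET_ENGINE_ID = 3   # poller that will pick up the load (assumed ID)

        # Find every node currently assigned to the poller we want to empty.
        results = swis.query(
            "SELECT Uri FROM Orion.Nodes WHERE EngineID = @engine",
            engine=SOURCE_ENGINE_ID,
        )

        # Re-point each node at the surviving poller; re-balance later by
        # running the same script in the other direction.
        for row in results["results"]:
            swis.update(row["Uri"], EngineID=TARGET_ENGINE_ID)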

  • In regards to the DB, that is a big question.  Every repair or upgrade requires the Configuration Wizard to "talk" to the DB.  Not sure how you could get around that.

    I definitely like the idea of being able to repair a poller independently; that would be great.  The big problem right now is that if poller 1 is in trouble, everything is down while you do a repair.

    The reason I'm looking into this is that we are a large enterprise customer with every Orion module installed, with the one exception of NTA.  A new version/patch/hotfix comes out every month for one of them, and the amount of downtime has been noticed and is becoming a problem.

  • The only thing I don't like about moving nodes from poller to poller is that I have my Production/Test/Dev environments separated by poller and have adjusted the polling rates based on my needs in each environment.  As of right now you can't set polling rates per poller, so this was a lot of custom work to complete.

    That being said, we should be able to adjust polling rates on a given poller and not just at the global level.

    After that we go back to the DB, and the fact that every time you upgrade, the tables are written to with changes and require that no active data be written from SolarWinds at that time.  Great discussion though, because I always get that question from management:  So... what's monitoring SolarWinds??

    Question: is the FOE application-level failover, and does it function like I describe below if you set it up this way?

    So in this case you have 2 instances of SolarWinds, 1 Master and 1 Slave, and the same with the DBs, 1 Master and 1 Slave.  The Master replicates any changes to the Slave, and the Slave sits dormant but sends keepalives to the Master to watch availability.  This would allow you to kick on the Slave (already up to date via replication) and write to the Slave DB.  Then you upgrade the Master along with the Master DB, push a replication of any changes back to the Master and Master DB, flip the Master back on, repeat the process for the Slave, and then force a new replication.  BAM!! DISCO!
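
    To illustrate the keepalive piece of that flow, here is a minimal sketch of the kind of check the Slave could run against the Master's web console - the URL and thresholds are made up, and this is not how FOE itself does it:

        import time
        import requests

        MASTER_URL = "https://orion-master.example.local/Orion/Login.aspx"  # placeholder
        FAILURES_BEFORE_TAKEOVER = 3
        CHECK_INTERVAL_SECONDS = 30

        failures = 0
        while True:
            try:
                # verify=False only because Orion often runs a self-signed cert
                resp = requests.get(MASTER_URL, timeout=10, verify=False)
                resp.raise_for_status()
                failures = 0
            except requests.RequestException:
                failures += 1
                print(f"Master check failed ({failures}/{FAILURES_BEFORE_TAKEOVER})")

            if failures >= FAILURES_BEFORE_TAKEOVER:
                # Placeholder for whatever actually activates the Slave:
                # start the Slave's Orion services, swing the DNS CNAME, etc.
                print("Master looks down - time to kick on the Slave")
                break

            time.sleep(CHECK_INTERVAL_SECONDS)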