11 Replies Latest reply on Jan 25, 2013 4:03 AM by lukee_cz

    Failover Engine Question on Upgrades

    mdriskell

      We are looking at implementing the Orion Failover Engine in our environment to give us an HA solution.  A question came up regarding upgrades.  With this solution, will I be able to upgrade (for example) NPM on my backup nodes, cut over to them, and then upgrade the primary nodes?

        • Re: Failover Engine Question on Upgrades
          pacetti

          Mike,

           

          Provided that all your primary and additional polling engines are updated to the same version of NPM, the Failover Engine (FoE) upgrade will follow. FoE takes care of the duplication from primary to secondary servers, so you need only ensure that your primary NPM servers are upgraded.

           

          HTH,

          Andrew

            • Re: Failover Engine Question on Upgrades
              mdriskell

              OK, that clarifies it for me, but it doesn't solve my problem.  I was looking for a scenario where I could upgrade without causing an outage.

                • Re: Failover Engine Question on Upgrades
                  Richard Nicholson

                  I really can't see how this would be possible.  Losing information sucks, but try and see if you can hit a maintenance window that has little to no activity on your network, so when you do take it offline you aren't missing vital information.  Once polling starts again your alerting engine will see any changes in up/down status and send out the proper alerts.

                   

                  I went through this issue as well, and I couldn't find a good way to avoid a total outage.  Moving nodes between pollers is cumbersome IMO for just this type of work, and you will also be writing data to SQL over and over again that you really don't need, which can just cause headaches.  More so since an Orion upgrade hits the DB to check tables while it runs.  I'm not sure how that would react while another poller is still reading from and writing to it.

                    • Re: Failover Engine Question on Upgrades
                      mdriskell

                      We are an enterprise customer, so no window works great.  I don't believe in doing too much change at once, so I typically upgrade modules individually.  Since we use everything but NTA, that means I need an outage about every month for some type of upgrade.  I'm talking about a failover engine, not additional polling engines, so in theory there would be no need to move nodes.

                       

                      Your comment that we will get proper alerts once polling starts again is correct, with one major exception: we rely heavily on SNMP traps.  Those messages are typically one-time events, so if the monitoring system is down when a trap is sent, it is lost forever (this happened during my most recent upgrade, which is what started these conversations to begin with).

                        • Re: Failover Engine Question on Upgrades
                          Richard Nicholson

                          I can understand that, but I am an enterprise customer as well, and we have a change-control maintenance window for all our infrastructure.  How do you not have one for server maintenance on your production/test/dev systems?  Surely you don't just take them down at any time for maintenance, not in an enterprise environment.

                           

                          Also, even with a failover engine you still have the issue of it writing to the DB at a time when all DB transactions to NetPerfMon are supposed to be stopped, unless you are planning on standing up a secondary DB.

                           

                          You would then need to point/add your traps to another IP address to keep those coming in.  Why not just stand up a secondary trap system that is separate from SW?  I know you would still lose your SNMP/ICMP polling for metrics, but you wouldn't lose visibility into your traps, and you would still be able to alert on them.  A true SW failover seems like a lot of extra resources and money to spend.
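
                          To make the "secondary trap system" idea concrete, here is a rough sketch of a stand-alone listener that just spools raw trap datagrams while Orion is down.  It is not a SolarWinds feature and it does not decode SNMP; the function names and the in-memory spool are made up for illustration, and it assumes your devices can be pointed at this host as an extra trap destination.

```python
import socket
import time


def open_trap_socket(host="0.0.0.0", port=162):
    """Bind a UDP socket for incoming trap datagrams.

    Port 162 is the standard SNMP trap port; binding to it usually
    requires administrator/root rights.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def spool_one(sock, spool, timeout=5.0):
    """Wait for one datagram and append (timestamp, sender_ip, raw_bytes).

    Returns True if a datagram was stored, False if the wait timed out.
    The payload is kept as raw bytes; decoding is left to another tool.
    """
    sock.settimeout(timeout)
    try:
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
    except socket.timeout:
        return False
    spool.append((time.time(), sender_ip, data))
    return True


def run_forever(spool):
    """Hypothetical main loop: keep spooling until the process is stopped."""
    with open_trap_socket() as sock:
        while True:
            if spool_one(sock, spool, timeout=60.0):
                ts, sender, raw = spool[-1]
                print(f"{ts:.0f} {sender} {len(raw)} bytes")
```

                          A real setup would write the spool to disk and decode the traps with a proper SNMP library or trap daemon; the point is only that the traps land somewhere while the main system is offline.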

                            • Re: Failover Engine Question on Upgrades
                              mdriskell

                              We do have windows, but the problem is that every upgrade takes about 2 hours because of multiple polling engines.  I'm simply looking for a way to minimize outages.  In most HA scenarios you can upgrade one side while the other is active, and I was asked today if that would be possible with this software.  We already have the HA licenses, and I'm standing up the failover in our DR datacenter.  I simply wanted to know if it could be used to upgrade one set while the other was active.  It appears that's not possible, so we will just have to deal with the outages.

                                • Re: Failover Engine Question on Upgrades
                                  Richard Nicholson

                                  I see your point.  This would be awesome if it could.  I wasn't looking at it that way, since my DR site doesn't monitor back to HQ, only the link between them and all the systems at the DR site.  With HA scenarios getting more popular each passing day, I bet we will see this at some point.

                                   

                                  Also, having to do 1 poller at a time is cumbersome as well.  I feel the pain!

                        • Re: Failover Engine Question on Upgrades
                          JC_

                          Hi Andrew,

                           

                          Do you know of many people who have used the method you're referring to: in an FoE environment, installing/upgrading only the active node, since FoE will replicate the changes to the passive node?  I'm asking because some time ago we upgraded our version of NPM and NCM on the active primary node, but the changes weren't replicated to the passive secondary node.

                           

                          We're now in a position where the primary server's apps are a different version from the secondary server's.  We haven't failed over to test, as we're not sure what will happen.

                           

                          Any ideas on how we get ourselves out of this situation?  Can we just fail over and upgrade the secondary node (not sure what would happen, considering the database was already upgraded by the primary node upgrade)?  Or do we have to reinstall all of the apps, including FoE, on the secondary server?

                           

                          I'm hoping you can help as the FoE documentation has me going around in circles.

                           

                          Thanks,

                           

                          JC