4 Replies Latest reply on Aug 10, 2016 8:51 PM by joedissmeyer

    Question on recommendations for permanently disabling email alerts during an upgrade

    joedissmeyer

      Good day everyone!

       

      My organization has a 'large' installation of Orion NPM & SAM so we have to perform upgrades on multiple poller and web console servers.

       

      Our upgrade takes a minimum of 4 hours from start to finish (even with NPM v12, the config wizard still takes a good 10 minutes to complete per execution). And with the upgrade we are installing not only the full package for both products but also the available hotfixes.

       

      Every time you install a hotfix the config wizard needs to be run as well. And every time the config wizard is executed, it usually starts up the Alert Engine and Alert Service. Our team goes in and manually disables the service in the MMC console, but there are many cases where the config wizard completely re-installs the service and starts it up automatically.

       

      My team needs to permanently stop all email alerts from Orion during the entire upgrade process then manually enable email alerts only when we are ready to do so.

       

      So here is my question:

       

      Without blacklisting our poller servers from SMTP relays, or blocking SMTP outbound in the Windows Firewall, what are your recommendations or best practices regarding this need?

        • Re: Question on recommendations for permanently disabling email alerts during an upgrade
          rjg5050

          "Unmanage" all the nodes for the expected time window of the upgrade (obviously before you start the upgrade). You may want to add some time on the back end as a "pad". That should do the trick for you.

            • Re: Question on recommendations for permanently disabling email alerts during an upgrade
              joedissmeyer

              @rjg5050, thanks for your recommendation!

               

              I agree that this is one of the possible options and we might adopt this strategy during our own upgrades regardless of the negatives.

              When un-managing a node, you choose a start and end date & time and Orion will 'stop' monitoring it temporarily. Un-managing a node pauses metric collection, but the best reason to un-manage a node is that the gap will not count against the node's availability metrics (NPM) or the availability of monitored applications (SAM). This 'skips' monitoring data that would otherwise show downtime during a polling server reboot (e.g. Windows patching) or an Orion software upgrade.

               

              Some of my notes about this option:

              • PROS:
                • Ok, let's be honest here: this is the EASY option. Very simple to do. Just un-manage everything for the default 24 hours or longer, then re-manage manually when the system is fully back online.
                • This is a good and viable option for smaller environments (1 or 2 polling engines max).
              • CONS:
                • Un-managing nodes means you are stopping monitoring on purpose for a period of time and Orion will NOT poll for metric data. This could be an issue for teams or executives that absolutely demand 24x7 monitoring (ISPs, MSPs, and financial institutions, I'm talking to you).
                • It might not be possible to un-manage every node/app through the web console in very large Orion environments with thousands of monitored nodes. In fact, the web console might time out before it completes the un-manage/re-manage task for that many nodes.
                • You still need to disable the Alerting Engine and Alerting Services on the primary poller in case any false positive alerts are generated.
              • GOTCHAS:
                • Be careful when un-managing nodes/applications in bulk, because doing this will overwrite the existing un-managed time period of any nodes/applications that are legitimately un-managed. So before un-managing the nodes in bulk, I recommend first generating a report of all existing un-managed nodes with their start and end times and saving this report. Then move forward with the bulk un-manage.
                • Availability reports could be impacted by this option. For example, when mass re-managing nodes (such as thousands of nodes at once), it takes time for the polling engines to 'catch up' with the workload, and they will also start eating up the TCP ephemeral port range (by default, Windows releases connections in the TIME_WAIT state only after 240 seconds), which WILL cause a lot of false downtime. So instead of re-managing 1000 nodes at a time, you might need to stagger it: re-manage 50, wait 5 minutes, then re-manage 50 more. Wash, rinse, repeat until all are back online.
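
For the staggered re-manage in the last gotcha, here is a rough Python sketch of the batching logic only. It is not tied to any Orion API: `remanage` is a hypothetical callable you would supply yourself (for example, a wrapper around whatever SDK or script call you already use to re-manage a single node), and the 50-node batch size and 5-minute pause are just the numbers from my notes, not tuned values.

```python
import time
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def remanage_in_batches(
    node_ids: Sequence[int],
    remanage: Callable[[int], None],  # hypothetical: re-manages ONE node
    batch_size: int = 50,
    pause_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-manage nodes one batch at a time, pausing between batches.

    Returns the number of batches processed.
    """
    batches = list(chunked(node_ids, batch_size))
    for index, batch in enumerate(batches):
        for node_id in batch:
            remanage(node_id)
        # Give the pollers time to catch up; no wait after the final batch.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return len(batches)
```

Passing `sleep` in as a parameter lets you dry-run the schedule (for example with `print` as a stand-in for `remanage`) without actually waiting 5 minutes between batches.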

               

              These are only some of my own notes so there may be more.

            • Re: Question on recommendations for permanently disabling email alerts during an upgrade
              RichardLetts

              I'd disable alert actions (it's a checkbox on the alerts page for administrators) -- the alerts still show up in the console, and once the upgrade has finished you can re-enable the alert actions.

              (There might be actions that were missed, but you could clear the instance of an alert, and when it retriggers the actions will fire again.)

               

              This way you can see the alerts that would be triggered and 'fix' your install for any that are unusual.
