"Unmanage" all the nodes for the time window in which you expect to do the upgrade (obviously, do this before you start the upgrade). You may want to add some time on the back end as a "pad". That should do the trick for you.
@rjg5050, thanks for your recommendation!
I agree that this is one of the possible options and we might adopt this strategy during our own upgrades regardless of the negatives.
When un-managing a node, you choose a start and end date and time, and Orion temporarily 'stops' monitoring it. Un-managing a node postpones gathering metrics, but the best reason to un-manage a node is that the gap will not count against the node's availability metrics (NPM) or the availability of monitored applications (SAM). This 'skips' the monitoring data that would otherwise show downtime during a polling server reboot (e.g. Windows patching) or an Orion software upgrade.
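To make the "window plus pad" idea concrete, here is a minimal sketch in Python that only computes the start and end times you would enter for the un-manage period. The function name `unmanage_window` and the one-hour default pad are my own assumptions for illustration, not anything Orion defines.

```python
from datetime import datetime, timedelta


def unmanage_window(upgrade_start, upgrade_end, pad=timedelta(hours=1)):
    """Return (unmanage_from, unmanage_until) for a planned upgrade.

    The pad is added only on the back end, so nodes stay un-managed
    for a while after the upgrade is expected to finish.
    """
    if upgrade_end <= upgrade_start:
        raise ValueError("upgrade_end must be after upgrade_start")
    if pad < timedelta(0):
        raise ValueError("pad must not be negative")
    return upgrade_start, upgrade_end + pad


if __name__ == "__main__":
    # Hypothetical upgrade planned for 22:00 to 02:00 with a 2-hour pad.
    start, until = unmanage_window(
        datetime(2024, 1, 13, 22, 0),
        datetime(2024, 1, 14, 2, 0),
        pad=timedelta(hours=2),
    )
    print("Un-manage from", start.isoformat(), "until", until.isoformat())
```

The resulting pair is what you would type into the un-manage dialog (or pass to whatever automation you use).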
Some of my notes about this option:
- OK, let's be honest here: this is the EASY option. It is very simple to do. Just un-manage everything for the default 24 hours or longer, then re-manage manually when the system is fully back online.
- This is a good and viable option for smaller environments (1 or 2 polling engines max).
- Un-managing nodes means you are stopping monitoring on purpose for a period of time, and Orion will NOT poll for metric data. This could be an issue for teams or executives that absolutely demand 24x7 monitoring (ISPs, MSPs, and financial institutions, I'm talking to you).
- It might not be possible to un-manage every node/app through the web console in very large Orion environments that have thousands of monitored nodes. In fact, the web console might time out before it completes the un-manage/re-manage task for that many nodes.
- You still need to disable the Alerting Engine and Alerting Services on the primary poller in case any false positive alerts are generated.
- Be careful when un-managing nodes/applications in bulk, because doing so overwrites any existing un-managed time period on nodes/applications that are legitimately un-managed. So before un-managing nodes in bulk, I recommend first generating a report of all existing un-managed nodes with their start and end times and saving it. Then move forward with
- Availability reports could be impacted by this option. For example, when mass re-managing nodes (such as thousands of nodes at once), it takes time for the polling engines to 'catch up' with the workload. The burst of polling also starts eating up the TCP ephemeral port range (by default, Windows releases connections in the TCP TIME_WAIT state only after 240 seconds) and WILL cause a lot of false downtime. So instead of re-managing 1000 nodes at a time, you might need to stagger the work: re-manage 50, wait 5 minutes, then re-manage 50 more. Wash, rinse, repeat until all are back online.
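The "50 at a time, wait 5 minutes" staggering from the last note can be sketched roughly like this. It is only an outline: `remanage` is a hypothetical callback that you would supply to do the actual re-manage call for one node, and the helper names are mine.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield list(items[i:i + size])


def remanage_staggered(
    node_ids: Sequence[int],
    remanage: Callable[[int], None],  # hypothetical: re-manages one node
    batch_size: int = 50,
    wait_seconds: float = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-manage nodes in batches, pausing between batches.

    Returns the number of nodes passed to `remanage`.
    """
    done = 0
    for index, batch in enumerate(batches(node_ids, batch_size)):
        if index > 0:
            # Give the pollers (and TIME_WAIT sockets) time to settle.
            sleep(wait_seconds)
        for node_id in batch:
            remanage(node_id)
            done += 1
    return done
```

Passing `sleep` in as a parameter is just a convenience so the pacing can be dry-run without actually waiting; the batch size and wait time are the knobs to tune for your own pollers.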
These are only some of my own notes so there may be more.
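On the note about overwriting existing un-managed periods: below is a rough sketch of saving that "before" report and then working out which nodes need their original window put back after the upgrade. It assumes you have already exported the currently un-managed nodes as rows with the keys `node_id`, `caption`, `unmanage_from` and `unmanage_until`; those key names and both function names are my own, not Orion's.

```python
import csv
import io
from datetime import datetime

FIELDS = ["node_id", "caption", "unmanage_from", "unmanage_until"]


def save_snapshot(rows, fh):
    """Write the existing un-managed windows to a CSV file object."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "node_id": row["node_id"],
            "caption": row["caption"],
            "unmanage_from": row["unmanage_from"].isoformat(),
            "unmanage_until": row["unmanage_until"].isoformat(),
        })


def windows_to_restore(fh, upgrade_until):
    """Return (node_id, from, until) for snapshot rows whose original
    un-managed period ends after the upgrade window does. Those nodes
    should get their original window re-applied, not be re-managed."""
    restore = []
    for row in csv.DictReader(fh):
        until = datetime.fromisoformat(row["unmanage_until"])
        if until > upgrade_until:
            start = datetime.fromisoformat(row["unmanage_from"])
            restore.append((int(row["node_id"]), start, until))
    return restore


if __name__ == "__main__":
    # Made-up example rows, purely for illustration.
    existing = [
        {"node_id": 7, "caption": "lab-sw-01",
         "unmanage_from": datetime(2024, 1, 10, 8, 0),
         "unmanage_until": datetime(2024, 1, 20, 8, 0)},
    ]
    buf = io.StringIO()
    save_snapshot(existing, buf)
    buf.seek(0)
    print(windows_to_restore(buf, datetime(2024, 1, 14, 4, 0)))
```

Anything the second function returns was legitimately un-managed beyond your upgrade window, so it should be excluded from the bulk re-manage.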
I'd disable alert actions (it's a checkbox on the alerts page for administrators). The alerts still show up in the console, and once the upgrade has finished you can re-enable the alert actions.
(Some actions might be missed, but you could clear the instance of an alert, and when it retriggers the actions will fire again.)
This way you can see the alerts that would be triggered and 'fix' your install for any that are unusual.
Thanks RichardLetts for sharing!
Yes, for solely disabling all email actions during a poller reboot/upgrade, disabling all alert actions is a good option.
There might be some alert actions that should be disabled, depending on the environment, but yes, this is one option that would work well. The downside is when there are hundreds of alert actions that would need to be disabled (I alone have more than 500!), but it works.