Orion shutdown & startup - avoiding warning/critical states for high packet loss?

Hey everyone!

The march of getting production set up continues on. We had to do some patching of our database last night, so I shut down the Orion app server and left it that way until this morning.

Upon restarting the database and then the app server / polling engine, most of our agent-based Windows and Linux machines reported high packet loss, and displayed as warning/critical. I know this is because the Orion poller wasn't able to do ICMP pings, although it seems like this was a higher failure rate than I was expecting. The agents are all set with default polling intervals of 2 minutes.

I was wondering: How do people avoid getting spammed with alerts when they have an extended maintenance like this?

We don't have any alerts set up right now, but I can see people wanting to alert on high packet loss, and the long polling interval means it will likely fire. I was thinking we could flip the switch on our alerts to turn them off right before the maintenance, but that seems... dangerous, especially if someone forgets to flip them back on.

There's also the mental image of seeing everything in warning/critical state if they view the web console, but I think I can educate people about that.

  • Suspend data collection or alerts for nodes in Maintenance Mode

  • Alright, so just throw the entire Orion environment in maintenance mode during the Orion maintenance?

  • Yup! If you search on THWACK, a few folks have written up great reports to track muted and unmanaged entities as well.

    Some customers use custom properties to stay organized when part of an upgrade can't be completed in the same timeframe as the rest.

    I've had one customer with dozens of geographically distributed scalability engines, and 2 of those always take much longer to upgrade than the rest of his environment. He has labeled all the monitored entities on those 2 scalability engines, so while they are down for maintenance he migrates the entities to one of the polling engines that has already been upgraded, and once the last 2 are upgraded he moves them back. That minimizes downtime and also shortens the time he needs to keep those entities in maintenance mode.
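For anyone who wants to script the maintenance-mode approach instead of clicking through the console, here is a minimal Python sketch of the idea. It only computes a padded mute window and hands the arguments to a caller-supplied `invoke` function. The assumption (please verify against your Orion SDK version) is that `invoke` behaves like the SDK client's `invoke(entity, verb, *args)` and that the `Orion.AlertSuppression` entity has a `SuppressAlerts` verb taking entity URIs plus a from/until time. The node URI, the dates, and the 10-minute settle padding are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Tuple


def maintenance_window(start: datetime, duration: timedelta,
                       settle: timedelta = timedelta(minutes=10)
                       ) -> Tuple[datetime, datetime]:
    """Return (mute_from, mute_until) in UTC.

    The end is padded by `settle` so the first few polls after the
    restart are still covered by the mute.
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    mute_from = start.astimezone(timezone.utc)
    return mute_from, mute_from + duration + settle


def mute_nodes(invoke: Callable[..., object], node_uris: Iterable[str],
               window: Tuple[datetime, datetime]) -> List[str]:
    """Ask Orion to suppress alerts for the given entities.

    ASSUMPTION: the entity name, verb name, and argument order below
    are not confirmed by this thread; check them in your SDK docs.
    """
    uris = list(node_uris)
    if uris:
        invoke("Orion.AlertSuppression", "SuppressAlerts",
               uris, window[0], window[1])
    return uris


# Dry run with a stand-in for the real client: it only records the call.
recorded = []
window = maintenance_window(
    datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc),  # hypothetical start
    timedelta(hours=8),
)
mute_nodes(lambda *args: recorded.append(args),
           ["swis://orion.example/Orion/Orion.Nodes/NodeID=1"],  # hypothetical
           window)
```

Passing `invoke` in as a parameter keeps the sketch testable without a live Orion server; in practice you would pass the real client's `invoke` method instead of the recording lambda.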

  • Thanks for the advice! I'd done some searching and all I was finding were posts from 2004-2009, which may be out of date.

  • Nice! Below are a few simple tips to avoid an alert flood:

    1. First, don't start everything in parallel in a scenario like this. Once your database is up, give it about 5 minutes, then start the Orion services and give them about 5 minutes to settle, since you're pushing everything through at once. Then start your polling engine and give it the same 5 minutes to settle. If you have multiple polling engines, start them one after the other with a 5-minute gap.

    2. Introduce a wait time on your alert. When creating an alert we usually don't set a wait period, so with devices polled every 2 minutes it fires almost immediately, around the 3rd minute. Instead, put a wait of about 5 minutes on the alert so the alert action executes only if packet loss stays high for 3 consecutive polling intervals on the device (2 minutes x 3).

    3. If you don't have the HA module in your environment, start all Orion services on the primary poller except 'SolarWinds Alerting Service V2'. Keep that one in the Stopped state until Orion settles down, then start it; no alerts will fire in the meantime. This is a manual step, unlike point 2. Go with either point 2 or point 3, not both; choose whichever you prefer.

    4. Orion also provides a generic option, and if you use it you don't need point 2 or 3: 'All Active Alerts -> More -> Pause actions of all alerts'. Select it before the maintenance, before you stop the Orion services and the database; it pauses the actions of every alert you have configured at once. Complete the change that requires shutting down the DB and the Orion app server, start everything back up, and after 5 or 10 minutes go back to the same option and uncheck 'Pause actions of all alerts', which re-enables alert actions across your environment. (Don't confuse this with what you described: there's no need to disable your alerts when you use this option. Keep them on/enabled; you are only disabling the actions associated with them, so no emails, tickets, or other alert actions fire and cause spam.)
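To make the idea behind point 2 concrete, here is a small Python sketch of a trigger that only reports "fire" after several consecutive bad polls. It is a simplified model, not Orion's alert engine: Orion's wait setting is time-based, while this counts polls, and the 50% threshold and the class and field names are assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class PacketLossTrigger:
    """Fire only after `required_polls` consecutive high-loss polls."""
    threshold_pct: float = 50.0   # hypothetical threshold
    required_polls: int = 3       # 3 polls x 2-minute interval
    _streak: int = 0              # consecutive high-loss polls seen so far

    def observe(self, packet_loss_pct: float) -> bool:
        """Record one poll result; return True if the alert should fire."""
        if packet_loss_pct >= self.threshold_pct:
            self._streak += 1
        else:
            self._streak = 0      # any healthy poll resets the count
        return self._streak >= self.required_polls


trigger = PacketLossTrigger()
results = [trigger.observe(p) for p in (100, 100, 0, 100, 100, 100)]
print(results)  # [False, False, False, False, False, True]
```

A couple of failed polls right after a restart never reach the required streak, which is why the wait period cuts down on the flood.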

    These are a few generic points you can use; hope it helps.

    I'm sure you'll get more tips from others on THWACK.

  • Thanks!

    I just noticed I don't see the polling engine in the Orion Service Manager. Is that just the SolarWinds Collector Service?

    I think I need more caffeine this morning.

  • A polling engine is an additional Orion server in your environment, not a service. If you have just one Orion app server, don't worry about it.

    So you have just one Orion server and one DB associated with it, correct?

  • Correct (for now). I was just looking at how to stage the polling engine to start later than the other services.

  • In the web console, just click 'Settings -> All Settings -> Polling Engines' to verify; in that case you'll see only your primary (main) polling engine listed.

  • If you're going with point 3, which is stopping the alerting service, just open the Orion Service Manager, start all services, and immediately (don't wait) click 'SolarWinds Alerting Service V2' and stop it. That should do the trick.

    Or you could start the other services one after another, leaving this one out, or even disable the alerting service under services.msc beforehand. Either way, choose whichever works best for you; the other option I mentioned, 'Pause actions of all alerts', would do the trick too.
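If it helps to have the staged startup written down as a runbook, here is a minimal Python sketch that builds the order with 5-minute gaps and holds the alerting service until last. It only produces a plan; it does not start or stop anything. The function name and the example service and poller names are hypothetical, and only 'SolarWinds Alerting Service V2' is taken from the thread.

```python
from typing import List, Tuple

ALERTING_SERVICE = "SolarWinds Alerting Service V2"  # name as given in the thread


def startup_plan(services: List[str], extra_pollers: List[str],
                 gap_min: int = 5) -> List[Tuple[int, str]]:
    """Return (minutes_after_start, step) pairs for a staged startup.

    Order: database, Orion services except alerting, each additional
    polling engine in turn, and the alerting service last.
    """
    plan = [(0, "start database")]
    t = gap_min
    others = [s for s in services if s != ALERTING_SERVICE]
    plan.append((t, "start Orion services: " + ", ".join(others)))
    for poller in extra_pollers:
        t += gap_min
        plan.append((t, f"start polling engine {poller}"))
    if ALERTING_SERVICE in services:
        t += gap_min
        plan.append((t, f"start {ALERTING_SERVICE}"))
    return plan


# Hypothetical service and poller names, for illustration only.
for minute, step in startup_plan(["Service A", ALERTING_SERVICE],
                                 ["APE-1", "APE-2"]):
    print(f"T+{minute:>2} min: {step}")
```

With a single Orion server the poller list is simply empty, and the plan collapses to database, services, then the alerting service.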