
Node DOWN/Node Reboot Alert for 3,000 servers

Hi Guys

We have many servers to monitor and, at the moment, with SAM deployed out of the box, there is no logical categorisation or grouping of servers that supports manageable alerting. I have been tasked with categorising the environment, but for now I need a couple of alerts in place while that work is being done.

REBOOTS

We have a group of 900 application servers (Citrix) that are rebooted throughout the night, every night, so I don't want to be notified of these reboots. How do I create a Node Reboot alert that excludes these 900 servers during the night?

I also want the Node Reboot alert to fire only if the node has been rebooted and has not come back up after 5-10 minutes, effectively becoming a Node DOWN.

Ideally I need to know when any/all nodes are rebooted regardless of the above but at the moment this isn't feasible.

Thank you in advance for your help.

  • To exclude the Citrix servers, you can define specifics in the trigger conditions. Depending on your environment, you may have to use multiple conditions. For example, if the names all contain CTX, you can put that string in the exclusions.


    To trigger the alert only after the 5-10 minute timeframe, you just need to specify that the condition must exist for more than X minutes.


    One thing to note is that you would also need to poll frequently. If you are only polling every 10 minutes, the box could be down for 20 minutes before you get alerted (10-minute polling interval plus 10-minute condition).
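The polling arithmetic above can be sketched as a small calculation. This is only a rough model, assuming the outage is first noticed at the next poll and the alert fires once the down state has lasted for the configured duration; the function name `detection_window` is made up for illustration, and the alert engine's own evaluation frequency is ignored.

```python
def detection_window(poll_interval_min, condition_min):
    """Return (best_case, worst_case) minutes from outage to alert.

    Simplified model (an assumption, not Orion's documented behaviour):
    the node is seen as down at the first poll after it fails, and the
    alert fires once that state has persisted for condition_min.
    """
    # Best case: the node fails just before a poll, so only the
    # "condition must exist" duration has to elapse.
    best_case = condition_min
    # Worst case: the node fails just after a poll, so a full polling
    # interval passes before the outage is even noticed.
    worst_case = poll_interval_min + condition_min
    return best_case, worst_case


print(detection_window(10, 10))  # (10, 20)
print(detection_window(3, 6))    # (6, 9)
```

Under this model, shortening the polling interval tightens the worst case more directly than shortening the condition duration does.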

  • The node reboot alert essentially just looks at the device boot time and triggers whenever it changes. Building a timeout condition would involve much more complex SQL logic that checks the new boot time, looks for down events in the previous 5-10 minutes, and triggers only when there was no down event. Doable, but definitely not user-friendly to implement.

  • There is a 'Lastboot' option available. Unfortunately, I don't think you will be able to easily tie it together with a down event, because in order to change/trigger 'Lastboot', Orion needs to be able to read that value from the node using SNMP/WMI. If the node is down, it will not be responding to either.


    I would agree with mprobus that simply changing the polling time, and then the number of minutes before the alert triggers, would be the easiest. Polling every 3 minutes and setting the alert's 'condition must exist' to 6 minutes would get you down to 6 minutes and two consecutive polls before it triggers.
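The suggestions in this thread (exclude by name, then require the down state to persist) could be expressed roughly as follows. This is a plain-Python illustration of the logic, not Orion's alert engine or API: the `Node` class, the `should_alert` function, the case-insensitive match, and the sample node names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Node:
    name: str
    down_since: Optional[datetime]  # None while the node is up


def should_alert(node, now, excluded_substrings=("CTX",),
                 min_down=timedelta(minutes=6)):
    """Return True if a Node DOWN alert should fire for this node."""
    # Exclusion by name, e.g. Citrix servers whose names contain "CTX".
    # Case-insensitive matching is an assumption made for this sketch.
    if any(s.lower() in node.name.lower() for s in excluded_substrings):
        return False
    # A node that is currently up never triggers.
    if node.down_since is None:
        return False
    # "Condition must exist for more than X minutes."
    return now - node.down_since >= min_down


now = datetime(2020, 1, 1, 2, 0)
print(should_alert(Node("APP01", now - timedelta(minutes=7)), now))         # True
print(should_alert(Node("CTX-APP-017", now - timedelta(minutes=30)), now))  # False
print(should_alert(Node("APP02", now - timedelta(minutes=2)), now))         # False
```

A nightly Citrix reboot that completes within a few minutes never reaches `min_down`, so even without the name exclusion it would not trigger; the exclusion simply guards against slower reboots in that group.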