
NPM: Mass Outage Alerting- Over 50 nodes down, Send Alert

Hello,

I have created an alert that notifies the necessary team members when 50 or more nodes are down, but I have run into some issues. I have already opened a support case, and support recommended that I reach out here. The general flow is that the alert triggers when 50 or more nodes (stores) are down at a given time, and resets when fewer than 50 are down. The alert is evaluated every 5 minutes and is powered by the following SQL queries.

The Trigger Query is as follows:

(I had to use TOP 1 in the SELECT statement; without it I would receive an alert email for every node that is down, and we want a single email stating that 50 or more stores are down.)

SELECT TOP 1 Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
FROM Nodes
WHERE StatusDescription LIKE 'Node Status is Down.'
GROUP BY NodeID, Caption
HAVING (SELECT COUNT(*)
        FROM Nodes
        WHERE StatusDescription LIKE 'Node Status is Down.') >= 50

The Reset Query:

(I tried modifying the reset query to include a TOP 1, but when I change the reset query it automatically removes the TOP 1 in the database.)

SELECT {TOP 1} Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
FROM Nodes
WHERE StatusDescription LIKE 'Node Status is Down.'
GROUP BY NodeID, Caption
HAVING (SELECT COUNT(*)
        FROM Nodes
        WHERE StatusDescription LIKE 'Node Status is Down.') < 50

I have tested many different options, and this seems to be the best solution. The trigger action executes successfully whenever 50 or more stores are down, and if fewer than 50 stores are down the next time the query runs (every 5 minutes), the reset action works correctly. However, if the trigger action has fired and 50 or more stores are still down on the next run, the alert sends another trigger action and we never receive the reset email. I believe this happens when the TOP 1 device changes while 50 or more stores are still down. Any help is appreciated!
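The suspected cause can be reproduced outside Orion. The following is a minimal sketch, not the real Orion schema: it uses an in-memory SQLite table as a hypothetical stand-in for Nodes, made-up captions such as store-1, and LIMIT 1 with an explicit ORDER BY in place of TOP 1 (SQLite has no TOP, and the ORDER BY is added only to make the result deterministic). It shows the trigger query returning a different NetObjectID on two polls even though the number of down nodes stays at 55.

```python
import sqlite3

# Trigger query adapted for SQLite: LIMIT 1 replaces TOP 1 (assumption for this sketch).
TRIGGER = """
SELECT NodeID AS NetObjectID, Caption AS Name
FROM Nodes
WHERE StatusDescription LIKE 'Node Status is Down.'
GROUP BY NodeID, Caption
HAVING (SELECT COUNT(*) FROM Nodes
        WHERE StatusDescription LIKE 'Node Status is Down.') >= 50
ORDER BY NodeID
LIMIT 1
"""


def make_db(down_ids, total=60):
    """Build a throwaway Nodes table in which the given node IDs are down."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, Caption TEXT, StatusDescription TEXT)"
    )
    rows = [
        (i, f"store-{i}", "Node Status is Down." if i in down_ids else "Node Status is Up.")
        for i in range(1, total + 1)
    ]
    conn.executemany("INSERT INTO Nodes VALUES (?, ?, ?)", rows)
    return conn


# Poll 1: nodes 1..55 are down. Poll 2: node 1 recovered, node 56 went down (still 55 down).
first_poll = make_db(set(range(1, 56))).execute(TRIGGER).fetchall()
second_poll = make_db(set(range(2, 57))).execute(TRIGGER).fetchall()

print(first_poll)   # [(1, 'store-1')]
print(second_poll)  # [(2, 'store-2')]
```

Because the returned object changes from node 1 to node 2, an alert engine that tracks state per NetObjectID would plausibly treat the second result as a new alert, which matches the repeated trigger emails described above.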

The next suggestion from SolarWinds support is to submit a feature request for Node Groups to expose the total number of stores down within the group. Any thoughts on this feature request?

Thanks,

Troy

  • My reply to this thread is very late, but I ran across a similar scenario and needed a solution. Hopefully this helps anyone else who runs into a similar problem. I'm sure there is a much better way to do this. In my scenario I needed to be alerted only if 2 SMTP application monitors were down simultaneously on different nodes. Furthermore, no alert should be sent if only a single application monitor was down, and only a single alert should be sent. My solution can also be applied to mass outages.

    Terms:

    Group - A collection of objects you are trying to poll for a similar status. This is not related to the groups in SolarWinds.

    Polled Node - This is a node that is either part of the group, or an independent node that just has to exist.

    Here is an example based off my application monitor scenario:

    SELECT APM_ApplicationAlertsData.ApplicationID AS NetObjectID, APM_ApplicationAlertsData.Name AS Name
    FROM APM_ApplicationAlertsData
    WHERE (
        (APM_ApplicationAlertsData.Availability = 'Down') AND
        (APM_ApplicationAlertsData.Name = 'SMTPTest') AND
        (APM_ApplicationAlertsData.ApplicationID = '323')
    )
    GROUP BY APM_ApplicationAlertsData.ApplicationID, APM_ApplicationAlertsData.Name
    HAVING (SELECT COUNT(*) FROM APM_ApplicationAlertsData
            WHERE APM_ApplicationAlertsData.Availability = 'Down' AND
                  APM_ApplicationAlertsData.Name = 'SMTPTest'
           ) = 2

    I have not modified the SELECT statement from what the Advanced Alert Manager generates. Here my "Polled Node" has the ApplicationID 323. This node is part of the "Group". I could have just as easily created a node that wasn't part of the group. If the "Polled Node" isn't part of the group we want our WHERE section to always cause the query (omitting the Having section) to return our "Polled Node".

    The HAVING clause is where the magic happens. Our SQL query won't return our "Polled Node" (even if it is down) unless exactly 2 application monitors named SMTPTest are in the down state (the HAVING test is = 2; with a larger group, >= 2 would express "at least 2").

    The reset condition can't be left at its default either. It needs its own query:

    SELECT APM_ApplicationAlertsData.ApplicationID AS NetObjectID, APM_ApplicationAlertsData.Name AS Name
    FROM APM_ApplicationAlertsData
    WHERE (
        (APM_ApplicationAlertsData.Name = 'SMTPTest') AND
        (APM_ApplicationAlertsData.ApplicationID = '323')
    )
    GROUP BY APM_ApplicationAlertsData.ApplicationID, APM_ApplicationAlertsData.Name
    HAVING (SELECT COUNT(*) FROM APM_ApplicationAlertsData
            WHERE APM_ApplicationAlertsData.Availability = 'Down' AND
                  APM_ApplicationAlertsData.Name = 'SMTPTest'
           ) != 2

    Here the WHERE clause always matches as long as that application monitor exists, but our reset SQL query won't return our "Polled Node" until the number of down SMTPTest application monitors is no longer exactly 2.
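To make the trigger and reset pair concrete, here is a minimal sketch that runs both queries against an in-memory SQLite table. The table is a hypothetical three-column stand-in for APM_ApplicationAlertsData, and the extra monitor IDs 324 and 325 are invented for illustration. It also shows a side effect of testing = 2: with three monitors down, the trigger condition is not met.

```python
import sqlite3

TRIGGER = """
SELECT ApplicationID AS NetObjectID, Name
FROM APM_ApplicationAlertsData
WHERE Availability = 'Down' AND Name = 'SMTPTest' AND ApplicationID = 323
GROUP BY ApplicationID, Name
HAVING (SELECT COUNT(*) FROM APM_ApplicationAlertsData
        WHERE Availability = 'Down' AND Name = 'SMTPTest') = 2
"""

RESET = """
SELECT ApplicationID AS NetObjectID, Name
FROM APM_ApplicationAlertsData
WHERE Name = 'SMTPTest' AND ApplicationID = 323
GROUP BY ApplicationID, Name
HAVING (SELECT COUNT(*) FROM APM_ApplicationAlertsData
        WHERE Availability = 'Down' AND Name = 'SMTPTest') != 2
"""


def evaluate(down_ids, all_ids=(323, 324, 325)):
    """Return (trigger_matches, reset_matches) for a given set of down monitors."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE APM_ApplicationAlertsData "
        "(ApplicationID INTEGER, Name TEXT, Availability TEXT)"
    )
    conn.executemany(
        "INSERT INTO APM_ApplicationAlertsData VALUES (?, ?, ?)",
        [(i, "SMTPTest", "Down" if i in down_ids else "Up") for i in all_ids],
    )
    fired = conn.execute(TRIGGER).fetchall()
    reset = conn.execute(RESET).fetchall()
    conn.close()
    return bool(fired), bool(reset)


print(evaluate({323}))            # (False, True)  one monitor down: no alert
print(evaluate({323, 324}))       # (True, False)  exactly two down: alert
print(evaluate({323, 324, 325}))  # (False, True)  three down: "= 2" no longer matches
```

If the group can hold more than two monitors, changing the comparisons to >= 2 and < 2 keeps the trigger and reset conditions complementary while matching "at least 2".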

    The concept is similar to what the original poster had, but OP is using TOP to select the first down node for his alerts. What if that node comes back up while 50 nodes are still down? Another alert will be generated about 50 nodes being down, this time on the new first node. Instead of selecting the first node, create a brand new node and name it "Mass Outage". Poll this node in the WHERE clause.

    Trigger, where the NodeID of the "Mass Outage" node is 311 (a made-up number):

    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
    FROM Nodes
    WHERE Nodes.NodeID = '311'
    GROUP BY NodeID, Caption
    HAVING (SELECT COUNT(*)
            FROM Nodes
            WHERE StatusDescription LIKE 'Node Status is Down.') >= 50

    Reset:

    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
    FROM Nodes
    WHERE Nodes.NodeID = '311'
    GROUP BY NodeID, Caption
    HAVING (SELECT COUNT(*)
            FROM Nodes
            WHERE StatusDescription LIKE 'Node Status is Down.') < 50

    Please let me know if I've got this wrong and it isn't working how I think it is. I rarely touch SQL.
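One way to sanity-check the "Mass Outage" node approach is to replay a few polls against an in-memory SQLite table. This is a sketch under stated assumptions: the table is a hypothetical stand-in for Nodes, the store captions are invented, and the small loop that remembers which objects are currently alerting is a simplified guess at how an alert engine tracks state, not SolarWinds' actual implementation.

```python
import sqlite3


def query(op):
    """Trigger (op='>=') or reset (op='<') query pinned to the sentinel node 311."""
    return f"""
    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
    FROM Nodes
    WHERE Nodes.NodeID = 311
    GROUP BY NodeID, Caption
    HAVING (SELECT COUNT(*) FROM Nodes
            WHERE StatusDescription LIKE 'Node Status is Down.') {op} 50
    """


def poll(down_ids):
    """Run both queries against a fresh table in which the given stores are down."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, Caption TEXT, StatusDescription TEXT)"
    )
    rows = [(311, "Mass Outage", "Node Status is Up.")]  # the sentinel node itself stays up
    rows += [
        (i, f"store-{i}", "Node Status is Down." if i in down_ids else "Node Status is Up.")
        for i in range(1, 61)
    ]
    conn.executemany("INSERT INTO Nodes VALUES (?, ?, ?)", rows)
    triggered = conn.execute(query(">=")).fetchall()
    reset = conn.execute(query("<")).fetchall()
    conn.close()
    return triggered, reset


events = []     # notifications that would be sent
active = set()  # NetObjectIDs with an alert currently raised (simplified engine state)

# Four polls: 49 down, 50 down, 55 down (a different set of stores), then 49 down again.
for down in (set(range(1, 50)), set(range(1, 51)), set(range(6, 61)), set(range(1, 50))):
    triggered, reset = poll(down)
    for obj_id, name in triggered:
        if obj_id not in active:
            active.add(obj_id)
            events.append(("trigger", name))
    for obj_id, name in reset:
        if obj_id in active:
            active.discard(obj_id)
            events.append(("reset", name))

print(events)  # [('trigger', 'Mass Outage'), ('reset', 'Mass Outage')]
```

Because both queries always return the same object (node 311), the changing set of down stores in the third poll does not produce a second trigger, and the reset arrives once the count drops below 50.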

  • Thanks for sharing your alerts. I'm not sure how exactly you suppress the generic application and node alerts with this.

    For your use case, you can avoid custom SQL by using custom properties and Groups.

    - Create an Application Yes/No CustomProperty called AppAlert. Set this custom property for all your application monitors to Yes except the SMTP ones for which you set them to No

    - Modify the default 'alert me when a component *' and 'alert me when an application *' alerts to add an additional trigger condition of AppAlert=Yes. This will suppress the default alerts for your SMTP monitors.

    - Create a Group containing the 2 SMTP monitors, and use the Group status of 'Best case' or the default 'Mixed mode' (if 1 of them is down, the group will be in an Up or Warning status depending on the setting, but it will be in Down status when both are down).

    - Simply use the default 'Alert me when a group is down' to be notified when both SMTP monitors are down.
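The group status behaviour described in the steps above can be sketched as a small function. This is only an illustration of the rollup as stated in this reply (Up or Warning when one member is down, Down when both are), limited to Up and Down members; the function name and mode strings are hypothetical, and the real product handles more statuses than this.

```python
def group_status(member_statuses, mode="mixed"):
    """Roll 'Up'/'Down' member statuses into a single group status.

    mode='best'  : the group shows the best member status.
    mode='mixed' : the group shows Warning when members disagree.
    """
    statuses = set(member_statuses)
    if not statuses or not statuses <= {"Up", "Down"}:
        raise ValueError("this sketch only handles non-empty lists of 'Up'/'Down'")
    if statuses == {"Down"}:
        return "Down"      # every member is down, so the group is down in either mode
    if statuses == {"Up"}:
        return "Up"
    # Members disagree: one is up and one is down.
    return "Up" if mode == "best" else "Warning"


print(group_status(["Up", "Down"], mode="best"))     # Up
print(group_status(["Up", "Down"], mode="mixed"))    # Warning
print(group_status(["Down", "Down"], mode="mixed"))  # Down
```

Either mode reaches Down only when both SMTP monitors are down, which is why the stock group-down alert is enough for this case.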

    For the node count > 50 alert, that is a really good bit of custom SQL. You will still be notified about each of the 50 nodes by the 'Alert me when a node is down' alert, but the original requirement was to notify someone else when the node down count exceeds 50, so this would be a separate alert, and it more than serves that requirement. BTW, you can also use WHERE Status=2 instead of WHERE StatusDescription LIKE 'Node Status is Down.'
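As a quick illustration of the Status=2 suggestion, the sketch below counts down nodes both ways in an in-memory SQLite table. The table layout is a hypothetical stand-in for Nodes, and the value 1 for an up node is an assumption made for this example; only Status=2 for down comes from the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, Status INTEGER, StatusDescription TEXT)"
)
# 60 hypothetical nodes: the first 52 are down (Status 2), the rest are up (assumed Status 1).
rows = [
    (i, 2, "Node Status is Down.") if i <= 52 else (i, 1, "Node Status is Up.")
    for i in range(1, 61)
]
conn.executemany("INSERT INTO Nodes VALUES (?, ?, ?)", rows)

# Count by matching the description text, as in the original queries.
by_text = conn.execute(
    "SELECT COUNT(*) FROM Nodes WHERE StatusDescription LIKE 'Node Status is Down.'"
).fetchone()[0]

# Count by comparing the integer status code instead.
by_code = conn.execute("SELECT COUNT(*) FROM Nodes WHERE Status = 2").fetchone()[0]

print(by_text, by_code)  # 52 52
```

The integer comparison does not depend on the exact wording of the description text, which is the main reason to prefer it in the HAVING subquery.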

  • My initial idea was to use groups for the SMTP monitor. However, what if the node is Up but the SMTP monitor is down (you can't connect on TCP port 25)? The node will still report Up, and therefore the group status will be Up/Mixed. If you could place application monitors into groups, that would have easily solved the issue. In our specific case the node was actually external (and therefore the group status is always external), because the SMTP server we are checking is not our own. I'm not using a generic alert, but using custom properties to distinguish sets of nodes is a good idea.

    You are correct about the "WHERE StatusDescription LIKE 'Node Status is Down.'". That came from the original poster. I simply copied, pasted, and modified what he had done as a proof of concept.

  • You can put virtually any monitored object that has a status in a group: nodes, interfaces, application monitors, component monitors, custom pollers, IP SLA operations, other groups, etc.

  • That would be a much easier way to accomplish my goal! Haha, had I known that, it would have saved me some time and effort. Thanks!