How to Create an Alert if Multiple Members of a Group are down

I have a requirement to create an alert if 5 or more members of a group (excluding a couple of sub-groups) are down for more than 4 hours. Initially I tried using the GUI (see image), but when it outputs the alert to the event table it includes the event text for every affected node. This breaks our integration with our service desk application, which expects only a single set of alert text, so I've been working on a SWQL alternative.

While I can pull out the list of devices that are down (I've not included the 4h timespan yet), I haven't worked out how to set the trigger up to fire if the SWQL returns 5 or more rows. So how do I do that?

SELECT n.NodeID, n.Caption, n.Status, n.StatusDescription, n.LastSystemUpTimePollUtc, ncp.AlertingRuleSet, a.ID, c.ContainerID, c.Name
FROM Orion.Nodes n
INNER JOIN Orion.NodesCustomProperties ncp ON n.NodeID = ncp.NodeID
LEFT JOIN Orion.AlertSuppression a ON n.Uri = a.EntityUri
INNER JOIN Orion.ContainerMembers cm ON cm.MemberPrimaryID = n.NodeID
INNER JOIN Orion.Container c ON c.ContainerID = cm.ContainerID
WHERE ncp.AlertingRuleSet = 'Prod'
AND c.Name = 'Production'
AND c.Name NOT IN ('Dev','Test')
AND a.ID IS NULL
ORDER BY n.Caption

  • I found this example which I thought I could adapt to fit my use case. This runs fine and returns groups where 5 or more members are down. As you can see it has a subquery.

    SELECT Groups.Uri, Groups.DisplayName FROM Orion.Groups AS Groups
    WHERE Groups.ContainerID IN (
     SELECT g.ContainerID
     FROM Orion.Groups g
     WHERE g.Members.Status = 2 -- Down status
     AND g.Members.MemberEntityType = 'Orion.Nodes'
     GROUP BY g.ContainerID
     HAVING COUNT(g.Members.MemberPrimaryID) >= 5
    )

    That's a decent start, but I need to filter out nodes that are muted and keep only nodes that have a specific custom property. So I adapted it as below, but when I run this I get an error saying subqueries are not supported.

    SELECT Groups.Uri, Groups.DisplayName FROM Orion.Groups AS Groups
    WHERE Groups.ContainerID IN (
    SELECT g.ContainerID
    FROM Orion.Groups g
    LEFT JOIN Orion.AlertSuppression a ON a.EntityUri = g.Members.MemberUri
    INNER JOIN Orion.Container c ON c.ContainerID = g.ContainerID
    INNER JOIN Orion.NodesCustomProperties ncp ON ncp.Node.Uri = g.Members.MemberUri
    WHERE g.Members.MemberEntityType = 'Orion.Nodes'
    AND ncp.AlertingRuleSet = 'Prod'
    AND g.Name = 'Production'
    AND g.Members.Status = 2
    AND a.ID IS NULL
    GROUP BY g.ContainerID
    HAVING COUNT(g.Members.MemberPrimaryID) >= 2
    )

  • Have you tried mixing a regular GUI-type alert condition at the top with a subsection containing the SWQL at the bottom? This is as close an example as I can show, but it should work for Group alerts too. It mixes Node down AND SNMP Traps in the same alert, but should allow some more filtering for you. Maybe this can get you around some of the SWQL limitations.

    Here is the GUI main part 

    And the subsection part of the alert 

  • Thanks, that gave me a way of doing it without even having to use SWQL. First section selects the group with "condition must exist" and "alert can be triggered if" left unchecked. Second section selects nodes which have a specific custom property and has the "condition must exist" set to 4h and "alert can be triggered if more than or equal to 5". Testing with a shorter time and fewer down nodes (to match the current number of nodes that are down) confirmed it's working now.
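
For anyone reading this later, the counting rule behind that second section can be sketched in Python. This is only an illustration of the logic as described in the reply above, not how Orion evaluates alerts internally. The `NodeState` type, its field names, and `should_trigger` are made up for the example, and the sketch assumes the group membership filter from the first section has already been applied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Thresholds taken from the requirement in this thread.
MIN_DOWN_NODES = 5
MIN_DOWN_DURATION = timedelta(hours=4)


@dataclass
class NodeState:
    """Illustrative stand-in for a group member; not an Orion entity."""
    caption: str
    rule_set: str                    # stands in for the custom property value
    down_since: Optional[datetime]   # None means the node is currently up


def should_trigger(nodes, now, rule_set="Prod"):
    """Return True when enough matching nodes have been down long enough."""
    long_down = [
        n for n in nodes
        if n.rule_set == rule_set                      # custom property filter
        and n.down_since is not None                   # node is down
        and now - n.down_since >= MIN_DOWN_DURATION    # "condition must exist" for 4h
    ]
    # "alert can be triggered if more than or equal to 5"
    return len(long_down) >= MIN_DOWN_NODES


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    nodes = [NodeState(f"node{i}", "Prod", now - timedelta(hours=5)) for i in range(5)]
    print(should_trigger(nodes, now))
```

Lowering `MIN_DOWN_NODES` and `MIN_DOWN_DURATION` mirrors the testing approach mentioned above, where a shorter time and fewer down nodes were used to confirm the alert fires.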
