8 Replies Latest reply on Apr 24, 2018 11:26 AM by m-milligan

    SAM - Elevating Severity Alerts

    jl.lhealth

      Hey guys,

       

      I've hit a bit of a wall and I'm coming up on a deadline for getting SAM set up to replace our current monitoring solution (Icinga).  While I've been able to set up all of the basic monitors and alerts we had in the last environment, I'm struggling to get the email alerting to do what I expect, and I'm wondering if I'm building all of our alerts completely backwards or missing something obvious.

       

      I've set the trigger conditions so that we have thresholds and severities (as well as HTML emails) that send as an issue increases in severity, like so:

       

      Informational CPU:

      Trigger: 80 <= X < 90

      Reset: X < 80

       

      Warning CPU:

      Trigger: 90 <= X < 95

      Reset: X < 80

       

      Critical CPU:

      Trigger: 95 <= X

      Reset: X < 80

       

      My reasoning is that I wanted the alert to clear (and another alert email to trigger) if the CPU moved into the next threshold, but I didn't want the reset (a green "all clear" email) to fire unless the issue was actually resolved, rather than just moving from Warning to Critical.  I also wanted to keep the NOC view from showing an Informational, Warning, and Critical alert for the same machine, since we've already had the team ignore a Warning because they also saw the Informational alert in the view.
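
      To make the behavior I'm after concrete, here is a rough Python sketch. It is not SAM configuration and says nothing about how SAM evaluates alerts; it only models the emails I would expect for a series of polled CPU values. The band edges come from the thresholds above. The function names are made up for illustration, and sending an email when the severity steps back down (Critical to Warning) is an assumption on my part.

```python
from typing import List, Optional

# Band floors taken from the thresholds above, highest first.
BANDS = [(95, "Critical"), (90, "Warning"), (80, "Informational")]


def severity(cpu: float) -> Optional[str]:
    """Return the band a CPU reading falls in, or None when it is below 80."""
    for floor, name in BANDS:
        if cpu >= floor:
            return name
    return None


def emails_for(polls: List[float]) -> List[str]:
    """One email per band change, and a single all-clear once CPU is below 80."""
    sent: List[str] = []
    current: Optional[str] = None  # nothing is active before the first poll
    for cpu in polls:
        new = severity(cpu)
        if new == current:
            continue  # same band as last poll: no new email
        sent.append("All clear" if new is None else new)
        current = new
    return sent


print(emails_for([70, 85, 92, 97, 91, 60]))
# ['Informational', 'Warning', 'Critical', 'Warning', 'All clear']
print(emails_for([70, 92, 70]))
# ['Warning', 'All clear']
```

      The second call is the case where the value jumps straight past a band between polls: one Warning email, then one all-clear, with no Informational email in between.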

       

      As a note, I have tried removing the upper thresholds and only setting a reset on the lowest threshold.  This worked, but the additional alerts in the NOC view confused our team.  I've also tried adding the thresholds with only the lower reset trigger, but if we get too large a jump between polls (moving straight from OK to Warning and back to OK), we never get the reset/all clear email.  Lastly, I've considered adding an "OK" alert; however, this added clutter to both the Node/Object views (since they always had "triggered alerts") and added confusion for the NOC team when they went to the "All Alerts" view.

       

      Am I coming at this backwards, or is there a simple setting I'm missing?  The NOC view works perfectly now; however, the emails aren't behaving as I want (definitely a configuration issue on my side).  Any advice or pointers would be greatly appreciated.

       

      Thank you,

      -JD

        • Re: SAM - Elevating Severity Alerts
          rooster76

          That is something that I have wanted as well.  The escalation levels are good and all, but sometimes alerts need to have the different levels of severity.  In addition to your suggestions, I think it would be great if there was an action you could add that would modify the alert severity if an alert has been in that status for x hours.

           

          Graduating alerts for both going up and down would be really awesome.  And having a choice of which email goes out when an alert moves up or down would make alerting really powerful.

          • Re: SAM - Elevating Severity Alerts
            m-milligan

            I, too, would like SolarWinds to implement this. Can you create a feature request? That gets it on their radar.

            • Re: SAM - Elevating Severity Alerts
              mesverrum

              I've been able to work around this kind of issue by using custom SWQL queries to only show the highest severity alert of a particular type on a particular object.  It's pretty complicated, though, if you aren't a SQL type.

               

              This does raise the question, though: if the NOC response to an informational alert is to do nothing, then why put it on their screen?  I generally subscribe to a philosophy of only showing alerts that require immediate action, because I find that people are way too likely to ignore alerts unless you condition them that everything in the alerts view requires an immediate response.  Informational and warning type stuff can just get summarized into some reports.

                • Re: SAM - Elevating Severity Alerts
                  m-milligan

                  SQL and SWQL don't scare me. Would you share your queries?

                    • Re: SAM - Elevating Severity Alerts
                      mesverrum

                      This is a stripped-down example of the type of logic I've used.  In this example the alerts have the same name but with a priority on the end, so I have a "cpu alert p1" and a "cpu alert p2".  You can go on to include additional priorities if you want to type them all out.

                       

                      The p1 alert is just set up to alert when the critical threshold is exceeded, and the p2 when the warning threshold is exceeded.  Both would be active at the same time, but that clutters up the dashboard, so we want to bury the lower priority under the higher one if it triggers.

                       

                      So I left join a query that only gets the crit version of the alert, matching on the alert name with the Px part replaced out (replace was only added in 12.1 or maybe 12.2, I think, so it won't work this way in older releases).

                       

                      SELECT
                      case when p1.name is not null then p1.name
                      else o.AlertConfigurations.Name end as [Alert Name]
                      ,'/Orion/NetPerfMon/ActiveAlertDetails.aspx?NetObject=AAT:' + case when p1.name is not null then ToString(p1.AlertObjectID) else ToString(o.AlertObjectID) end AS [_LinkFor_ALERT NAME]
                      ,CASE
                      WHEN p1.name is not null THEN '/Orion/images/ActiveAlerts/Critical.png'
                      else '/Orion/images/ActiveAlerts/Warning.png'
                      END AS [_iconfor_ALERT NAME]
                      ,o.EntityCaption AS [ALERT OBJECT]
                      ,o.EntityDetailsURL AS [_LinkFor_ALERT OBJECT]
                      ,case
                      WHEN o.RelatedNodeCaption=o.EntityCaption THEN 'Self'
                      When o.RelatedNodeCaption!=o.EntityCaption THEN o.RelatedNodeCaption
                      End as [RELATED NODE]
                      ,o.RelatedNodeDetailsURL AS [_LinkFor_RELATED NODE]
                      ,ToLocal(o.AlertActive.TriggeredDateTime) AS [ALERT TRIGGER TIME]
                      -- ,o.AlertActive.TriggeredMessage AS [ALERT MESSAGE]
                      --,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_ALERT OBJECT]
                      ,'/Orion/images/StatusIcons/Small-' + p.StatusIcon AS [_IconFor_RELATED NODE]
                      ,CASE
                      when p1.name is not null and minutediff(p1.TriggeredDateTime,GETUTCDATE())>1440 then (tostring(round(minutediff(p1.TriggeredDateTime,GETUTCDATE())/1440.0,1)) + ' Days')
                      when p1.name is not null and minutediff(p1.TriggeredDateTime,GETUTCDATE())>60 then (tostring(round(minutediff(p1.TriggeredDateTime,GETUTCDATE())/60.0,1)) + ' Hours')
                      when p1.name is not null then (tostring(minutediff(p1.TriggeredDateTime,GETUTCDATE())) + ' Minutes')
                      when minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())>1440 then (tostring(round(minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())/1440.0,1)) + ' Days')
                      when minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())>60 then (tostring(round(minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())/60.0,1)) + ' Hours')
                      else (tostring(minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())) + ' Minutes')
                      end as [Time Active]
                      ,aa.AcknowledgedBy
                      ,ah.Message as [Note]
                      
                      From Orion.AlertActive aa
                      join Orion.AlertObjects o on aa.alertobjectid=o.alertobjectid
                      LEFT join Orion.Nodes p on p.nodeid=o.relatednodeid
                      left join orion.alerthistory ah on ah.AlertActiveID=aa.AlertActiveID and ah.EventType in (2,3)
                      
                      ---------------------- p1 critical alerts
                      left join (SELECT o.entitynetobjectid, o.AlertConfigurations.Name, o.AlertObjectID, o.AlertConfigurations.Severity, o.AlertActive.TriggeredDateTime
                      
                      From Orion.AlertActive aa
                      join Orion.AlertObjects o on aa.alertobjectid=o.alertobjectid
                      
                      
                      where o.AlertConfigurations.Severity = 2
                      ) p1 on p1.entitynetobjectid = o.entitynetobjectid and replace(p1.Name,'p1','') like replace(o.AlertConfigurations.Name,'p2','')
                      
                      ---------------------- p2
                      where o.AlertConfigurations.Severity = 1
                      
                      
                      ORDER by o.AlertActive.TriggeredDateTime DESC

                       

                      I'd fuss with it some more before leaving it with a client, but that's the basic idea of using case logic and some joins to allow higher priorities to replace any lower priority versions of the same alert.
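
                      If it helps to see the "bury the lower priority" idea outside of SWQL, here is a minimal Python sketch of the same logic. The rows are hypothetical stand-ins for active alerts (object id, alert name, priority), not anything returned by the Orion API. Unlike the replace() in the query, which removes 'p1'/'p2' anywhere in the name, this sketch only strips a trailing " p1" / " p2" style suffix.

```python
from typing import Dict, List, Tuple

# Hypothetical active-alert row: (object id, alert name, priority); 1 is most urgent.
Alert = Tuple[str, str, int]


def base_name(name: str) -> str:
    """Strip a trailing ' p1' / ' p2' style suffix so sibling alerts share a key."""
    head, _, tail = name.rpartition(" ")
    if head and tail[:1].lower() == "p" and tail[1:].isdigit():
        return head
    return name


def highest_only(alerts: List[Alert]) -> List[Alert]:
    """Keep only the most urgent alert per (object, base alert name)."""
    best: Dict[Tuple[str, str], Alert] = {}
    for alert in alerts:
        obj, name, prio = alert
        key = (obj, base_name(name).lower())
        if key not in best or prio < best[key][2]:
            best[key] = alert  # a more urgent sibling replaces the lower priority
    return list(best.values())


active = [
    ("N:1", "cpu alert p2", 2),
    ("N:1", "cpu alert p1", 1),
    ("N:2", "cpu alert p2", 2),
    ("N:1", "memory alert p2", 2),
]
for row in highest_only(active):
    print(row)
# ('N:1', 'cpu alert p1', 1)
# ('N:2', 'cpu alert p2', 2)
# ('N:1', 'memory alert p2', 2)
```

                      Node N:1 shows only the p1 CPU alert, while N:2 and the memory alert keep their p2 rows because no p1 sibling is active for them.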