1 Reply Latest reply on Dec 5, 2017 10:55 AM by tigger2

    Alert notification

    joseph reddy

      I have set an alert so that if any of the conditions below is not met, it should alert me:

      CPU greater than 80%

      Memory greater than 80%

      Volume free space less than 20%

      Node status is down

       

      I set the condition so that it must exist for 5 minutes.

       

      I do get an alert when the node does not meet the condition.

       

      I received an email notification saying that a problem was detected on a particular node, for example:

       

      Memory: 10%

      CPU load: 5%

      Node status: down

       

      I have a couple of questions:

                How long was this node down?

                Did it stay down, or did it come back up 5 minutes later?

      Is there any way to determine these things? Also, how can we send an email notification once the node is back up?

       

      Could someone help with this?

      Thanks in advance

        • Re: Alert notification
          tigger2

          If you set the alert condition to "if any of the below conditions is not met", the node can be up with CPU > 80% and the alert will still trigger. You may not want all of this in one alert (for the reasons below): it puts a lot of separate, unrelated logic and conditions into a single alert, which can make the alert you receive a little hard to interpret.

           

          I would suggest setting up an alert for each condition (or at least just the node down one, if you want as few alerts as possible), so you would have one alert for node down, a second for CPU > 80%, another for Memory > 80%, etc.
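A rough Python sketch of the difference between one combined alert and separate alerts. The field names (`node_up`, `cpu_percent`, `memory_percent`) and both functions are made up for illustration, not Orion property names, and volumes are left out because they are better handled by their own per-volume rule:

```python
from dataclasses import dataclass


# Hypothetical snapshot of one node's polled values (illustrative names only).
@dataclass
class NodeStats:
    node_up: bool
    cpu_percent: float
    memory_percent: float


def combined_alert(s: NodeStats) -> bool:
    # One alert with "any of these" logic: any single condition fires it,
    # and the alert itself does not say which condition that was.
    return (not s.node_up) or s.cpu_percent > 80 or s.memory_percent > 80


def separate_alerts(s: NodeStats) -> list[str]:
    # One alert per condition: the name of each fired alert says what is wrong.
    fired = []
    if not s.node_up:
        fired.append("Node down")
    if s.cpu_percent > 80:
        fired.append("CPU > 80%")
    if s.memory_percent > 80:
        fired.append("Memory > 80%")
    return fired


busy_but_up = NodeStats(node_up=True, cpu_percent=95.0, memory_percent=40.0)
print(combined_alert(busy_but_up))   # True: fires even though the node is up
print(separate_alerts(busy_but_up))  # ['CPU > 80%']
```

With separate alerts, a "node down" email really means the node is down, rather than "one of several things is wrong".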

           

          To alert on "node is back up" (or on any of these conditions clearing, if they are separate alerts), there is a "reset condition" that you may need to configure (it defaults to "Reset this alert when trigger condition is no longer true"). It controls when the alert resets, and you then have to configure an action, such as the email, under the "reset action".
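The trigger/reset behaviour can be sketched as a small state machine. This is a simplified model under stated assumptions (fixed polling, the trigger action firing once the condition has held for at least 5 minutes, the reset action firing on the first poll where the condition no longer holds, as with the default reset condition), not Orion's actual implementation; `alert_events` is a hypothetical helper:

```python
from typing import Iterable, Iterator


def alert_events(samples: Iterable[tuple[float, bool]],
                 hold_minutes: float = 5.0) -> Iterator[tuple[str, float]]:
    """Yield ("TRIGGER", t) / ("RESET", t) from (minute, condition_true) polls.

    Simplified model (assumption): trigger once the condition has been true
    on every poll for at least hold_minutes; reset on the first poll where
    it is no longer true.
    """
    true_since = None  # poll time when the current run of "true" began
    triggered = False
    for t, condition in samples:
        if condition:
            if true_since is None:
                true_since = t
            if not triggered and t - true_since >= hold_minutes:
                triggered = True
                yield ("TRIGGER", t)  # trigger action, e.g. "node down" email
        else:
            if triggered:
                yield ("RESET", t)  # reset action, e.g. "node up" email
            true_since = None
            triggered = False


# Hypothetical polls every 2 minutes; the node is down from minute 2 to 12.
polls = [(0, False), (2, True), (4, True), (6, True), (8, True),
         (10, True), (12, True), (14, False), (16, False)]
print(list(alert_events(polls)))  # [('TRIGGER', 8), ('RESET', 14)]
```

In this model, a short outage that clears before the 5 minutes are up produces neither a trigger nor a reset email.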

           

          If you use separate alerts, then "how long was the node down" is the time between the "node down" alert being generated and the "node up" alert being generated. It's not exact because of polling and alerting delays, but it will be close to the node's actual down time.
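A minimal sketch of that arithmetic, assuming the node-down alert uses the "condition must exist for 5 mins" setting from the question and the default reset condition. Because the trigger waits for those 5 minutes but the reset fires as soon as the node is polled as up, the gap between the two emails probably understates the outage by about 5 minutes, so the sketch adds that back. `estimated_downtime` and the timestamps are hypothetical:

```python
from datetime import datetime, timedelta


def estimated_downtime(down_alert: datetime, up_alert: datetime,
                       hold: timedelta = timedelta(minutes=5)) -> timedelta:
    # The "node down" alert only fires after the condition has held for
    # `hold`, so the outage probably began about `hold` before that alert.
    # Polling intervals still add some error in both directions.
    return (up_alert - down_alert) + hold


down = datetime(2020, 1, 1, 9, 7)    # hypothetical "node down" email time
up = datetime(2020, 1, 1, 9, 40)     # hypothetical "node up" (reset) email time
print(estimated_downtime(down, up))  # 0:38:00
```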

           

          *If* you want to keep the single alert with many conditions, I'd suggest putting *all* of the variables that describe the status of each condition you're alerting on into the alert text (i.e. the email message body, if you're using email). Then when you get an alert you see the % CPU used, % Memory used and Node Status (up/down), and you can visually figure out which ones are issues. You can also work out the node down time from an alert where Node Status = down followed by another where Node Status = up.

          The hard part is that I don't think you'll be able to list the volumes easily: a node can have one or more volumes, and you can't easily say "list all the volumes with their % usage" in an alert unless...maybe?... you create a custom variable and write a SQL/SWIS query that returns a list of data into that variable.

          If instead you had a separate alerting rule just for volume free space < 20%, this becomes simpler: the alert is triggered for each specific volume, so a single variable can indicate which volume it is.
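Roughly what that message body (and the per-volume alternative) would contain, sketched in Python. The values are hypothetical stand-ins for the alert variables and for whatever a custom SQL/SWIS query would return; none of the names are real Orion variables:

```python
# Hypothetical values standing in for alert variables / query results.
node = {"name": "web01", "status": "Down", "cpu_percent": 5, "memory_percent": 10}
volumes = [("/", 55.0), ("/var", 12.0)]  # (volume, percent free)


def message_body(node: dict, volumes: list[tuple[str, float]]) -> str:
    # Put every condition's current value in the body so the reader can
    # see which condition(s) caused the combined alert.
    lines = [
        f"Node: {node['name']}",
        f"Status: {node['status']}",
        f"CPU load: {node['cpu_percent']}%",
        f"Memory used: {node['memory_percent']}%",
    ]
    # A node can have several volumes, so they need a list, not one value.
    lines += [f"Volume {name} free: {free:.0f}%" for name, free in volumes]
    return "\n".join(lines)


def per_volume_alerts(volumes: list[tuple[str, float]],
                      threshold: float = 20.0) -> list[str]:
    # With a separate volume rule, each low volume gets its own alert,
    # so one "volume name" value per message is enough.
    return [f"Low free space on {name}: {free:.0f}%"
            for name, free in volumes if free < threshold]


print(message_body(node, volumes))
print(per_volume_alerts(volumes))  # ['Low free space on /var: 12%']
```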
