Alerts - "Condition must exist for more than" and Time of Day behaviour/anomalies.

Question

I don't think the logic behind how alerts are "run" is adequately explained anywhere, and the logic you would assume from what is presented in the alert definition doesn't work the way it should.

Background:

1) Fleet of Test/Dev servers where notifications/events are of interest only during Business Hours (Mon-Fri 07:30 to 18:30)

2) Custom Property (Environment) defined - Test/Dev servers have values of "Test" or "Development" - Prod servers have value of "Production"

3) Custom Property (Alerting_Schedule) defined - Test/Dev servers all have value "Business Hours" for this CP

4) Critical CPU alert defined - specifically covering the Test/Dev servers.

5) Critical CPU alert set up with Time of Day schedule - enabled between 07:30 and 18:30

6) Critical CPU alert:

Evaluate the trigger condition every 10 minutes

Scope of Alert:

All child conditions must be satisfied:

- Node Alerting_Schedule is equal to Business Hours

and

All child conditions must NOT be satisfied:

- Node DEVICE_FUNCTION is equal to SAN System

- Node Environment is equal to Production

Actual Trigger Condition:

All child conditions must be satisfied:

Node Critical Value Reached (CPU Load Threshold) is equal to Yes

- Condition must exist for more than 2 hours (this box is ticked, and therefore this condition is active/evaluated)

Reset Condition

Reset this alert when trigger condition is no longer true (Recommended)

Time of Day

Schedule is set:

Enable Alert during time Period

Frequency - Daily

Enable every Business Day (Mon-Fri)

From: 07:30 to 18:30

Trigger Actions:

Various Actions defined - the main one is a Log to NetPerfMon log action, which results in an Incident Ticket being raised via SNow automation

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

The issue:

Sample Server A:

CPU hits Critical Threshold (95%) on Tuesday at 13:30.

Alert will "see" this - but due to the "Condition must exist for more than 2 hours" there is no Alert Triggered (and no "visible" Alert processing that I can find).

CPU Critical Threshold still at 95% on Tuesday 15:30

Alert "sees" this - the "Condition must exist for more than 2 hours" is met (has been at 95% for 2 hours) and the Trigger Actions are executed.

We get an Alert in the "Active Alerts" console and we get an Incident Ticket raised (via Event Log - SNow automation).

All of the above occurred within the Time of Day schedule set for the Alert: a Business Day (Tuesday), between 13:30 and 15:30, which is inside the 07:30 to 18:30 window.

Sample Server B

CPU hits Critical Threshold (95%) on Tuesday at 18:15

Alert will "see" this - but due to the "Condition must exist for more than 2 hours" there is no Alert Triggered (and no "visible" Alert processing that I can find).

CPU Critical Threshold still at 95% on Tuesday 18:30

This is the "end" time of the Time of Day schedule for the alert.

Expected behaviour at this point is that the Alert evaluation would be terminated (disabled?) and no monitoring for this Alert's conditions would take place until 07:30 the next morning.

19:00 - Server is shut down (SOP for many of the Test/Dev servers) (this is the actual event flow, but the issue doesn't change regardless of whether the server is up or down)

20:15 - CPU Critical Threshold still at 95% - presumably the last polled/recorded status before server shutdown, and therefore the data in the DB that the Alert is evaluated against

Alert "sees" this - the "Condition must exist for more than 2 hours" is met (has been at 95% for 2 hours) and the Trigger Actions are executed.

We get an Alert in the "Active Alerts" console and we get an Incident Ticket raised (via Event Log - SNow automation).

So it seems that the Time of Day schedule does not reliably stop Alerts from triggering: if the "Condition must exist for more than" period runs past the schedule's end time, the Alert still fires.

This is not what is expected (when setting the Time of Day schedule) and is not described/explained anywhere that I could find.

Logic/expectation says that the Alert should be disabled at 18:30 and no occurrences should appear after that time.

I am guessing that when the original alert conditions are met (the 1st time 95% is "seen"), the alert process/"evaluation" doesn't terminate (as such) and restart at the next 10 minute cycle. Instead it forks off a background task, or something similar tied to this "occurrence"/object, which is independently evaluated every 10 mins for that specific occurrence/object, and this doesn't honour the ToD settings.
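The guess above can be made concrete with a small simulation. This is only a sketch of the hypothesis, not SolarWinds code: the function names, the `honour_window_after_first_seen` switch, the date (2 January 2024, an arbitrary Tuesday) and the alignment of the 10 minute evaluation grid are all assumptions made for illustration. "More than 2 hours" is treated as "2 hours or more", which is what the Server A timeline suggests.

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(7, 30)        # Time of Day schedule start
WINDOW_END = time(18, 30)         # Time of Day schedule end
SUSTAIN = timedelta(hours=2)      # "Condition must exist for more than"
EVAL_EVERY = timedelta(minutes=10)


def in_window(ts):
    """True on Mon-Fri between 07:30 and 18:30 (end exclusive)."""
    return ts.weekday() < 5 and WINDOW_START <= ts.time() < WINDOW_END


def first_trigger(condition_since, start, end, honour_window_after_first_seen):
    """Return the first evaluation time at which the alert fires, or None.

    Hypothetical model: the sustain timer can only start inside the window.
    If honour_window_after_first_seen is False, a timer that has already
    started keeps running after the window closes (the observed behaviour).
    If True, the pending occurrence is dropped once the window closes
    (the expected behaviour).
    """
    first_seen = None
    ts = start
    while ts <= end:
        condition_true = ts >= condition_since   # last polled value stays critical
        window_ok = in_window(ts)
        if first_seen is None:
            if condition_true and window_ok:
                first_seen = ts                  # sustain timer starts here
        elif honour_window_after_first_seen and not window_ok:
            first_seen = None                    # drop the pending occurrence
        elif ts - first_seen >= SUSTAIN:
            return ts
        ts += EVAL_EVERY
    return None


day_end = datetime(2024, 1, 2, 23, 59)

# Server A: critical from 13:30, evaluations on a grid starting at 07:30.
server_a = first_trigger(datetime(2024, 1, 2, 13, 30),
                         datetime(2024, 1, 2, 7, 30), day_end, False)

# Server B: critical from 18:15, grid assumed to start at 07:35 so that
# an evaluation lands on 18:15.
server_b_observed = first_trigger(datetime(2024, 1, 2, 18, 15),
                                  datetime(2024, 1, 2, 7, 35), day_end, False)
server_b_expected = first_trigger(datetime(2024, 1, 2, 18, 15),
                                  datetime(2024, 1, 2, 7, 35), day_end, True)

print(server_a)           # 2024-01-02 15:30:00
print(server_b_observed)  # 2024-01-02 20:15:00
print(server_b_expected)  # None
```

Under these assumptions the first model reproduces the 20:15 trigger for Server B, while the second one, where the schedule is re-checked on every evaluation, stays silent for the rest of the day.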

I realise there are workarounds for this issue (e.g. overriding the Critical Thresholds for the servers so that they are only set after 12 polling cycles, and removing the "Condition must exist for more than" from the alert etc.). I think, however, that setting the ToD schedule should work as described.
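The first workaround moves the "sustained for a while" test out of the alert and into the threshold itself. A minimal sketch of that idea, assuming a 95% threshold and 12 consecutive polls (the function name and parameters are hypothetical and this is not how the product computes thresholds internally):

```python
from collections import deque


def sustained_critical(samples, threshold=95.0, polls_required=12):
    """Yield one flag per poll: True only when the last `polls_required`
    polls (including the current one) were all at or above `threshold`."""
    recent = deque(maxlen=polls_required)   # sliding window of recent polls
    for value in samples:
        recent.append(value >= threshold)
        yield len(recent) == polls_required and all(recent)


# Eleven critical polls are not enough; the twelfth sets the flag,
# and a single poll below the threshold clears it again.
cpu = [96.0] * 11 + [97.0, 50.0, 99.0]
flags = list(sustained_critical(cpu))
print(flags[10], flags[11], flags[12], flags[13])  # False True False False
```

With the delay carried by the threshold, the alert can trigger as soon as the critical flag is seen, so there is no pending timer left to run past the end of the Time of Day window.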

Any input, advice, pointers on what I may have missed etc, are welcome.

ralphpost · Answer

Haven't tested with a condition of Node is not equal to Down - will do that to see the results but I have a suspicion that it may not change the result (if in fact the evaluation is taken from the 1st trigger trip and "remains" for the execution duration).

There are other workarounds (i.e. set a schedule on the Trigger Actions that matches the ToD schedule - this would presumably stop the Actions executing etc.) - but these kinds of workarounds (while stopping an email or log event/incident ticket) will still always result in an Alert appearing on the Active Alerts console when (in my opinion) there should not be one.

Copy on the feature request suggestion and agree unlikely to see any action for it - might get more traction if I open a Support Case perhaps?

mesverrum · Answer

Haven't tested it myself, but your testing sounds believable and lines up with other quirky behavior I have seen over the years.  I think one easy fix would be to just make it so one of your conditions is that the server status is not down, but like you said that doesn't change the fact that the time of day stuff behaves oddly.  I think the code for that stuff is like a decade old so maybe you can submit a feature request to get it fixed, but I wouldn't hold my breath that anyone is going to crack open the logic of the alert engine since a bad change in that realm could be really disruptive.