I don't think the logic behind how alerts are "run" is adequately explained anywhere, and the logic you would assume from what is presented in the alert definition doesn't work the way it should.
Background:
1) Fleet of Test/Dev servers where notifications/events are of interest only during Business Hours (Mon-Fri 07:30 to 18:30)
2) Custom Property (Environment) defined - Test/Dev servers have values of "Test" or "Development" - Prod servers have value of "Production"
3) Custom Property (Alerting_Schedule) defined - Test/Dev servers all have value "Business Hours" for this CP
4) Critical CPU alert defined - specifically covering the Test/Dev servers.
5) Critical CPU alert set up with Time of Day schedule - enabled between 07:30 and 18:30
6) Critical CPU alert:
Evaluate the trigger condition every 10 minutes
Scope of Alert:
All child conditions must be satisfied:
- Node Alerting_Schedule is equal to Business Hours
and
All child conditions must NOT be satisfied:
- Node DEVICE_FUNCTION is equal to SAN System
- Node Environment is equal to Production
Actual Trigger Condition:
All child conditions must be satisfied:
Node Critical Value Reached (CPU Load Threshold) is equal to Yes
- Condition must exist for more than 2 hours (this box is ticked and therefore this condition is active/evaluated)
Reset Condition
Reset this alert when trigger condition is no longer true (Recommended)
Time of Day
Schedule is set:
Enable Alert during time Period
Frequency - Daily
Enable every Business Day (Mon-Fri)
From: 07:30 to 18:30
Trigger Actions:
Various Actions defined - the main one is a Log to NetPerfMon log, which results in an Incident Ticket being raised via SNow automation
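To make the intended gating explicit, here is a minimal Python sketch of how I read the scope and Time of Day settings above. It is only a model of the configuration, not how the product implements it. The function names and the dictionary representation of a node are hypothetical. It also rests on two assumptions: that "All child conditions must NOT be satisfied" means every listed condition must be false, and that the 18:30 end time is exclusive.

```python
from datetime import datetime, time

# Time of Day schedule from the alert definition: Mon-Fri, 07:30 to 18:30.
WINDOW_START = time(7, 30)
WINDOW_END = time(18, 30)


def in_schedule(ts: datetime) -> bool:
    """True if ts falls on a business day inside the enabled window.

    Assumption: the end time is treated as exclusive.
    """
    is_business_day = ts.weekday() < 5  # Monday=0 .. Friday=4
    return is_business_day and WINDOW_START <= ts.time() < WINDOW_END


def in_scope(node: dict) -> bool:
    """True if the node matches the Scope of Alert.

    Assumption: the NOT block requires every child condition to be false.
    """
    included = node.get("Alerting_Schedule") == "Business Hours"
    not_san = node.get("DEVICE_FUNCTION") != "SAN System"
    not_prod = node.get("Environment") != "Production"
    return included and not_san and not_prod
```

Under this reading, a Test server with Alerting_Schedule set to "Business Hours" is in scope, and any evaluation at or after 18:30 falls outside the schedule.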
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
The issue:
Sample Server A:
CPU hits Critical Threshold (95%) on Tuesday at 13:30.
Alert will "see" this - but due to the "Condition must exist for more than 2 hours" there is no Alert Triggered (and no "visible" Alert processing that I can find).
CPU Critical Threshold still at 95% on Tuesday 15:30
Alert "sees" this - the "Condition must exist for more than 2 hours" is met (has been at 95% for 2 hours) and the Trigger Actions are executed.
We get an Alert in the "Active Alerts" console and we get an Incident Ticket raised (via Event Log - SNow automation).
All of the above occurred within the Time of Day schedule set for the Alert (a Business Day, Tuesday, between 07:30 and 18:30; the events occurred between 13:30 and 15:30).
Sample Server B
CPU hits Critical Threshold (95%) on Tuesday at 18:15
Alert will "see" this - but due to the "Condition must exist for more than 2 hours" there is no Alert Triggered (and no "visible" Alert processing that I can find).
CPU Critical Threshold still at 95% on Tuesday 18:30
This is the "end" time of the Time of Day schedule for the alert.
Expected behaviour at this point is that the Alert evaluation would be terminated (disabled?) and no monitoring for this Alert's conditions would take place until 07:30 next morning.
19:00 - Server is shut down (SOP for many of the Test/Dev servers) (this is the actual event flow, but the issue doesn't change regardless of whether the server is up or down)
20:15 - CPU Critical Threshold still at 95% - presumably this is the last polled/recorded status before the server shut down, and therefore the data in the DB that the alert is evaluated against
Alert "sees" this - the "Condition must exist for more than 2 hours" is met (has been at 95% for 2 hours) and the Trigger Actions are executed.
We get an Alert in the "Active Alerts" console and we get an Incident Ticket raised (via Event Log - SNow automation).
So it seems that the Time of Day schedule does not reliably prevent Alerts from triggering: if the "Condition must exist for more than" period runs past the schedule's end time, the Alert still fires.
This is not what is expected (when setting the Time of Day schedule) and is not described/explained anywhere that I could find.
Logic/expectation says that the Alert should be disabled at 18:30 and no occurrences should appear after that time.
I am guessing that when the original alert conditions are met (the first time 95% is "seen"), the alert evaluation doesn't simply terminate and restart at the next 10-minute cycle. Instead, it seems to fork off a background task (or something similar) tied to that specific occurrence/object, which is then evaluated independently every 10 minutes and doesn't honour the ToD settings.
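The difference between what I expected and what I observed can be sketched as a small Python simulation. This is purely a model of my guess, not the product's actual logic. The function and parameter names are hypothetical. I also assume that evaluation ticks are aligned with the moment the threshold is first seen, that exactly 2 hours counts as meeting the "must exist" condition (as in the Server A timeline), and that the 18:30 end time is exclusive.

```python
from datetime import datetime, time, timedelta

EVAL_INTERVAL = timedelta(minutes=10)   # "Evaluate the trigger condition every 10 minutes"
MUST_EXIST_FOR = timedelta(hours=2)     # "Condition must exist for more than 2 hours"
WINDOW_START, WINDOW_END = time(7, 30), time(18, 30)


def in_schedule(ts: datetime) -> bool:
    """Mon-Fri between 07:30 and 18:30 (end time assumed exclusive)."""
    return ts.weekday() < 5 and WINDOW_START <= ts.time() < WINDOW_END


def first_trigger(condition_start, until, timer_honours_schedule):
    """Return the first evaluation time at which the alert fires, or None.

    timer_honours_schedule=True models the expected behaviour (nothing
    fires outside the ToD window); False models the guessed behaviour
    (the per-occurrence timer ignores the ToD window once started).
    """
    ts = condition_start
    while ts <= until:
        held_long_enough = ts - condition_start >= MUST_EXIST_FOR
        if held_long_enough and (in_schedule(ts) or not timer_honours_schedule):
            return ts
        ts += EVAL_INTERVAL
    return None


# 2024-01-02 is a Tuesday; the simulation only looks at the rest of that day.
end_of_day = datetime(2024, 1, 2, 23, 59)
server_a = datetime(2024, 1, 2, 13, 30)
server_b = datetime(2024, 1, 2, 18, 15)

print(first_trigger(server_a, end_of_day, True))   # 2024-01-02 15:30:00
print(first_trigger(server_b, end_of_day, True))   # None
print(first_trigger(server_b, end_of_day, False))  # 2024-01-02 20:15:00
```

Server A triggers at 15:30 under either model. For Server B, only the model where the timer ignores the schedule reproduces the 20:15 trigger I actually saw.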
I realise there are workarounds for this issue (e.g. overriding the Critical Thresholds for servers so they are only set after 12 polling cycles, and removing the "Condition must exist for more than" setting from the alert). I think, however, that setting the ToD schedule should work as described.
Any input, advice or pointers on what I may have missed are welcome.