I manage an NPM environment where we have categorized all of our nodes by impact and urgency to make it easier for the NOC/Service Desk to categorize and respond to alerts. To leverage that information further, we use the impact to define the Stat Collection interval and the urgency to set the poll interval. This gives higher priority nodes more granular availability and performance statistics and lets us dedicate compute resources to the nodes that need the increased granularity.
When building alerts we want to delay triggering performance alerts for a period of time to smooth out the spikes in performance. We are already using the average metric to help smooth the wrinkles (where available), but would like to make sure that we aren't over-smoothing. Let me explain.
Say we build an alert that triggers when current CPU utilization is > 98% (I don't like to use 100%, as a check of = 100% seems too absolute for my liking), but only after this condition has existed for 30 minutes. This works fine for nodes that collect statistics on the default 15 minute Stat Collection interval, but we have nodes that collect on both 5 and 10 minute intervals. Those higher priority nodes would need 6 and 3 (respectively) consecutive polling periods with CPU usage above 98% before an alert was triggered, whereas a node on the default interval would only need 2 polling periods.
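To make the arithmetic concrete, here is a rough Python sketch of how a fixed 30 minute delay translates into a different number of required samples per interval. The function name is made up for illustration and is not anything from NPM; it also assumes one statistic is collected per Stat Collection interval.

```python
def polls_required(delay_minutes: int, stat_interval_minutes: int) -> int:
    """Number of collections that must all breach the threshold
    before a fixed alert delay has elapsed (rounded up)."""
    # Ceiling division without importing math
    return -(-delay_minutes // stat_interval_minutes)


# Fixed 30 minute delay applied to each of our Stat Collection intervals
for interval in (5, 10, 15):
    print(f"{interval} min interval: {polls_required(30, interval)} samples")
```

So the same alert is much harder to trip on the nodes we care about most, which is the opposite of what we want.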
I'd like to avoid having to make an alert for each impact level where I adjust the 'alert after' condition to reflect the Stat Collection interval, but I can't think of a way to do this in a single alert statement. For example, I was thinking that I would change the 'alert after' condition to match 2 x the Stat Collection interval. This would ensure at least 2 consecutive statistics above the threshold to help smooth things out a little, but it does mean a different alert for every Stat Collection interval.
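What I'm after, expressed outside of the alert engine, is roughly the logic below. This is only a sketch of the rule in Python, not something NPM can run; the `Node` class, the field names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

CPU_THRESHOLD = 98.0   # percent, same threshold as the alert above
REQUIRED_SAMPLES = 2   # consecutive statistics that must breach it


@dataclass
class Node:
    name: str
    stat_interval_minutes: int  # hypothetical stand-in for the Stat Collection interval


def alert_delay_minutes(node: Node, required_samples: int = REQUIRED_SAMPLES) -> int:
    """Delay scaled to the node's own interval instead of a fixed 30 minutes."""
    return required_samples * node.stat_interval_minutes


def should_trigger(cpu_samples: Sequence[float],
                   threshold: float = CPU_THRESHOLD,
                   required_samples: int = REQUIRED_SAMPLES) -> bool:
    """True when the most recent `required_samples` readings are all above threshold."""
    recent = cpu_samples[-required_samples:]
    return len(recent) == required_samples and all(s > threshold for s in recent)


for node in (Node("high", 5), Node("medium", 10), Node("default", 15)):
    print(node.name, alert_delay_minutes(node), "minutes")
```

With this rule every node needs the same number of bad samples, and the wall-clock delay simply follows from its interval.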
Any ideas? I'd like to try to keep away from custom SQL, but I'll go that route if I have to in order to keep it to a single alert statement. (Maybe .. )