Resource Metric Baseline

Introduction

DPA can currently alert on resource metrics that exceed specific threshold values. However, some customers have wanted DPA to alert based on baselines. For example, my CPU typically averages about 40% and never deviates that much, but I want to know if that jumps to 60%. The default threshold for CPU % is 90% and an alert would not get triggered. However, if you calculated baselines for CPU, it might have a 90th percentile value of 47, so if you use that within the alert, 60% would trigger the alert.

This is similar to the baselines that can be turned on when viewing resource metrics in DPA. In this chart the blue line is each metric value and the gray shading is the 90th and 10th percentiles.

90th Percentile

The attached script uses a CTE to calculate 90th and 10th percentiles based on 30 days of detailed data for each metric. These percentile values are bucketed into hourly timeframes so we can compare today at 9:15 am to the "normal" 9:00 am - 10:00 am values from history. The most recent values are compared to the 90th percentile only (but you could add a WHERE clause if you also want to know when a metric goes too low below the 10th percentile) and if they are more than 25% higher, the alert gets triggered. I found that if I used just the raw 90th percentile values, the alert was a little too noisy. However, if the current value is quite a bit higher, that seemed more reasonable.

Specific Metrics

This alert is also is configured to look at specific metrics I care about and these metrics must be ones where the larger the number is a problem. For example, Active Sessions would concern me if it suddenly went higher from 7 to 20. Metrics like Page Life Expectancy should get triggered the lower the value is, and would probably be compared against the 10th percentile as well and this alert does not support it currently. If you want to adjust the alert to handle metrics like me, post a comment if you have questions.

Adjusting the Alert Script

At the top of the script are two variables that can be adjusted. The DaysOfHistory should remain at 30 because that's is the default data retention for detailed data. Also, the list of metrics can be changed as well to include other metrics. The list of metrics that are collected can be found by running a query against the CON_METRICS_NAMES_XX where XX is the associated ID value from the COND table for the instance.

Near the bottom of the query you can adjust the 25% higher than the 90th percentile as well as a floor on the value of the metric. If some metrics are always low, and then they double, it is probably still not an issue. For example, if Active Sessions is typically 1, and it jumps to 3, it would get excluded.

Configuring the DPA Alert

To add this alert to DPA, create a custom alert of type Custom SQL Alert - Multiple Numeric Return and make it look similar to this: