SQL Sentry Tips and Tricks: Tuning Advisory Conditions

If you’ve read my recent Tips & Tricks blogs around alert tuning — SQL Server Alert Tuning Basics With SQL Sentry and SQL Sentry Tips and Tricks: Common Alert Tuning Examples —then you know it’s a balancing act.

How do you find that perfect balance? Start with a data-driven approach.

In this blog, we’ll discuss tuning Advisory Conditions. You may find that default Advisory Conditions: Available Actions  are sensitive, but this is actually by design. By having default settings set a bit more toward the sensitive side, these alerts build up data that you can use to drive strategy for alert tuning. Let’s go through where that data is, and how to use it.

As we begin, there are three main questions to consider.

  1. For the occurrence being triggered, how long is it active?
  2. When are these occurrences triggering?
  3. What are the values ultimately triggering these occurrences?

Where to Start

The Environmental Health Overview on the Start Page is a good place to start, as it can quickly show you what conditions have high counts. Set the sample range to the max 24 hours and sort the results by Count descending.

Using the Environmental Health Score to identify Advisory Conditions with high counts
Using the Environmental Health Score to identify Advisory Conditions with high counts

This will give you some heavy offenders in creating noise to start with.

In larger environments, it may be better to query the SentryOne database for a broader overview. The below queries will assist with this. Note that there are two different groups. One for Advisory Conditions sending emails and another for Advisory Conditions set to Send to Alerting Channels (the brunt of what shows up on the Environmental Health Score). Additionally, I recommend that you narrow the time range to just the last 30-90 days. The default alert retention is a year. So, you want to make sure you’re looking at recent data, especially as you go through additional rounds of tuning.

Global Advisory Condition counts – Send to Alerting Channels/Health Score:

/* Global Advisory Condition counts – Send to Alerting Channels/Health Score*/
SELECT dcd.NAME AS Condition
     , Count(*) AS TheCount
FROM DynamicConditionDefinition dcd
INNER JOIN AlertingChannelLog acl
     ON dcd.ID = acl.DynamicConditionID
--WHERE acl.NormalizedStartTimeUtc > '2019-03-25 18:05:40.807'
-- AND ObjectName = ‘%<ServerName>%’ 
-- AND dcd.name = ‘%<ConditionName>%’ 
GROUP BY dcd.NAME
ORDER BY thecount DESC

Global Advisory Condition counts – Send Email:

/* Global Advisory Condition counts – Send Email*/
SELECT dcd.NAME AS Condition
     , Count(*) AS TheCount
FROM DynamicConditionDefinition dcd
INNER JOIN ObjectConditionActionHistory ocah
     ON dcd.ConditionID = ocah.ConditionTypeID
WHERE ActionTypeName = 'Send Email'
-- AND EventStartTime > '2018-05-21 10:49:06.137'
-- AND ObjectName = ‘%<ServerName>%’ 
-- AND dcd.name = ‘%<ConditionName>%’ 
GROUP BY dcd.NAME
ORDER BY thecount DESC

Advisory Condition counts by target – Send to Alerting Channels/Health Score:

/* Advisory Condition counts by target - Send to Alerting Channels/Health Score*/
SELECT ObjectName AS Target
     , dcd.NAME AS Condition
     , Count(*) AS TheCount
FROM DynamicConditionDefinition dcd
INNER JOIN AlertingChannelLog acl
     ON dcd.id = acl.DynamicConditionID
-- WHERE acl.NormalizedStartTimeUtc > '2019-03-25 18:05:40.807'
-- AND ObjectName = ‘%<ServerName>%’ 
-- AND dcd.name = ‘%<ConditionName>%’ 
GROUP BY ObjectName
     , dcd.NAME
ORDER BY thecount DESC

Advisory Condition counts by target – Send Email:

/* Advisory Condition counts by target – Send Email*/
SELECT ObjectName AS Target
     , dcd.NAME AS Condition
     , Count(*) AS TheCount
FROM DynamicConditionDefinition dcd
INNER JOIN ObjectConditionActionHistory ocah
     ON dcd.ConditionID = ocah.ConditionTypeID
WHERE ActionTypeName = 'Send Email'
-- AND EventStartTime > '2018-05-21 10:49:06.137'
-- AND ObjectName = ‘%<ServerName>%’ 
-- AND dcd.name = ‘%<ConditionName>%’ 
GROUP BY ObjectName
     , dcd.NAME
ORDER BY thecount DESC

How to Get to the Data

It's important to tune at the global level before getting too inundated at the, more granular, site, group, and target levels. This will ensure that any new targets have a tuned alerting system applied to them by default. It will also provide better data to work with as you begin to work deeper into the monitoring hierarchy.

Once you've found a condition to start with, open the Global Advisory Conditions List under the All Targets node in the Navigator pane.

Global Conditions List
Global Conditions List

Once there, you can filter the conditions list by typing into the Name column. Then right click the condition > Show Event Logs.

Filtering for and opening the Event Logs for a specific Advisory Condition in the Conditions List
Filtering for and opening the Event Logs for a specific Advisory Condition in the Conditions List

Now you should see the historical occurrences for the condition in question.

What to Do with the Data

As I mentioned earlier, three data points I like to look at are how long the occurrences being triggered are active for, when these occurrences are triggering, and what values are triggering these occurrences.

Occurrence Duration

I like starting with analyzing the duration spectrum of the occurrences. This way you’re less likely to obscure issues, because you can adjust how long a condition must be active rather than fiddling with the evaluation threshold.

Start by sorting the Duration column in ascending order. Then slowly scroll down the result set and look for the distribution of durations. The default duration threshold for most Advisory Conditions is around 1 minute. Below you can see the result sets’ duration values shifting from 3 to 4 minutes. Looking at the scroll bar, you will see that we are about halfway through the result set. If I increase the duration threshold from 1 minute to 4 minutes, I will effectively reduce overall volume and potential noise for this condition in half.

Viewing the duration spectrum of a specific Advisory Condition in the Event Logs
Viewing the duration spectrum of a specific Advisory Condition in the Event Logs

Next, I like to scroll to the very bottom to see what my upper bound is. These occurrences tend to be those tied to maintenance periods and/or periods of extremely heavy contention. As you begin to go through multiple iterations of this process, it may be helpful to further filter by Start Time. This excludes events from before previous tuning efforts. Do this by hovering over the Start Time column > selecting the funnel. You can do the same for the server name column when you are ready to review per target.

Filtering by Start Time in the Event Logs
Filtering by Start Time in the Event Logs

If you ever need to reset the filter, simply close out of it in the bottom left. Remove the check in the same area to keep the filter, but disable it temporarily.

Removing a filter in the Eveent Logs
Removing a filter in the Events Log

When Are the Occurrences?

After tuning by duration I move on to looking at the time of day. Sort the result set by Start Time to try and see if there’s a particular pattern for when the Advisory Condition in question is triggering. Maintenance periods are common culprits here. There may not be an obvious pattern during the first iteration of alert tuning. This is another reason I like to start with duration, as it helps reduce the dataset, making it easier to identify a pattern.

If you do identify a pattern and/or a time range that you do not benefit from being alerted on, then a Maintenance Window at the condition level can stop you from getting emails during those time periods.

To configure a Maintenance Window, navigate to the Conditions pane > single click All Targets in the Navigator pane to set the Conditions pane to the Global scope(or desired level of topology) > select Advisory Conditions in the Condition Pane > select the condition in question > Rulesets tab. Then select ‘New’ under the Window section to create a time frame in which you do not want to be alerted. Then make sure you set Window Behavior to “Don’t perform actions that occur within the window.”

Applying a Window to an Advisory Condition
Applying a Window to an Advisory Condition

Rulesets and Windows are not yet supported for the “Send to Alerting Channels” action. If you're not getting these options, that could be why. Make sure to choose a sample period that makes sense when using the Environmental Health Score. I’m personally a fan of the 3-hour view, which gives you a good range that is also narrow enough to ensure relevance.

Sample Range options for the Environmental Health Score
Sample Range options for the Environmental Health Score

What Values Are Triggering Occurrences?

Now that you’ve gone through a cycle or two of tuning based on when and how long a condition has been active, you should have a better result set to work with to identify whether now is the time to increase evaluation thresholds.

When looking at the Event Logs, you can click through each row of occurrences and see the evaluation information update at the bottom. As you’re clicking each row, make sure you’re clicking the Start Time column. If you begin clicking on each occurrence by clicking the End Time column value, you’ll see the last evaluation result that was no longer true for a given condition. That could be a bit misleading.

Event Logs, showcasing selecting rows under the Start Time column

Event Logs, showcasing selecting rows under the Start Time column

Finally, go through several occurrences to look for a possible pattern or a potential new baseline.

Conclusion

We discussed a general process for tuning out noise from Advisory Conditions. Doing so will protect your email inbox and make data shown in the Environmental Health Score and on the Dashboard via the “Send to Alerting Channels” more meaningful. It is also important that you remember to review monitoring settings regularly. As issues come up, you should reflect on what alerts you received, or alerts you didn’t receive—but should have, and add or modify conditions accordingly. This will ensure efficient and actionable alerting across your SQL Server estate.

Thwack - Symbolize TM, R, and C