Tips & Tricks: Alerting Aggregation Period and Methods
Hello everyone! Today, I’ll share some valuable tips and tricks related to SWO alerting, specifically focusing on the mysterious “During Last” part of the condition.
In SWO, we offer various aggregation methods and allow you to choose the length of the aggregation period. The six aggregation methods are: Minimal, Maximal, Average, Sum, Last, and Count.
Let’s clarify what the aggregation period means. Contrary to a common misconception, it does not mean that we wait for the specified amount of time before triggering the alert. Instead, when we set an aggregation period (e.g., 1 hour), we take the data received during the last hour, apply the chosen aggregation method to it, and compare the result with the threshold.
For all the examples below, I’ll use the following condition:
responseTime > 5000 ms
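To make the “During Last” window concrete, here is a minimal Python sketch of the idea, not SWO’s actual implementation. The helper name `values_in_window`, the sample timestamps, and the sample values are all made up for illustration. It keeps only the data points from the last hour and then applies an aggregation method to them.

```python
from datetime import datetime, timedelta

THRESHOLD_MS = 5000  # from the example condition: responseTime > 5000 ms


def values_in_window(samples, now, period):
    """Return the values whose timestamp falls within (now - period, now]."""
    start = now - period
    return [value for ts, value in samples if start < ts <= now]


# Hypothetical responseTime samples as (timestamp, value in ms) pairs.
now = datetime(2024, 1, 1, 12, 0)
samples = [
    (now - timedelta(minutes=90), 7000),  # older than 1 hour, so ignored
    (now - timedelta(minutes=45), 4200),
    (now - timedelta(minutes=20), 6100),
    (now - timedelta(minutes=5), 3900),
]

window = values_in_window(samples, now, timedelta(hours=1))
average = sum(window) / len(window)
triggered = average > THRESHOLD_MS

print(window, round(average, 1), triggered)
```

Note that the 7000 ms sample does not count because it is older than one hour. With Average the alert stays quiet here, while Maximal would trigger on the 6100 ms value.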
Let’s break down the behavior of each of the six aggregation methods:
- Minimal:
- The alert triggers when the minimum metric value over the last 1 hour exceeds the threshold, which means every value received in that hour was above 5000 ms.
- Maximal:
- The alert triggers during peaks (marked with yellow and orange dots) where the metric exceeds the threshold.
- It remains active for 1 hour after the last peak, until that value drops out of the aggregation window.
- Average:
- The alert triggers somewhere around the orange dot peak and will reset in approximately one hour (if future numbers remain below the threshold).
- It also stays active for 1 hour.
- Sum:
- The Sum aggregation method calculates the total sum of the given metric values within the specified aggregation window (e.g., 1 hour).
- It’s particularly useful for metrics related to cumulative events, such as error counts, resets, or other similar occurrences.
- However, it may not be suitable for metrics like responseTime, where the sum of values might not provide meaningful insights.
- Last:
- The Last aggregation method considers only the most recent value (ignoring the aggregation window).
- It can be handy when you want to trigger an alert based on the latest data point.
- Be cautious: If the metric value is consistently close to the threshold, the alert may frequently send trigger and reset notifications.
- Count:
- The Count aggregation method focuses solely on the number of metric collections (regardless of the actual value).
- An interesting use case for Count is raising an alert when the metric stops reporting. For example, if the count equals 0 for the aggregation period, no data points were received, which indicates that the metric has stopped reporting.
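The six methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `aggregate` function and the sample values are hypothetical, the input list is assumed to hold the data points in chronological order, and Last is simplified to the newest value in that list.

```python
def aggregate(values, method):
    """Apply one of the six aggregation methods to a list of metric values."""
    if method == "Count":
        return len(values)  # number of data points, whatever their values
    if not values:
        return None  # nothing to aggregate for the other methods
    if method == "Minimal":
        return min(values)
    if method == "Maximal":
        return max(values)
    if method == "Average":
        return sum(values) / len(values)
    if method == "Sum":
        return sum(values)
    if method == "Last":
        return values[-1]  # most recent data point only
    raise ValueError(f"unknown aggregation method: {method}")


# Hypothetical responseTime values (ms) received during the last hour.
values = [4200, 6100, 3900]

for method in ["Minimal", "Maximal", "Average", "Sum", "Last", "Count"]:
    print(method, aggregate(values, method))

# Count-based check for a metric that has stopped reporting.
stopped_reporting = aggregate([], "Count") == 0
print("stopped reporting:", stopped_reporting)
```

With these values and the 5000 ms threshold, only Maximal (6100) and Sum (14200) exceed the threshold. The Sum result shows why that method is a poor fit for a metric like responseTime.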
Regarding the aggregation period, a shorter period (e.g., 5 minutes) resets the alert sooner, so you may see alert flickering (many triggers and resets). Overall, we recommend using longer periods, at least 1 hour, for the best experience. A much longer period (e.g., a few days) triggers the alert once (e.g., at the first peak) and keeps it active throughout the displayed timeframe.
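The flickering effect can be reproduced with a small simulation. This is a rough sketch, not how SWO evaluates alerts internally. It assumes one data point per minute, the Maximal method, a window measured in data points, and a made-up series that spikes every 10 minutes. The function name `count_triggers` is hypothetical.

```python
def count_triggers(values, window, threshold):
    """Count how often the alert goes from inactive to active (Maximal method)."""
    triggers = 0
    active = False
    for i in range(len(values)):
        # The most recent `window` data points, including the current one.
        recent = values[max(0, i - window + 1): i + 1]
        firing = max(recent) > threshold
        if firing and not active:
            triggers += 1
        active = firing
    return triggers


# Two hours of per-minute samples with a 6000 ms spike every 10 minutes.
values = [6000 if minute % 10 == 0 else 3000 for minute in range(120)]

short_period = count_triggers(values, window=5, threshold=5000)
long_period = count_triggers(values, window=60, threshold=5000)
print(short_period, long_period)
```

With the 5-minute window the alert resets between spikes and triggers again at every one of them. With the 60-minute window there is always a spike inside the window, so the alert triggers once and stays active.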