Thoughts on “Best Practice” Alerts
Even outside of the IT world, the first thing that many people think of when they hear the term 'monitoring' is alerts. Whether it be a flashing light, a siren, or in the case of SolarWinds, an email in your inbox, alerts are an important part of a monitoring solution that tells you one thing: "There is something urgent you need to do."
Notice that key word there: 'DO'. When I work with customers, they often tell me the same thing: "We want visibility over everything", which may absolutely be true, but can often lead to massive scope creep within the monitoring solution. The better way to think about your monitoring, especially as you are getting started, is "What do I need to do?"
Let's look at a quick example: When a server goes down you know that you need to bring it back online. How would you do this?
- Wait to see if it restarts on its own.
- If it's physical, you can go and physically turn it off and on.
- If it's virtual, you can go to the hypervisor and reset it that way.
All of these are actions that can start to influence how you think your monitoring system should be configured. For each of these situations, let's look at some things that could make your life a little easier:
- You can add a delay before the alert fires so that any devices that simply restart are ignored.
- If it's physical, you can include details of the server's location within the alert email to help the engineer with their troubleshooting. These can be added to nodes via custom properties.
- If it's virtual, you can specify which hypervisor is hosting the machine, and even give a link to the management URL for quick access.
As you can see here, every decision we've made in SolarWinds is to bolster the activities that the humans on one end of this process need to carry out.
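To make the list above a little more concrete, here is a minimal Python sketch of how the extra email details might be chosen depending on whether a node is physical or virtual. It does not call SolarWinds at all: the `Node` class and the custom property names (`Location`, `Hypervisor`, `ManagementURL`) are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a monitored node and its custom properties."""
    name: str
    is_virtual: bool
    custom_properties: dict = field(default_factory=dict)


def alert_details(node: Node) -> list[str]:
    """Return the extra lines an engineer would want in the alert email."""
    lines = [f"Node: {node.name}"]
    props = node.custom_properties
    if node.is_virtual:
        # Virtual: point the engineer at the hypervisor and its console.
        lines.append(f"Hypervisor: {props.get('Hypervisor', 'unknown')}")
        if "ManagementURL" in props:
            lines.append(f"Management URL: {props['ManagementURL']}")
    else:
        # Physical: tell the engineer where they need to go.
        lines.append(f"Location: {props.get('Location', 'unknown')}")
    return lines


# Example usage with made-up nodes.
physical = Node("db01", False, {"Location": "Rack 4"})
virtual = Node("web01", True, {"Hypervisor": "esx-02",
                               "ManagementURL": "https://esx-02.example/ui"})
print("\n".join(alert_details(physical)))
print("\n".join(alert_details(virtual)))
```

The point of the sketch is simply that the content of the alert follows from the action the engineer will have to take.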
In order to support this, SolarWinds alerts have 2 main sections that can control the behaviour of alerts: The 'Condition' and the 'Action'.
The 'condition' is broken down further into the 'Context', 'Scope' and the 'Trigger'.
The Context is the type of device that is going to trigger the alert, and it can only be a single type. This could include 'Node', 'Interface', 'Group', or any other type of element. The context determines what information you can use in the trigger and actions later on, so it is very important.
The Scope allows you to filter to a subset of the devices that are in the context. For example, only Cisco nodes, or only nodes in a certain site. Not only does this make it easier to avoid alert noise, but it also has less of a performance hit when you have a large platform.
The Trigger defines exactly what causes the alert to fire. This could be because the status changed to 'Down' or 'Warning', or because a specific event fired. Again, remember that these conditions shouldn't only be what causes the alert to fire; they should be conditions that would require a human to get involved.
The Action is configured so that some sort of response to the trigger is sent out. This could be an email, which is most common, or an SMS message, or maybe a message sent into a Microsoft Teams channel. The key thing here is that it should contain all the information required to help the human carry out their response to this alert.
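One way to picture how these four parts fit together is the small Python model below. It is only a conceptual sketch, not how SolarWinds stores or evaluates alerts: the `AlertDefinition` class, the dictionary keys and the `cisco_down` example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Entity = dict  # hypothetical: one monitored element as a plain dict


@dataclass
class AlertDefinition:
    context: str                       # single entity type, e.g. "Node"
    scope: Callable[[Entity], bool]    # which entities of that type are covered
    trigger: Callable[[Entity], bool]  # what must be true for the alert to fire
    action: Callable[[Entity], str]    # what gets sent to the human

    def evaluate(self, entity: Entity) -> Optional[str]:
        """Return the action's message if the alert fires, else None."""
        if entity.get("type") != self.context:
            return None  # wrong kind of element for this alert
        if not self.scope(entity):
            return None  # right kind, but filtered out by the scope
        if not self.trigger(entity):
            return None  # in scope, but nothing needs doing
        return self.action(entity)


# A scoped alert: only Cisco nodes, only when they are Down.
cisco_down = AlertDefinition(
    context="Node",
    scope=lambda e: e.get("vendor") == "Cisco",
    trigger=lambda e: e.get("status") == "Down",
    action=lambda e: f"{e['name']} is Down: please investigate",
)
```

Evaluating `cisco_down` against a Juniper node that is down returns `None`, which is the scope doing its job of keeping noise out.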
With that covered, let's have a look at some out-of-the-box alerts. Below is the Trigger Condition page for this out-of-the-box alert. The first thing you'll notice is that you can't edit it, as system alerts are locked for editing. One of the first things you should do when you purchase SolarWinds is to 'Duplicate and Edit' any system alerts so that you have an editable copy, and then disable the original.
Let's point out the problems, going through each numbered section:
- The Context: No problems here.
- The Scope: Here you can see there is no scope defined, so this alert will fire for all monitored nodes. If this seems fine, ask yourself: Would your actions when a Domain Controller goes down be the same as when a firewall goes down? Because of this, your alerts should have a clear scope to help with troubleshooting.
- The Trigger: While 'Status is equal to down' is fine, we have to consider that we often see devices go down for a short period due to network blips or restarts and these don't necessarily warrant a human action. What we can do is add a delay to the trigger, using the section at the very bottom. This way it will only alert us if the device goes down and STAYS down.
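The idea behind the trigger delay can be sketched in a few lines of Python. This is an illustration of the logic only, under the assumption that we have a list of timestamped status samples; the `should_alert` function is hypothetical and is not a SolarWinds API.

```python
from datetime import datetime, timedelta


def should_alert(samples, delay):
    """Decide whether a 'Down' condition has persisted long enough to alert.

    samples: list of (timestamp, status) tuples, oldest first.
    Returns True only if the most recent unbroken run of 'Down'
    samples spans at least `delay`.
    """
    if not samples or samples[-1][1] != "Down":
        return False
    down_since = samples[-1][0]
    # Walk backwards to find when the current outage started.
    for timestamp, status in reversed(samples):
        if status != "Down":
            break
        down_since = timestamp
    return samples[-1][0] - down_since >= delay


# Example usage with made-up polling data and a 5 minute delay.
t0 = datetime(2024, 1, 1, 12, 0)
blip = [(t0, "Up"),
        (t0 + timedelta(minutes=1), "Down"),
        (t0 + timedelta(minutes=2), "Up")]
outage = [(t0, "Up")] + [(t0 + timedelta(minutes=m), "Down") for m in range(1, 8)]

blip_alerts = should_alert(blip, timedelta(minutes=5))      # device recovered
outage_alerts = should_alert(outage, timedelta(minutes=5))  # down and stayed down
```

In this sketch the short blip never produces an alert, while the outage that lasts past the delay does.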
Let's have a look at a better Condition from one of the alerts in the Prosperon Best Practices. You can see that this is a very specific alert, designed to tell our network engineers when an active Cisco device has high CPU usage that has persisted for longer than 10 minutes:
One key thing to note about this alert is that while it is based on the CPU usage of the node, there is NO hard-coded CPU value in the alert condition. Instead, it uses the node's threshold to decide whether or not to alert, giving you more control over the trigger for individual nodes.
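The difference between a hard-coded value and a per-node threshold can be shown with a short Python sketch. The field names and the default of 90% are assumptions for illustration only, not values taken from SolarWinds.

```python
DEFAULT_CPU_CRITICAL = 90  # assumed global default, in percent


def cpu_is_critical(node: dict) -> bool:
    """Compare CPU load against the node's own threshold, if it has one.

    `cpu_load` and `cpu_critical_threshold` are hypothetical field names.
    """
    threshold = node.get("cpu_critical_threshold", DEFAULT_CPU_CRITICAL)
    return node["cpu_load"] >= threshold


# A busy core router is allowed to run hotter than the default.
core_router = {"name": "core-rtr-01", "cpu_load": 92, "cpu_critical_threshold": 95}
# An access switch has no override, so the default applies.
access_switch = {"name": "acc-sw-07", "cpu_load": 92}
```

With the same 92% load, only the access switch would be considered critical here, and nobody had to edit the alert itself to get that behaviour.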
Next up, we need to talk about alert actions, as they are the part of the alert that is actually going to TELL you what is going on and, hopefully, how to respond. There are many types of alert actions you can choose, but we are going to focus on the most common, which is email. Most of these tips, however, will be relevant for any informational alert.
First, let's look at an example of a poor, out-of-the-box alert. As you can see, it does have some information such as the node name, the status of the node and a link to the node itself, but it has many problems. First, we have no indication of what this device is. Sure, there is a chance that you just recognize the name of the node and can identify it that way, but especially for people who are newer to the organisation that won't be the case. Second, we can see that there is a duplication of information across the subject and body, with no increase in detail. Finally, there is no real way to visually distinguish this alert from the many others you may be receiving on a daily basis.
Now, we can see a much better example of a best practice alert email that we use internally at Prosperon. First, we can see that the layout allows for more information to be included without increasing the confusion about the key matters. Secondly, this information can include things like what the Device Role is and the location of the device (useful for those jobs where you have to physically go and reboot), as well as links not only to the node page but also to the acknowledgement page that you can easily access once you are working on the resolution. Finally, we have a bright red icon that allows us to understand, at a glance, that this is a critical alert and should be treated as such.
As you can see here, the same tips have been applied to a Teams notification that is created using the 'Send a GET or POST request' action. This method of alerting is becoming more and more common for customers who are using corporate instant messaging services.
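As a rough sketch of what the body of such a POST request might look like, the Python below builds a JSON payload carrying the same details as the email: device role, location, and links to the node and acknowledgement pages. It assumes the legacy MessageCard layout accepted by Teams incoming webhooks; your channel's endpoint may expect a different schema, and the function name, parameters and example values are all hypothetical. Nothing is sent anywhere; the function only returns the JSON text.

```python
import json


def build_teams_card(node_name, status, role, location, node_url, ack_url):
    """Build a MessageCard-style JSON body for a critical node alert."""
    card = {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "themeColor": "FF0000",  # red, so the alert stands out at a glance
        "summary": f"{node_name} is {status}",
        "sections": [{
            "activityTitle": f"CRITICAL: {node_name} is {status}",
            "facts": [
                {"name": "Device role", "value": role},
                {"name": "Location", "value": location},
            ],
        }],
        "potentialAction": [
            {"@type": "OpenUri", "name": "View node",
             "targets": [{"os": "default", "uri": node_url}]},
            {"@type": "OpenUri", "name": "Acknowledge",
             "targets": [{"os": "default", "uri": ack_url}]},
        ],
    }
    return json.dumps(card, indent=2)


# Example usage with made-up values.
body = build_teams_card(
    "core-sw-01", "Down", "Core Switch", "London DC, Rack 12",
    "https://monitoring.example/node/42",
    "https://monitoring.example/ack/1001",
)
```

In a real alert action the literal values would be replaced by whatever variables your alerting platform provides for the node that triggered the alert.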
Tackling the issue of alerting can be a daunting task, as you must find and tread the line between receiving too many emails and missing something that requires an immediate response. The first step will always be starting small, with a small set of alerts that cover your most critical nodes. By using the alert scope, we can easily specify these devices to make sure that you don't begin receiving alert noise immediately. Following this, work out the exact criteria that will trigger each alert exactly when you need to see it, avoiding false positives as much as possible. Finally, making sure your trigger actions contain all the correct information will help speed up the resolution.
As you can see, just pivoting your thought process a little can help generate much more detailed and actionable alerts. What tricks have you implemented in your environment to help battle alert fatigue?