Ah, "standard alerts", how I both love and hate thee.
First, I'd be remiss if I didn't direct your attention to the free "Monitoring 201" ebook, which delves into this exact topic - what alerts should be standard, how the out-of-the-box alerts were (for the most part) NEVER intended to be used as-is, and how to create alerts that are meaningful, actionable, and useful. You can find it here: eBook Resources – SolarWinds
Second, I'd be equally remiss if I didn't direct your attention to the Alert Lab forum, where such things are discussed every day: Alert Lab
And third, this has been the topic of more than a few video discussions. Here is but one example: All About Alerts - SolarWinds Lab #42 - YouTube
But that's just preamble. Let me try to take a few short swipes at some answers.
Standard alerts, best practices alerts, common alerts, etc.
On the one hand, I could be the curmudgeonly contrarian and say that no alert is standard, it all has to do with your particular environment, so stop looking for someone else to do your work for you. However, that would be unkind.
Depending on the device type you are monitoring, you can probably come up with a list of the types of alerts you want:
- device availability (up/down, part 1) - possibly grouped by device class (server, router, switch), device ownership (web team, database team, Site XYZ local support team, etc), device class of service (business critical, dev, etc), or a combination of those groupings and more. But the point is that you will need some kind of "it is down" alerting, along with the "it's all better" message when it comes back up.
- sub-component availability (up/down, part 2) - this is for all the little peopl... I mean elements on the device. Disks, interfaces, sub-interfaces, controllers, and the like. With the same caveat that you will likely want to group them based on some set of criteria.
- Something something disk. You'll likely want one (or more) types of alerts on disk usage. The most obvious being disk space, but also IOPS, disk errors, and potentially others.
- something something CPU. You'll also likely want an alert that indicates a machine is using too much CPU. However, it's frighteningly easy to get this wrong by not being specific enough. As I've mentioned ad nauseam elsewhere on thwack, the technical term for a machine that has high CPU utilization but is keeping up with its workload is: "correctly sized". Also note that what constitutes a CPU issue is wildly different for Windows servers versus Linux versus routers versus IoT.
- something something RAM. Everything I just said about CPU, except for RAM.
- Salt to taste. This is where everything else goes. Your network error alerts may go here. Or your network team might say that they don't care about such trivialities. The same goes for firewall errors. Or F5 configuration changes. Etc. This is where the "work of the work" of being a monitoring specialist comes in.
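To make the "it is down" plus "it's all better" pairing from the first bullet concrete, here is a minimal sketch in Python. It is not how any particular monitoring product implements it; the `AvailabilityTracker` class, the group labels, and the message format are all made up for illustration. The only point it shows is that you send a message on a change of state, not on every poll.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AvailabilityTracker:
    """Hypothetical helper: emit a message only when a node changes state."""

    # node name -> last known status ("up" or "down")
    last_status: Dict[str, str] = field(default_factory=dict)

    def observe(self, node: str, status: str, group: str = "ungrouped") -> Optional[str]:
        # Assumption: a node we have never seen before is treated as "up".
        previous = self.last_status.get(node, "up")
        self.last_status[node] = status

        if previous == "up" and status == "down":
            return f"[{group}] ALERT: {node} is down"
        if previous == "down" and status == "up":
            return f"[{group}] RESET: {node} is back up"
        # No state change, so no message.
        return None


tracker = AvailabilityTracker()
for status in ["up", "down", "down", "up"]:
    message = tracker.observe("web01", status, group="web team")
    if message:
        print(message)
```

The `group` argument stands in for whatever grouping you settle on (device class, ownership, class of service), so the message lands with the people who own the device.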
As for the out-of-the-box alerts, here's my observation after 20 years implementing monitoring, and a couple of years working "on the inside" for a vendor: the alerts we include are examples, not best practices. We include alerts to show HOW a technique can be done. A good example is the "High CPU, top 10 processes" alert. Should you alert when CPU > 90%? That's probably too simplistic. But we created a simple-to-understand alert so you could then see how we implemented the "grab the top 10 processes at the time of the alert" technique. I view sample alerts from the vendor the same way a chef would view a recipe on the back of the box of soup mix. A cute, if somewhat simple, suggestion that is meant to inspire you to think differently about how to use that soup mix.
Honestly, more important than "what are the best standard alerts to set up?" are questions like these:
- Does someone actually want this alert, or am I creating it because I think it's expected?
- Is this alert specific enough that it points to a real issue?
- When I trigger this alert, what will someone (a human) DO in response?
- spoiler: if the answer is "nothing", don't create the alert.
- When I trigger this alert, what can monitoring do automatically that would either relieve the human from doing anything, or help the human solve the problem faster?
- Is there a documented procedure to respond to this alert, so that when the primary recipient is on vacation, the next sucke... volunteer has a way of knowing what needs to be done?
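If it helps to make that checklist harder to skip, the questions can be turned into a small review step that runs before an alert gets created. This is only a sketch: `AlertProposal`, its fields, and `review` are hypothetical names, and it treats "monitoring handles it automatically" as an acceptable answer to the "what will someone DO" question, which is my own reading. The "is it specific enough" question is left out because it needs a human judgment call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertProposal:
    """Hypothetical record describing an alert somebody wants to create."""

    name: str
    requested_by: str = ""      # who actually asked for this alert
    human_response: str = ""    # what a person will do when it fires
    automated_action: str = ""  # what monitoring can do on its own
    runbook: str = ""           # where the documented procedure lives


def review(proposal: AlertProposal) -> List[str]:
    """Return the reasons this alert is not ready; an empty list means go ahead."""
    problems = []
    if not proposal.requested_by:
        problems.append("nobody has asked for this alert")
    if not proposal.human_response and not proposal.automated_action:
        problems.append("nothing happens when it fires, so do not create it")
    if not proposal.runbook:
        problems.append("no documented response procedure")
    return problems


proposal = AlertProposal(name="High CPU on app servers", requested_by="web team")
for problem in review(proposal):
    print(f"{proposal.name}: {problem}")
```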
Even if asking those questions causes you to create fewer alerts, the ones you DO create will be more useful and will create a level of trust in your organization that when an alert comes in, it is relevant, urgent, and valid. That, in turn, will get people in your org to solicit your help in creating the alerts they NEED.
And THAT is the best set of "standard" alerts you could have. The ones your organization has self-identified as being important.
To calm down those floods of alerts for remote sites, you probably want to set up some dependencies.
This technique is a favorite of mine for managing lots of dependencies and not having to mess with the GUI as much, but it requires some familiarity with SQL.
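For anyone unfamiliar with what a dependency buys you, here is the idea stripped of any product specifics. It is a Python sketch, not the SQL technique mentioned above and not how the product's dependency engine works internally; the function name, the `parent_of` mapping, and the node names are invented for the example. When a remote site's router is down, everything behind it is unreachable, so only the router should generate an alert.

```python
from typing import Dict, Iterable, List


def alerts_to_send(down_nodes: Iterable[str], parent_of: Dict[str, str]) -> List[str]:
    """Return only the down nodes that have no down ancestor.

    parent_of maps a child node to the node it depends on
    (for example, a branch switch depends on the branch router).
    """
    down = set(down_nodes)
    actionable = []
    for node in sorted(down):
        seen = {node}  # guards against an accidental loop in the mapping
        parent = parent_of.get(node)
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                # An upstream device is down, so this node is just unreachable.
                suppressed = True
                break
            seen.add(parent)
            parent = parent_of.get(parent)
        if not suppressed:
            actionable.append(node)
    return actionable


parent_of = {
    "branch-sw1": "branch-rtr",
    "branch-srv1": "branch-sw1",
    "hq-srv1": "hq-rtr",
}
print(alerts_to_send(["branch-rtr", "branch-sw1", "branch-srv1"], parent_of))
```

With the whole branch unreachable, only `branch-rtr` comes back, which is the one alert somebody can actually act on.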
As far as standard deployments go, the out-of-the-box conditions are alright in many cases, but I usually redo anything that is set to a hard-coded threshold (like CPU load at 90%). I switch it to checking whether that metric is over the node's critical threshold and put some kind of time constraint on it, like it has to be high for at least 30 minutes or so before I alert on it. I usually try to tie the alerts to the idea of "what kind of things will we ACTUALLY get up out of bed to fix, and what kind of stuff can wait until tomorrow for me to see it on a dashboard or report?"
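In the product you would set this up in the alert's trigger condition, but the logic is easy to show on its own. Here is a rough Python sketch, where `sustained_breach`, the sample format, and the 5-minute polling interval are all assumptions for the example: the alert fires only if the most recent unbroken run of samples above the critical threshold has lasted at least the hold time.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def sustained_breach(
    samples: List[Tuple[datetime, float]],
    critical: float,
    hold: timedelta = timedelta(minutes=30),
) -> bool:
    """samples are (timestamp, value) pairs sorted oldest first.

    True only if the latest samples have ALL been above `critical`
    for a stretch of at least `hold`.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > critical:
            if breach_start is None:
                breach_start = timestamp  # start of a new run above threshold
        else:
            breach_start = None  # any dip below the threshold resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold


start = datetime(2024, 1, 1, 2, 0)
# Assumed 5-minute polling: eight samples covering 35 minutes, all at 95% CPU.
high = [(start + timedelta(minutes=5 * i), 95.0) for i in range(8)]
print(sustained_breach(high, critical=90.0))
```

Passing `critical` in per node, instead of baking 90 into the function, is the same idea as using each node's own critical threshold rather than one hard-coded number.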
There aren't many situations that apply to most environments that aren't covered by the OOTB alerts. I just find I have a lot of preferences regarding how I think we should be given that information.