My alerts are getting quite diverse to say the least.
Is there a way to nest or group alerts into single logic where say:
I want to trigger an alert if a Linux fixed disk is over 90% full, or if the server is down for more than 10 minutes.
Or for a Windows example, I alert if a component has a bad status, and I have another alert if the server is down for 10 minutes.
I have these set individually, but I think it would be beneficial to be able to group them based on server scope. It wouldn't be much of an issue, but managing 3 different sets of alerts going to 50 different groups is becoming a bit of a task. I keep coming up empty-handed in my searching. At first I thought complex conditions would do the trick, but if I'm reading the documentation (and videos) correctly, they evaluate as an AND statement rather than an OR statement. Should I take a step back and approach this from an Application Template perspective and hope the template is out there?
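For what it's worth, the grouped trigger described above is just a two-clause OR: either condition alone should fire the single alert. A minimal sketch in Python (the function and threshold names here are mine for illustration, not anything from Orion's alert engine):

```python
def should_alert(disk_used_pct, down_minutes,
                 disk_threshold=90.0, down_threshold=10):
    """Fire one grouped alert if EITHER condition is true:
    a fixed disk above the usage threshold, OR the node has
    been down at least the downtime threshold (minutes)."""
    return disk_used_pct > disk_threshold or down_minutes >= down_threshold

# A disk at 92% triggers even though the node is up:
print(should_alert(disk_used_pct=92.0, down_minutes=0))   # True
# Neither condition is met, so no alert:
print(should_alert(disk_used_pct=50.0, down_minutes=3))   # False
```

This is the OR semantics the built-in complex conditions don't provide when they're combined as AND.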
Creating a Template for these would allow you to create a single alert definition for it. I have not created one with mixed types of items like you are looking at, but given what I have seen on Thwack and in the built-ins, I am sure it is possible.
tomainnelli, thank you for your response. So are you thinking something like building it as a component in a template and simply alerting on components rather than, say, volumes? That is also an option, but then the question becomes how to create a component script that monitors any fixed disk and another component that monitors server response. Hmm.
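On the "any fixed disk" part: the checking logic itself is simple enough — walk a list of mount points and report the worst one. A rough sketch in Python (real SAM component scripts are usually PowerShell or VBScript, and the output lines below are only illustrative, not SolarWinds' actual statistic/message syntax):

```python
import shutil

def check_disks(mount_points, threshold_pct=90.0):
    """Return (worst_pct, messages) for a list of mount points.
    Any disk over threshold_pct would flip the component critical."""
    worst = 0.0
    messages = []
    for mp in mount_points:
        usage = shutil.disk_usage(mp)
        pct = usage.used / usage.total * 100
        worst = max(worst, pct)
        if pct > threshold_pct:
            messages.append(f"{mp} is {pct:.1f}% full")
    return worst, messages

worst, msgs = check_disks(["/"])
# Illustrative output only -- a real SAM script must follow the
# documented Statistic/Message output format for script monitors.
print(f"Statistic: {worst:.1f}")
print(f"Message: {'; '.join(msgs) or 'all disks OK'}")
```

The point is just that one script can cover every fixed disk on a node, so you don't need a component per drive letter.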
Why would you want to raise the same alert for a disk space issue as you would for a node down issue? The troubleshooting process would be entirely different in each case, and depending on how your company slices things up, it could be different teams handling the issue. If this alert had already triggered because a disk was nearly full, you wouldn't get an additional message when the server crashed; someone might not investigate because they think low free space isn't urgent, even though node down should be treated urgently.
I have a pretty standard set of alerts I build with most clients, and we rarely need more than 35 or so actual alert definitions to cover all the cases that matter across all modules and teams.
The most common issue I find is that people don't make full use of custom properties and make their alert logic too specific.
The most generic example is having several flavors of node down alerts: a "network switch down" separate from "Windows server down" and "Linux server down" is usually not required. If the alert action is basically the same and the only difference is who the message goes to, then you can use an OwnerEmail custom property to create logic like "if any node goes down, send an email to the owner of that node."
People can also go pretty crazy on the SAM app monitor side, with logic like "if such-and-such Exchange component on this server goes down, email the Exchange person," with 100 variations on which component and who to notify. I tend to consolidate it down to a single alert condition: if any application is down, notify the address in the Application Owner custom property.
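The consolidation described above amounts to replacing N per-team alert definitions with one generic rule plus a per-object lookup. A sketch in Python (the node records and email addresses are made up; in Orion the lookup is the custom property itself):

```python
# Hypothetical node records, each carrying an OwnerEmail custom property.
nodes = [
    {"name": "sql01", "status": "down", "OwnerEmail": "sql-team@example.com"},
    {"name": "web07", "status": "up",   "OwnerEmail": "web-team@example.com"},
    {"name": "lnx03", "status": "down", "OwnerEmail": "linux-team@example.com"},
]

def route_node_down_alerts(nodes):
    """One generic rule -- 'if any node is down, email its owner' --
    instead of a separate alert definition per team."""
    return [(n["name"], n["OwnerEmail"])
            for n in nodes if n["status"] == "down"]

for name, email in route_node_down_alerts(nodes):
    print(f"ALERT: {name} down -> notify {email}")
```

Adding a 46th team then means setting a property value on its nodes, not cloning another alert definition.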
So if your alert scheme is becoming overwhelming, you might see where you can condense it down from granular logic to more generic situations through the use of custom properties or the built-in threshold fields.
How many alert rules do you have in place right now? How many messages does your Orion send out on most days? How big is the environment in terms of total nodes?
mesverrum, thank you for your response. I suppose I am just trying to untangle everything and get it all straight. We have just north of 900 servers and 45+ different groups that are notified separately, as well as a tier 2 24/7 incident response center that currently gets any volume on any server over 98%, server down, and select services for mission-critical servers. I'm about an eighth of the way through and have 216 alerts currently on. I have 3 sets of alerts configured for the working groups currently (Windows as an example):
1. Trigger volume alert if any get over 95% and send an email.
2. Trigger node down alert if a server node goes down and send an email.
3. Trigger an alert when a component goes critical (say, the Windows Server service going down) and send an email with the component variable that displays what happened.
Based on what you said, do I have the component one wrong, and will it only key once? I was under the impression that it would key for any component that goes critical. Ha ha, I feel like I was heading in the right direction, but now I am questioning my very being.
I don't see why you need 216 alert definitions to cover the 3 cases. Is the difference just that you create a different alert depending on who gets notified, but otherwise the alerts are the same?
Assuming you have so many alerts because you need to control where the alerts are going, you should probably read this KB and see if it can be implemented.
I apologize, I should have clarified. An example would be: the SQL group gets one set of alerts, Linux gets another set, and so on, application by application or cluster by cluster depending on scope.
Right, thank you. I will read the article and see if it helps. I suppose I was hoping the alert logic could say "if anything is wrong with this set of nodes (any volume, component, or node down)" and send an email indicating the issue reported by the database. I suppose I could attempt what was discussed above, where I have a component monitor for the things I need. I'm just getting wrapped around the axle trying to grab all drives or overall node status, as SolarWinds seems to treat those differently. Thanks to you two for your help so far.
I thought you were trying to solve the problem of standardizing a collection of alerts so that you could easily deploy it against a group of servers. mesverrum makes a good point about using a custom property to send the emails where they need to go. That has been very useful for us.