15 Replies Latest reply on Aug 27, 2019 1:30 PM by thwackdude

    Nesting or Grouping Alerts Clarification

    thwackdude

      My alerts are getting quite diverse to say the least.

      Is there a way to nest or group alerts into single logic where say:

      I want to trigger an alert if Linux fixed disk is over 90% or if the server is down for more than 10 minutes.

      Or for a Windows example, I alert if a component has a bad status, and I have another alert if the server is down for 10 minutes.

       

      I have these set individually, but I would think it of benefit to be able to group them based on server scope.  It wouldn't be much of an issue, but managing 3 different sets of alerts going to 50 different groups is becoming a bit of a task.  I keep coming up empty-handed in searching.  At first I thought complex conditions would do the trick, but if I am reading the documentation correctly (videos as well), they combine with an AND statement rather than an OR statement.  Should I take a step back and try to approach this from an Application Template perspective and hope the template is out there?

        • Re: Nesting or Grouping Alerts Clarification
          tomiannelli

          Creating a Template for these would allow you to create a single alert definition for it. I have not created one with mixed types of items like you are looking at. Given what I have seen on Thwack and built-ins I am sure it is possible.

            • Re: Nesting or Grouping Alerts Clarification
              thwackdude

              tomiannelli, thank you for your response.  So are you thinking of something like building it as a component in a template and simply alerting on components rather than, say, volume?  This is also an option, but then I come up against the question of how to create a component script that monitors any fixed disk and a component that monitors server response.  Hmm.

            • Re: Nesting or Grouping Alerts Clarification
              mesverrum

              Why would you want to raise the same alert for a disk space issue as you would for a node down issue?  The troubleshooting process would be entirely different in each case and depending how your company slices things up it could be different teams handling the issue.  If you already have triggered this alert because a disk was nearly full you wouldn't get an additional message that the server crashed, so in theory someone might not be investigating because they think low free space isn't urgent, but node down should be treated urgently.

               

              I have a pretty standard set of alerts I build with most clients and we rarely need more than 35 or so actual alert definitions to cover all cases that matter across all modules and teams.

              The most common issue I find is when people don't make full use of custom properties and make their alert logic too specific.

              The most generic example would be doing something like having several flavors of node down alerts, a "network switch down" separate from "windows server down" and "linux server down" is usually not required.  If the alert action is basically the same, but the only difference is who the message goes to then you can use an OwnerEmail custom property to create logic like "if any node goes down then send an email to the owner of that node."

               

              People can also go pretty crazy on the SAM app monitor side as well with logic like "if such and such exchange component on this server goes down then email the exchange person" with 100 variations on what component and who to notify.  I tend to consolidate it down to a single alert condition.  If any application is down notify the address in the Application Owner custom property.

               

              So if your alert scheme is becoming overwhelming you might see where you can condense them down from granular logic to more generic situations through the use of custom properties or the built in threshold fields.

               

              How many alert rules do you have in place right now?  How many messages does your orion send out most days?  How big is the environment in terms of total nodes?

                • Re: Nesting or Grouping Alerts Clarification
                  thwackdude

                  mesverrum, thank you for your response.  I suppose I am just trying to untangle everything and get it all straight.  We have just north of 900 servers and 45+ different groups that are notified separately, as well as a tier 2 24/7 incident response center that currently gets any volume on any server over 98%, server down, and select services for mission critical servers.  I'm about an eighth of the way through and have 216 alerts currently on.  I have 3 sets of alerts configured for the working groups currently (Windows as an example):

                  1. Trigger volume alert if any get over 95% and send an email.

                  2. Trigger node down alert if a server node goes down and send an email.

                  3. Trigger alert when a component goes critical (say a Windows server service down) and send an email with the component variable that displays what happened.

                   

                  Based on what you said, do I have the component one wrong and it will only key once?  I was under the impression that it would key for any component that goes critical.  Ha ha, I feel like I was heading in the right direction but now I am questioning my very being.

                    • Re: Nesting or Grouping Alerts Clarification
                      mesverrum

                      I don't see why you need 216 alert definitions to cover the 3 cases.  Is the difference just that you create a different alert depending on who gets notified, but otherwise the alerts are the same?

                        • Re: Nesting or Grouping Alerts Clarification
                          mesverrum

                          Assuming that you have so many alerts because you need to control where the alerts are going, you should probably read this KB and see if it can be implemented:

                          https://support.solarwinds.com/SuccessCenter/s/article/Use-Custom-properties-when-sending-email-alerts

                            • Re: Nesting or Grouping Alerts Clarification
                              thwackdude

                              Right, thank you. I will read the article and see if it helps. I suppose I was hoping for alert logic to be able to say "if anything is wrong with the set of nodes (all volumes, components, or down)" and have it send an email indicating the issue that was presented by the database.  I suppose I could attempt what was talked about above, where I have a component monitor for the things I need to add.  Just getting wrapped around the axle trying to grab any drives or node overall status, as SolarWinds seems to treat those differently.  Thanks to you two for your help so far.

                            • Re: Nesting or Grouping Alerts Clarification
                              thwackdude

                              I apologize, I should have clarified.  An example would be the SQL group gets one set of alerts, Linux gets another set, and so on, application by application or cluster by cluster depending on scope.

                                • Re: Nesting or Grouping Alerts Clarification
                                  tomiannelli

                                  I thought you were trying to solve a problem of standardizing a collection of alerts so that you could easily deploy a collection against a group of servers. mesverrum makes a good point in using the custom property to send the emails where they need to go. That has been very useful for us.

                                    • Re: Nesting or Grouping Alerts Clarification
                                      thwackdude

                                      Apologies.  Essentially, what I am hung up on is that I can create an alert for components, another for status, and another for volumes, but creating 3 alerts for each group of admins is going to get messy, which I have accepted at this point.  I believe I need to figure out how to turn any volume into a template component and status into a component and place them in a template; that way I can say trigger an alert if a component goes down or has a certain value, and have one alert.  The only issue I can find with custom properties at this time is multiple custom properties for each node (I believe it's a feature request).  The Linux group is interested in the OS and hardware health, whereas the Oracle group is interested in the same server but with the added caveat of Oracle monitoring.

                                       

                                      What I was hoping was possible:

                                      Alert on Volume

                                           Scope: node=server or node type contains Linux or redhat

                                           Condition: volume % available <= 10% and volume type = fixed disk

                                       

                                      OR

                                       

                                      Alert on Node

                                           Scope: node=server or node type contains Linux or redhat

                                           Condition: Node status = down

                                       

                                      Alert Actions:  Send email with issue to owner.

                                        • Re: Nesting or Grouping Alerts Clarification
                                          mesverrum

                                          The alert system is intrinsically tied to the object type you are alerting on; the GUI will not really let you cross from one object type to another in a single alert.  You could do that in SQL, but it's clunky and unnecessarily complex to even bother.

                                           

                                          For the situation like you are describing I typically do this:

                                           

                                          Assuming Nodes have a custom property called OwnerEmail

                                           

                                          *not filtering  against vendor or team or anything else, basically any time a node goes down this one alert covers it, you might come up with a little more filtering to exclude from alerting on test or dev or whatever but the goal is to avoid making things overly granular.  If it needs to be alerted at all the one rule should cover it for ALL cases.

                                           

                                          Trigger action

                                          Just populate the OwnerEmail custom property for all the nodes based on whichever teams care to know when a given node has problems. If multiple teams need to know, you can separate the addresses with semicolons and notify everyone you want.  Alternatively, some companies set up several custom properties like "systemowner" and "applicationowner", cc the app owner on system down messages, and leave the system owner off things like CPU alerts in cases where the system team doesn't care about that kind of thing.  You can tweak this kind of structure for whatever your use case is.

                                          In any case leveraging the properties like this you can have one single node down alert that covers all types of hardware and all teams.  Absolutely no need to build a separate alert on the same objects for different teams.
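
As a rough illustration of the routing described above, here is a minimal Python sketch. It is not SolarWinds code and does not call any Orion API; the `Node` class, its field names, and the `dba@mycompany.com` and `oracle@mycompany.com` addresses are hypothetical stand-ins for a node with an OwnerEmail property and an "applicationowner" property.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    status: str                   # "Up" or "Down"
    owner_email: str = ""         # stand-in for the OwnerEmail custom property
    application_owner: str = ""   # stand-in for an "applicationowner" property


def split_addresses(value: str) -> list[str]:
    """Split a semicolon-separated property value into clean addresses."""
    return [part.strip() for part in value.split(";") if part.strip()]


def node_down_recipients(node: Node) -> dict[str, list[str]]:
    """Return To/Cc lists for one generic node-down alert."""
    if node.status != "Down":
        return {"to": [], "cc": []}
    return {
        "to": split_addresses(node.owner_email),
        "cc": split_addresses(node.application_owner),
    }


# Hypothetical example node: two owning teams plus an application owner on cc.
db01 = Node(
    name="db01",
    status="Down",
    owner_email="linuxadmins@mycompany.com; dba@mycompany.com",
    application_owner="oracle@mycompany.com",
)
print(node_down_recipients(db01))
```

The point of the sketch is that the rule itself never mentions a team: who gets the message is data on the node, so adding a team means editing a property value rather than cloning an alert.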

                                           

                                           

                                          You can then apply the same kind of technique to the application template alerts by adding an OwnerEmail custom property to the SAM application monitors as well.  Just create a single "Component down" alert, and if any component/service/script/log monitor is down then it just emails whoever the owner is.  One rule covers all teams, all apps.

                                           

                                          For volumes, you just set it so that any volume that breaches its space-used threshold alerts the node's OwnerEmail.

                                           

                                          So in this way you have just 3 alerts replacing what sounds like dozens of copies of what you have in place right now.
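
To make the "3 alerts" idea concrete, the following Python sketch models three generic rules, each tied to exactly one object type, with the recipient always read from an owner field on the object. Everything here is an assumption for illustration: the dictionary keys, the rule names, and the choice of `>=` for "breaches its threshold" are not taken from SolarWinds.

```python
from typing import Callable, NamedTuple


class AlertRule(NamedTuple):
    name: str
    object_type: str                   # each rule targets exactly one object type
    condition: Callable[[dict], bool]


# Three generic rules; none of them mentions a team or a vendor.
RULES = [
    AlertRule("Node down", "node", lambda o: o["status"] == "Down"),
    AlertRule("Component down", "component", lambda o: o["status"] == "Down"),
    AlertRule(
        "Volume over threshold",
        "volume",
        lambda o: o["percent_used"] >= o["threshold"],  # ">=" is an assumption
    ),
]


def evaluate(objects: list[dict]) -> list[tuple[str, str, str]]:
    """Return (rule name, object name, owner email) for every triggered object."""
    triggered = []
    for obj in objects:
        for rule in RULES:
            if rule.object_type == obj["type"] and rule.condition(obj):
                triggered.append((rule.name, obj["name"], obj["owner_email"]))
    return triggered
```

Because the owner lives on the object, a fourth team or a new server type adds rows of data, not a fourth copy of each rule.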

                                          Getting a little further into the rabbit hole you can automate the process of filling in the owner custom property by having alert rules that have logic like

                                          "if a node has a vendor of linux and the owneremail property is null have an alert action to set the owneremail custom property to be linuxadmins@mycompany.com"
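
The backfill logic quoted above could be sketched in Python as follows. This only models the decision ("vendor matches and owner is empty, so set a default"); in the product it would be an alert action, not a script. The dictionary keys and the `DEFAULT_OWNER_BY_VENDOR` table are hypothetical, and the vendor comparison is assumed to be case-insensitive.

```python
# Default owner per vendor; the Linux address comes from the example above.
DEFAULT_OWNER_BY_VENDOR = {
    "linux": "linuxadmins@mycompany.com",
}


def backfill_owner(nodes: list[dict]) -> list[str]:
    """Fill an empty owner_email from the vendor default; return changed node names."""
    changed = []
    for node in nodes:
        if node.get("owner_email"):
            continue  # never overwrite an owner that is already set
        default = DEFAULT_OWNER_BY_VENDOR.get(str(node.get("vendor", "")).lower())
        if default:
            node["owner_email"] = default
            changed.append(node["name"])
    return changed
```

Nodes whose vendor has no default are left empty on purpose, so they remain easy to find and assign by hand.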

                                           

                                            • Re: Nesting or Grouping Alerts Clarification
                                              thwackdude

                                              I want to make sure I understand what you are saying.  So in this instance, I create an alert for when a node goes down, I set the action to email the custom property, and if a SharePoint server (with custom property of SharePointOwners) goes down it will email a SharePoint admin, and if a Linux server (custom property of LinuxOwners) goes down it will email a Linux admin; thus, this one alert will route downed nodes to their respective owners?

                                                • Re: Nesting or Grouping Alerts Clarification
                                                  mesverrum

                                                  Sounds like you've got the idea, although I would avoid being as granular as sharepointowners and linuxowners; that is very likely to require you to create dozens of properties.  In most cases just "owners" works, but depending on how you split your duties you might have applicationowners who want to know if the node is down and need to know about things like CPU or memory utilization, application health, and disk space, while there is often a system team who doesn't normally respond to issues about some or all of those metrics and only cares when the system is hard down or has OS-related issues.  You would know the best way to slice it up, but once again you want to keep the property generic enough that a single property covers lots of cases, instead of creating 60 different properties where any given server only ever uses one or two of them at a time.