4 Replies Latest reply on May 10, 2016 11:15 AM by akhasheni

    Alerting: Theories & Expectations


      I wanted to open a discussion around alerting with all of you who have a hand in it - which is pretty much all of us.


      I'm running into some perceived walls; I have a pretty firm expectation that alert = action.

      So if you "need" an alert for something, it signifies that someone needs to do something. Even if they don't, someone is darn sure going to acknowledge that alert and, if we're being collaborative, add a quick reference note to it.


      Meanwhile, I'm currently working with some new individuals and teams who are very firmly of the mindset that an alert = reminder, or is strictly informational, simply because the tool can do that.


      I'm interested in hearing your stance on, or ideas about:


      • What do you do with alerts, and how?
        • Are you categorizing them?
        • Are they just informational emails?
        • Are they going in your NetPerfMon Log? (Please, sweet Gods of Network Monitoring, tell me you do this.)
        • Do they have actions that "fix" or start the "fixing" of a known bad condition?
      • Who gets alerts?
        • Internal customers?
        • External customers?
        • How are those different, both in what they look like and in the expectation of who's doing what when they receive them?
      • What is the purpose of an "alert" as it is today?
        • Notification of an undesirable condition on an element, component, or node?
        • Is action always required or are Information-Only alerts ok?
        • Actions that resolve an issue?
      • How do we satisfy these expectations of seemingly passive information sharing?
        • Aggressive scheduled reporting?
        • Telling them to log in and look?
        • Mobile Admin Push Notifications?
        • Some kind of integration with your service ticketing/work-order management systems for "follow-up" items?


      I'm interested to hear what you geeks have to say about it!


        • Re: Alerting: Theories & Expectations

          Alerts started as notifications that were displayed on dashboards and emailed to the team that needs to fix the problem, with an incident ticket created for each event. They were categorized to identify the first responder and the applications impacted. They do go to our NetPerfMon Log. Some do have "clean up" or "fix" actions, but even when that task is triggered, the notification of successful remediation still needs to be verified.


          We don’t have much going to external customers; typically the internal system owner is responsible for communicating any outages they may have experienced. That said, stakeholders can view dashboards if they wish, and those may have alert data on them. Internal customer communication falls into 2 groups. The first goes to on-call devices and our NOC team as no-frills, text-only notices (the NOC should have Orion up, and phones do more with less). The second goes to a system analyst and includes HTML formatting, links and, more important every day, branding and signatures that identify it as an official IS communication. When people get it, they know that it is from us and not phishing. (Targeted phishing is no longer a thing other people deal with; we see it too frequently.)


          These alerts are for communicating outages that have happened, or are about to. When in doubt, our NOC will get it and triage next steps and priority. In this use case information is not enough; it has to be important enough to get people out of bed at 3 am. When it fires, either the alert needs to be adjusted or a system gets fixed.


          I said alerts "started as" all of the above because passive information sharing still needed to be improved. I still call everything above this paragraph Alerts, and nothing below. That said, as of 2 months ago I use the alert engine to fire off informational posts to Slack, following Leon’s example (The Incomplete Guide to Integrating SolarWinds Orion and Slack). To keep these out of the way, they are separate from the true alerts, and I have a script that automatically acknowledges them. The posts to Slack are sorted and sent to all interested teams, and each person can have Slack notify them as they see fit: on the computer, the phone, or both. We throw lots of data here, from more systems all the time. It’s starting to catch on like wildfire.
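
          As a rough illustration of the informational path described above, here is a minimal Python sketch that formats an alert as a Slack message and posts it to an incoming webhook. This is not the method from Leon's guide. The function names, the message layout, and the webhook URL are hypothetical placeholders; the only Slack behavior assumed is that an incoming webhook accepts a JSON body with a "text" field.

```python
import json
import urllib.request


def build_slack_payload(node, message, severity="info"):
    """Build the JSON body for a Slack incoming webhook (hypothetical layout)."""
    # Prefix the text so informational posts are easy to tell apart from true alerts.
    return {"text": f"[{severity.upper()}] {node}: {message}"}


def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    payload = build_slack_payload("server-xxx", "backup job finished")
    print(json.dumps(payload))
    # To actually send, pass your own webhook URL (placeholder, an assumption):
    # post_to_slack("https://hooks.slack.com/services/<your-webhook-path>", payload)
```

          Keeping the payload builder separate from the sender means the message format can be checked without touching the network.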

          • Re: Alerting: Theories & Expectations

            An alert, to me, is a status report, usually of a change (in a condition) or an affirmation (of an existing condition).


            Some alerts are purely informational, some require an action. Sometimes the action can be devised from the alert alone, sometimes only from the context. However much I'd like to add a specific action or even a resolution (if the alert is about an issue), it just isn't practical, as no single rule can address all possible conditions in anything but a completely closed system (an analogy to Gödel's incompleteness theorems rather than a strict application of them).


            Examples of alerts that may not need an action:

            • "Good morning. This is a routine 8am alert to let you know that your Solarwinds server is alive, breathing, and is able to send out alerts. No action is necessary."
            • "Average CPU load on server XXX has been over 90% for the past 2 hours. Address if this is considered abnormal - e.g. a transcode process is stuck."
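
            The second example above depends on a condition holding for a sustained period rather than at a single poll. A minimal Python sketch of that kind of check follows, assuming samples arrive as (timestamp, value) pairs; the function name and sample layout are hypothetical and do not reflect how any particular monitoring product evaluates its triggers.

```python
from datetime import datetime, timedelta


def sustained_over(samples, threshold, window, now):
    """Return True if every sample in the last `window` exceeds `threshold`.

    Also requires at least one sample at or before the window start, so a
    freshly monitored node with only a few minutes of data never qualifies.
    """
    start = now - window
    has_full_history = any(t <= start for t, _ in samples)
    recent = [v for t, v in samples if start <= t <= now]
    return has_full_history and bool(recent) and all(v > threshold for v in recent)


if __name__ == "__main__":
    now = datetime(2016, 5, 10, 8, 0)
    # One sample every 10 minutes, reaching back exactly 2 hours.
    busy = [(now - timedelta(minutes=10 * i), 95.0) for i in range(13)]
    print(sustained_over(busy, 90.0, timedelta(hours=2), now))
```

            Whether a brief dip below the threshold should reset the clock is a policy choice; this sketch treats any dip as a reset.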


            The bottom line, however, is that an alert is a status report: nothing more, nothing less. One can attempt, and perhaps succeed in, pairing an alert with an action, an assignment to a responsible party, a resolution, etc., yet it all depends on the context.


            Just my 2 kopecks...

              • Re: Alerting: Theories & Expectations

                I understand where you are coming from, but to wrap my mind around this: what's your environment like? How many nodes and elements are you watching? Do the alerts get emailed, or just end up in an actual report? Do you have someone reacting to each alert appropriately? Are they routed to the correct people, or broadcast? Also, are you a 24/7 business? How many people work with Solarwinds information?


                In my world the default state is that everything is running correctly, and we have the correct capacity to keep it that way for the foreseeable future. If something told me that an app, Solarwinds or otherwise, was working correctly, it wouldn't be a big deal, but I'd know that already, since I find out when it's not. (We have a small SCOM instance which alerts us on Solarwinds to be safe.) I can deal with a notice confirming what should be working, but I can't do it for every system. If every system sent out a status report at 8 am, all I would do all day is review them to find what's missing.


                With your example of high CPU load, you say that it may not need an action; I disagree. Even if it's "ok", the alert forces you to evaluate the situation. Is that transcode process stuck? That's a great thing to check, but if you didn't have to verify it, you wouldn't send the alert. In a perfect world you detect the possible fault, automate the response to it, validate that the concern is mitigated, and then send a notice to whatever system tracks issues. I wish I lived in a world like that more than I do.
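
                The "perfect world" sequence above (detect, respond, validate, notify) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the three callables are hypothetical stand-ins for whatever health check, fix action, and ticketing hook an environment actually has.

```python
def handle_fault(is_healthy, remediate, notify):
    """Run detect -> remediate -> validate -> notify and return the outcome."""
    if is_healthy():
        return "healthy"  # nothing detected, nothing to report
    remediate()  # automated response to the known bad condition
    if is_healthy():  # validate that the concern is actually mitigated
        notify("fault detected and remediated automatically")
        return "remediated"
    notify("automated remediation failed; a human needs to look")
    return "escalated"


if __name__ == "__main__":
    state = {"stuck": True}  # pretend a transcode process is stuck
    notices = []

    def clear_stuck():
        state["stuck"] = False

    outcome = handle_fault(
        is_healthy=lambda: not state["stuck"],
        remediate=clear_stuck,
        notify=notices.append,
    )
    print(outcome, notices)
```

                The second health check is the step that is easiest to skip: without it, a "fix" action that silently fails looks the same as one that worked.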


                I completely agree that the alerting engine can provide information that has value outside of "this is broken, go fix it". I think the struggle is what to pull from Solarwinds, and where that data should go. Every IT department has to handle this in a way that works for them; I just find value in hearing what others do.

                  • Re: Alerting: Theories & Expectations

                    How many nodes and elements are you watching? Do the alerts get emailed, or just end up in an actual report? Do you have someone reacting to each alert appropriately? Are they routed to the correct people, or broadcast? Also, are you a 24/7 business? How many people work with Solarwinds information?

                    250 nodes, 500 volumes, 700+ interfaces, 500+ component monitors. Alerts get emailed; some get posted to Slack. Yes, we have eyes on incoming email alerts and on the Solarwinds home page. It's a team of several sysadmins working multiple shifts. Not yet fully 24/7 coverage, but close. Several people work with Solarwinds information.


                    I understand where you're coming from, too. Yes, the idea is to automate ourselves out of the job while dramatically improving efficiency. To do that successfully, one must model the environment: write up the system's expected behavior, what constitutes a departure from it, the common ways to bring it back into compliance, and the adjustments that avoid similar issues for good. Modelling is really only possible for things with a certain level of predictability, so this, again, is where the (sometimes unpredictable) context steps in and screws some things up, and where we need humans to make (at times unpredictable) decisions.


                    The way we model it: define what's critical and what isn't*, separate systems and applications into areas of responsibility*, design alerts so it's clear how critical they are* and which team should address them*, train people on how to handle common incidents*, and write up SOPs*. We're still in the early stages of making it truly work, but we're making progress. Understanding the big picture (what my company, team, and department are each supposed to do, and what constitutes efficient operation for each) is crucial to defining how our monitoring and alerting system should operate.


                    (*) Where possible.
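
                    To make the modelling steps above a little more concrete, here is a minimal Python sketch of routing by criticality and area of responsibility. It is not our actual configuration: the system names, team names, channels, and the `route_alert` function are all hypothetical, and the default team stands in for the "where possible" cases that have not been modelled yet.

```python
# Hypothetical ownership map: system -> (responsible team, is_critical).
OWNERSHIP = {
    "core-router": ("network", True),
    "transcode-farm": ("media", False),
}

# Catch-all owner for systems nobody has modelled yet (an assumption).
DEFAULT_TEAM = "sysadmin"


def route_alert(system, ownership=OWNERSHIP):
    """Return (team, channel) for an alert raised on the given system."""
    team, critical = ownership.get(system, (DEFAULT_TEAM, False))
    # Critical systems page the on-call person; everything else goes to email.
    channel = "page" if critical else "email"
    return team, channel


if __name__ == "__main__":
    for name in ("core-router", "transcode-farm", "unknown-box"):
        print(name, route_alert(name))
```

                    Keeping the map as plain data means the write-up of "what's critical" and "who owns it" lives in one place that both people and the alerting logic can read.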


                    Performance tuning, capacity planning, diagnostics and analytics are fairly important things we get out of Solarwinds (or any other monitoring systems we'd employ) - besides alerting.


                    Thanks for asking all these questions - they help me understand what we (need to) do better.

