Version 1

    This solution provides the following functionality enhancements to the Solarwinds Orion monitoring system:

    • If a core switch fails, instead of getting hundreds of Node Down alerts we get one email with all of the alerts in it.
    • You can acknowledge all of the alerts by replying to the email with "ack" in the message body.
    • People can email the monitoring system to subscribe or unsubscribe to Solarwinds alerts.
    • Email alerts for Node Down are sent with Importance High, which generates a Level 1 alert and thus an audible alarm on the team's Blackberries.
    • After hours, the first Node Down alert (Level 1 and thus audible) goes to the on-call person. If it is not acknowledged within 15 minutes, the alert is sent to the entire DC Ops team; if no one acknowledges it within another 15 minutes, the managers are added to the alert.
    • The on-call person rotates every Monday at 8:00AM automatically.
    • A summary email, without the Importance High/Level 1 flag, is sent each hour to the entire DC Ops team (including the managers) showing all alerts generated that hour.

    The way I accomplished this was to send all Solarwinds alerts to Splunk as syslog messages, then run a Python script on the Splunk host every five minutes to check for new Solarwinds alerts.  Three Python scripts and an IMAP plugin to Splunk for grabbing the emails provide all of the functionality.

    Splunk is a log and message data indexing tool that has proven quite handy for us.  They offer a free download, and if all I were doing was managing my Solarwinds alerts I would not generate enough traffic to exceed the free license.  You can download Splunk at http://www.splunk.com

    For our solution we created a mailbox on our Exchange server for Splunk to accept the reply emails.  The IMAP email reader is a free plugin, downloadable from inside the product.

    I am using version 3.4.6 of Splunk and version 9.0 of Orion.  You will have to modify the code below with your own values.  I just wrote it and haven't cleaned it up yet, so please excuse the messiness of the code, the lack of a settings section, and the lack of comments.


    The files that make up this solution on the Splunk host are as follows:

    /root/alertmon.py   - the main alerting script run every 5 minutes

    /root/ack.py   - script that acknowledges alerts on Solarwinds

    /root/amonhourly.py    - generates hourly summary of active alerts

    /root/amonsubscribers.txt    - list of subscribers to the Solarwinds alerts - users are added by emailing splunk

    /root/amononcall.txt    - on-call schedule - one email address per line (no delimiters, just a list) on-call person identified by * before email address

    /root/amonescalation.txt   - On-call escalation list (List of manager emails)

    /etc/crontab   - Runs the files in cron.five and cron.hourly

    /etc/cron.five/alertmon.sh    - Runs the alertmon script every five minutes

    /etc/cron.hourly/amonhourly.sh   - Runs the hourly summary script
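    As a rough sketch, the cron plumbing for the five-minute job might look like the following. The interpreter path, log path, and the run-parts convention are assumptions; adjust them for your distribution.

```shell
#!/bin/sh
# /etc/cron.five/alertmon.sh -- thin wrapper cron runs every five minutes.
# Matching /etc/crontab entries (sketch; assumes a run-parts style setup):
#   */5 * * * * root run-parts /etc/cron.five
#   01  * * * * root run-parts /etc/cron.hourly
/usr/bin/python /root/alertmon.py >> /var/log/alertmon.log 2>&1
```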


    Solarwinds Configuration

    Alerts are sent to Splunk from Solarwinds by configuring an Advanced Alert for Node Down and sending a syslog message to Splunk.

     

    On Trigger Actions, the syslog message body looks like this:

    Severity=1 | AlertGUID=${AlertDefID} | NodeID=${NodeID} | Alert=${Alert} | Nodename=${NodeName} | IP=${Node.IP_Address} | Ack=${Acknowledged} | MonitoringAlert
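    A sketch of how a script on the Splunk side could break this pipe-delimited body back into fields. The field names come from the template above; the function and sample values are illustrative.

```python
# Sketch: parse "key=value | key=value | ... | MonitoringAlert" into a dict.
def parse_alert(line):
    """Split a pipe-delimited Solarwinds syslog body into key/value pairs."""
    fields = {}
    for part in line.split("|"):
        part = part.strip()
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields

sample = ("Severity=1 | AlertGUID=42 | NodeID=17 | Alert=Node Down | "
          "Nodename=core-sw1 | IP=10.0.0.1 | Ack=Not Acknowledged | MonitoringAlert")
alert = parse_alert(sample)
print(alert["Nodename"], alert["Ack"])  # core-sw1 Not Acknowledged
```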

     

    On the Alert Escalation tab, the checkbox "Execute this alert repeatedly while the alert is triggered" is checked, with the repeat interval set to 15 minutes.

     

    The Reset Actions has a syslog message body of:

    RESET: AlertGUID=${AlertDefID} | NodeID=${NodeID} | Alert=${Alert} | Nodename=${NodeName} | Ack=${Acknowledged} | MonitoringReset


    Alert Flow

    When an alert is generated in Solarwinds, a syslog message is sent to Splunk with the alert data in the message body.  The Node Down alert is the primary alert sent through this process, but other alerts can be sent to Splunk as well.

     

    Note: The code currently deduplicates on any NodeID/Nodename found, so this system may have to be changed if it is to handle multiple alerts of different kinds.

     

    Every five minutes a cron job on the Splunk host runs the alertmon.py script which queries Splunk for "MonitoringAlert".  This is the search performed by the script to pick up new alerts:

    search host=solarwinds monitoringalert Ack="Not Acknowledged" startminutesago=5 | dedup NodeID | sort "Severity", "AlertGUID", "NodeID"
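    One way alertmon.py could hand that search to Splunk is via the `splunk search` CLI. The binary path and `-auth` credentials below are assumptions; check your own 3.4.x install before relying on the exact invocation.

```python
# Sketch: build the Splunk CLI command the five-minute script might run.
import subprocess

SEARCH = ('search host=solarwinds monitoringalert Ack="Not Acknowledged" '
          'startminutesago=5 | dedup NodeID | sort "Severity", "AlertGUID", "NodeID"')

def build_search_cmd(query, auth="admin:changeme"):
    """Return the argv list for the Splunk CLI (paths/credentials assumed)."""
    return ["/opt/splunk/bin/splunk", "search", query, "-auth", auth]

cmd = build_search_cmd(SEARCH)
# subprocess.check_output(cmd)  # uncomment on a host with Splunk installed
print(cmd[0], cmd[1])
```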

     

    During the hours of 8:00AM to 6:00PM M-F, if one or more alerts are found, an email is sent to the subscribers list.

     

    After hours, a message is sent to just the on-call person (amononcall.txt).  If it is not acknowledged within 15 minutes, it goes to the entire subscribers list; managers are added if it is still not acknowledged 15 minutes after that.  The escalation level is tracked by placing an "escalation_level=3, 2 or 1" tag in the message body and including splunk in the "To" list.  Escalation level 1 is the highest level.
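    One plausible reading of the after-hours timetable above, as code. The 15/30 minute thresholds come from the text; mapping them onto the escalation_level values (3 = on-call only, 2 = whole team, 1 = managers added) is my assumption.

```python
# Sketch: map minutes-unacknowledged to the escalation_level tag.
def escalation_level(minutes_unacked):
    """Return the after-hours escalation level for an unacked alert."""
    if minutes_unacked < 15:
        return 3   # on-call person only
    elif minutes_unacked < 30:
        return 2   # entire subscribers list
    return 1       # managers added (highest level)

for m in (5, 20, 40):
    print(m, "->", escalation_level(m))
```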

     

    Every time the alertmon.py script runs (every 5 minutes), it also checks for emails sent to splunk.  For after-hours messages it checks the escalation level, and it checks for alert acknowledgements.

     

    If an email sent back to splunk has "ack" as the first three letters of any line of text, the script parses the message one line at a time, looking for the alert GUID and then the subsequent NodeID entries.  Using the functions in the ack.py script, Splunk sends an acknowledgement for each alert via an HTTP request to Solarwinds (providing the AlertGUID and NodeID).
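    A sketch of that parsing step. The exact line layout of the reply email is an assumption (here, AlertGUID= and NodeID= each on their own line, as they would appear in a quoted alert); the actual HTTP acknowledgement lives in ack.py and is not shown.

```python
# Sketch: pull AlertGUID/NodeID pairs out of an "ack" reply email.
def parse_ack_email(body):
    """Return (guid, node_ids) if any line starts with 'ack', else None."""
    lines = body.splitlines()
    if not any(l.strip().lower().startswith("ack") for l in lines):
        return None
    guid, node_ids = None, []
    for line in lines:
        line = line.strip()
        if line.startswith("AlertGUID="):
            guid = line.split("=", 1)[1].split("|")[0].strip()
        elif line.startswith("NodeID="):
            node_ids.append(line.split("=", 1)[1].split("|")[0].strip())
    return guid, node_ids

body = "ack\nAlertGUID=42\nNodeID=17\nNodeID=23\n"
print(parse_ack_email(body))  # ('42', ['17', '23'])
```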

     

    The last thing the script does is send an email to the recipients list if any "Not Acknowledged" alerts were found in the last five minutes.

     

    Every hour a separate script called amonhourly.py sends an email to everyone on the subscribers (amonsubscribers.txt) and escalation (amonescalation.txt) lists with a summary of all alerts triggered and reset in the last hour.  This email is not flagged as Level 1/Importance High.

     

    Instructions to End Users

     

    Users can subscribe or unsubscribe to the Solarwinds monitoring alerts by sending an email to splunk@mydomain.com with the following in the Subject:

    To subscribe:  "sub SM"

    To unsubscribe:  "unsub SM"
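    The subscription handling might be as simple as the sketch below: match the subject, then add or remove the sender from the amonsubscribers.txt list. The matching rules and function are illustrative, not the script's actual code.

```python
# Sketch: apply a "sub SM" / "unsub SM" subject line to the subscriber list.
def update_subscribers(subscribers, sender, subject):
    """Return a new subscriber list after applying a sub/unsub subject."""
    subject = subject.strip().lower()
    subs = list(subscribers)
    if subject.startswith("unsub sm"):
        subs = [s for s in subs if s != sender]
    elif subject.startswith("sub sm"):
        if sender not in subs:
            subs.append(sender)
    return subs

subs = ["alice@mydomain.com"]
subs = update_subscribers(subs, "bob@mydomain.com", "sub SM")
subs = update_subscribers(subs, "alice@mydomain.com", "unsub SM")
print(subs)  # ['bob@mydomain.com']
```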

     

    You can acknowledge alerts by replying to splunk with "ack" (upper or lower case) in the body.

     

    After hours, the on-call person gets the first message.  If it is not ack'd within 15 minutes, subsequent messages go to everyone; after 30 minutes the managers are added to the alerts.  The on-call notification rotates to the next person at 8:00AM every Monday.
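    Given the amononcall.txt format described earlier (one address per line, current on-call marked with a leading *), the Monday rotation could look like this sketch; the function name and in-memory handling are assumptions.

```python
# Sketch: move the '*' marker in amononcall.txt to the next person, wrapping.
def rotate_oncall(lines):
    """Return the on-call list with '*' advanced to the next entry."""
    people = [l.lstrip("*").strip() for l in lines]
    current = next(i for i, l in enumerate(lines) if l.startswith("*"))
    nxt = (current + 1) % len(people)
    return [("*" + p) if i == nxt else p for i, p in enumerate(people)]

print(rotate_oncall(["*alice@mydomain.com", "bob@mydomain.com"]))
# ['alice@mydomain.com', '*bob@mydomain.com']
```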

     

    You can set up your BB to beep on Severity 1 messages by doing two steps:

    1. Set up your profile to alert for Level 1 messages inside your BB
    2. Using the BB Desktop Manager, create a filter for Important messages
      1. Click on Email Settings/Redirector Settings (depending on version of BB Desktop)
      2. Go to the Filters tab
      3. Create a new filter
        1. Call the filter Important or Severity 1
        2. Check the Importance box and set it to High
        3. Check the box for Forwarding as Level 1
      4. Save your new filter

    NOTE: Consider also filtering on emails from splunk@mydomain.com if you don't want all emails set with Importance High to generate an alert.