
Acknowledge Solarwinds Orion alerts by email (using Splunk) - revised

UPDATED WITH CORRECTIONS TO SCRIPTS
(I broke the scripts preparing them for upload.  This version corrects the broken scripts)

This solution provides the following functionality enhancements to the Solarwinds Orion monitoring system:

  • If a core switch fails, instead of getting hundreds of Node Down alerts, we get one email containing all of the alerts.
  • You can acknowledge all of the alerts by replying to the email with "ack" in the message body.
  • People can email the monitoring system to subscribe or unsubscribe to Solarwinds alerts.
  • Email alerts for Node Down are sent with Importance High, which generates a Level 1 alert and thus an audible alarm on the team's Blackberries.
  • After hours, the first Node Down alert (Level 1 and thus audible) goes to the on-call person. If it is not acknowledged within 15 minutes, the alert is sent to the entire DC Ops team, and if no one acknowledges it within another 15 minutes, the managers are added to the alert.
  • The on-call person rotates automatically every Monday at 8:00 AM.
  • A summary email showing all alerts generated that hour is sent each hour to the entire DC Ops team, including the managers, without the Importance High/Level 1 flag.

The way I accomplished this was to send all Solarwinds alerts to Splunk as syslog messages and then run a Python script on the Splunk host every five minutes to check for new Solarwinds alerts.  Three Python scripts and an IMAP plugin to Splunk for grabbing the emails provide all of the functionality.

Splunk is a log and message data indexing tool that has proven quite handy to us.  They offer a free download; if all I were doing was managing my Solarwinds alerts, I would not generate enough traffic to exceed the free license.  You can download Splunk at http://www.splunk.com

For our solution we created a mailbox on our Exchange server for Splunk to accept the reply emails.  The IMAP email reader is a free download/plugin available from inside the product.

I am using Splunk version 3.4.6 running on Red Hat Enterprise Linux 4 and Orion version 9.0.  You will have to modify the code below appropriately with your own values.  I just wrote it and haven't cleaned it up, so excuse the messiness of the code, the lack of a settings section, and the lack of comments.

 

The  ackByEmail.tgz file should be copied to the Splunk server.

The files that make up this solution on the Splunk host are as follows:

/root/alertmon.py   - the main alerting script run every 5 minutes

/root/ack.py   - script that acknowledges alerts on Solarwinds

/root/amonhourly.py    - generates hourly summary of active alerts

/root/amonsubscribers.txt    - list of subscribers to the Solarwinds alerts - users are added by emailing splunk

/root/amononcall.txt    - on-call schedule - one email address per line (no delimiters, just a list); the current on-call person is marked with a * before their email address (see the sketch after this file list)

/root/amonescalation.txt   - On-call escalation list (List of manager emails)

/etc/crontab   - Runs the files in cron.five and cron.hourly

/etc/cron.five/alertmon.sh    - Runs the alertmon script every five minutes

/etc/cron.hourly/amonhourly.sh   - Runs the hourly summary script
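
The subscriber, on-call, and escalation files are plain text with one email address per line. As a rough illustration of the amononcall.txt format and the Monday-morning rotation, here is a minimal Python sketch; the function names and rotation mechanics are my own assumptions, not code from the tarball.

ONCALL_FILE = '/root/amononcall.txt'

def read_oncall(path=ONCALL_FILE):
    """Return (current on-call address, list of all addresses).
    One address per line; the on-call person has a leading *."""
    addresses = []
    current = None
    for line in open(path):
        addr = line.strip()
        if not addr:
            continue
        if addr.startswith('*'):
            addr = addr.lstrip('*').strip()
            current = addr
        addresses.append(addr)
    return current, addresses

def rotate_oncall(path=ONCALL_FILE):
    """Move the * marker to the next address, wrapping at the end of the list.
    Something like this could run from cron at 8:00 AM each Monday."""
    current, addresses = read_oncall(path)
    if not addresses:
        return
    if current in addresses:
        next_idx = (addresses.index(current) + 1) % len(addresses)
    else:
        next_idx = 0
    f = open(path, 'w')
    for i in range(len(addresses)):
        if i == next_idx:
            f.write('*' + addresses[i] + '\n')
        else:
            f.write(addresses[i] + '\n')
    f.close()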

 

 

Solarwinds Configuration

Alerts are sent to Splunk from Solarwinds by configuring an Advanced Alert for Node Down and sending a syslog message to Splunk.

 

On Trigger Actions, the syslog message body looks like this:

Severity=1 | AlertGUID=${AlertDefID} | NodeID=${NodeID} | Alert=${Alert} | Nodename=${NodeName} | IP=${Node.IP_Address} | Ack=${Acknowledged} | MonitoringAlert

 

On the Alert Escalation tab, the "Execute this alert repeatedly while the alert is triggered" checkbox is checked, with it set to repeat every 15 minutes.

 

The Reset Actions tab has a syslog message body of:

RESET: AlertGUID=${AlertDefID} | NodeID=${NodeID} | Alert=${Alert} | Nodename=${NodeName} | Ack=${Acknowledged} | MonitoringReset
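
Downstream, those pipe-delimited bodies are straightforward to split back into fields. The real field handling lives in the scripts in the tarball; the helper below is only an illustrative sketch of turning such a message body into a dictionary.

def parse_alert_body(body):
    """Split a pipe-delimited Orion syslog body (like the trigger and reset
    messages above) into a dict of field name -> value.  Bare tokens such as
    MonitoringAlert or MonitoringReset are collected under 'tags'."""
    fields = {'tags': []}
    body = body.strip()
    if body.startswith('RESET:'):
        fields['tags'].append('RESET')
        body = body[len('RESET:'):]
    for chunk in body.split('|'):
        chunk = chunk.strip()
        if '=' in chunk:
            key, value = chunk.split('=', 1)
            fields[key.strip()] = value.strip()
        elif chunk:
            fields['tags'].append(chunk)
    return fields

# Example:
# parse_alert_body('Severity=1 | AlertGUID=abc | NodeID=42 | Alert=Node Down | '
#                  'Nodename=core-sw1 | IP=10.1.1.1 | Ack=Not Acknowledged | '
#                  'MonitoringAlert')
# -> {'Severity': '1', 'AlertGUID': 'abc', 'NodeID': '42', ...,
#     'tags': ['MonitoringAlert']}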

 

 

Alert Flow

When an alert is generated in Solarwinds, a syslog message is sent to Splunk with the alert data in the message body.  The Node Down alert is the primary alert sent through this process, but other alerts can be sent to Splunk as well.

 

Note: The code currently deduplicates any NodeID/Nodename found so this system may have to be changed if it is to handle multiple alerts and different kinds of alerts.

 

Every five minutes a cron job on the Splunk host runs the alertmon.py script which queries Splunk for "MonitoringAlert".  This is the search performed by the script to pick up new alerts:

search host=solarwinds monitoringalert Ack="Not Acknowledged" startminutesago=5 | dedup NodeID| sort "Severity", "AlertGUID", "NodeID"
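
I won't claim this is how alertmon.py retrieves its results, but one way a script on the Splunk host can run that search is to shell out to the Splunk CLI and pick the fields out of each returned event (for example with the parse_alert_body() helper sketched earlier). The CLI path, credentials, and output handling below are assumptions, so adjust them to your install.

import os

SPLUNK_BIN = '/opt/splunk/bin/splunk'   # adjust to your install
# The query from the write-up above (the leading "search" keyword is implied
# when the query is handed to the CLI).
SEARCH = ('host=solarwinds monitoringalert Ack="Not Acknowledged" '
          'startminutesago=5 | dedup NodeID | sort "Severity", "AlertGUID", "NodeID"')

def run_alert_search(user='admin', password='changeme'):
    """Run the five-minute alert search via the Splunk CLI and return the raw
    event lines.  Credentials here are placeholders."""
    cmd = "%s search '%s' -auth %s:%s" % (SPLUNK_BIN, SEARCH, user, password)
    output = os.popen(cmd).read()
    return [line for line in output.splitlines() if 'MonitoringAlert' in line]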

 

During the hours of 8:00 AM to 6:00 PM, Monday through Friday, if one or more alerts are found, an email is sent to the subscribers list.

 

After hours, a message is sent to just the on-call person (amononcall.txt).  If it is not acknowledged within 15 minutes, it goes to the entire subscribers list, and the managers are added if it is not acknowledged 15 minutes after that.  The escalation level is tracked by placing an "escalation_level=3, 2 or 1" tag in the message body and including the splunk mailbox in the "To" list.  Escalation level 1 is the highest level.
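
The exact escalation bookkeeping is in alertmon.py. The sketch below only illustrates the decision just described, assuming the level starts at 3 for the first after-hours notice and counts down to 1 as the managers are pulled in; the helper name and signature are mine.

import time

def choose_recipients(escalation_level, oncall, subscribers, managers):
    """Pick the recipients for an alert email.

    Assumed mapping: 3 = first after-hours notice (on-call person only),
    2 = unacknowledged for 15 minutes (entire subscribers list),
    1 = unacknowledged for 30 minutes (subscribers plus managers).
    """
    now = time.localtime()
    business_hours = (now.tm_wday < 5             # Monday through Friday
                      and 8 <= now.tm_hour < 18)  # 8:00 AM to 6:00 PM
    if business_hours:
        return subscribers
    if escalation_level == 3:
        return [oncall]
    if escalation_level == 2:
        return subscribers
    return subscribers + managers                 # level 1: managers added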

 

Every time the alertmon.py script runs (every 5 minutes) it also checks for emails sent to splunk.  For after-hours messages it checks the escalation level, and it checks for alert acknowledgements.
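
The write-up does not spell out how the replies are picked up on this side; they may simply be searched out of the Splunk index that the IMAP plugin feeds, or read straight from the mailbox. Purely as an illustration of the latter, here is a generic check of an IMAP inbox using Python's standard imaplib; the host, account, and password are placeholders.

import imaplib
import email

def fetch_unread_replies(host='mail.mydomain.com',
                         user='splunk@mydomain.com', password='secret'):
    """Return (subject, body) tuples for unread messages in the splunk mailbox."""
    conn = imaplib.IMAP4(host)
    conn.login(user, password)
    conn.select('INBOX')
    replies = []
    status, data = conn.search(None, 'UNSEEN')
    for num in data[0].split():
        status, msg_data = conn.fetch(num, '(RFC822)')
        msg = email.message_from_string(msg_data[0][1])
        if msg.is_multipart():
            body = msg.get_payload(0).get_payload()
        else:
            body = msg.get_payload()
        replies.append((msg.get('Subject', ''), body))
    conn.logout()
    return replies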

 

If an email sent back to splunk has "ack" as the first three letters of any line of text, the script parses the message one line at a time looking for the alert GUID and then the subsequent NodeID entries. Using the functions in the ack.py script, Splunk sends an acknowledgement for each alert via an HTTP request to Solarwinds (providing the AlertGUID and NodeID).
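
The actual acknowledgement call lives in ack.py in the tarball and depends on your Orion web console, so the request below is only a placeholder showing the shape of the flow (Python 2 style, to match the RHEL 4 host): spot an "ack" line, collect the AlertGUID and NodeID values from the quoted alert text, and hand each pair to an acknowledge function. The URL and parameter names are my assumptions, not the real Orion 9.0 interface; use whatever ack.py does.

import urllib

# Placeholder endpoint -- substitute whatever ack.py actually calls on your
# Orion server.
ORION_ACK_URL = 'http://solarwinds.mydomain.com/Orion/AckAlert.aspx'

def acknowledge(alert_guid, node_id):
    """Send one acknowledgement to Orion (URL and parameters are assumptions)."""
    params = urllib.urlencode({'AlertDefID': alert_guid, 'NetObject': 'N:' + node_id})
    urllib.urlopen(ORION_ACK_URL + '?' + params).read()

def handle_reply(body):
    """If any line of the reply starts with 'ack', acknowledge every
    AlertGUID/NodeID pair found in the quoted alert text."""
    lines = body.splitlines()
    acked = [l for l in lines if l.strip().lower().startswith('ack')]
    if not acked:
        return
    alert_guid = None
    for line in lines:
        line = line.lstrip('> ').strip()   # strip reply quoting
        if 'AlertGUID=' in line:
            alert_guid = line.split('AlertGUID=', 1)[1].split('|')[0].strip()
        if 'NodeID=' in line and alert_guid:
            node_id = line.split('NodeID=', 1)[1].split('|')[0].strip()
            acknowledge(alert_guid, node_id)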

 

The last thing the script does is send out an email to the recipients list if any "Not Acknowledged" alerts have been found in the last five minutes.
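
For completeness, here is roughly how an alert email can be flagged Importance High, which is what lets the BlackBerry filter described at the end of this post raise it to Level 1. This is a generic smtplib sketch rather than the code from alertmon.py; the SMTP host and addresses are placeholders.

import smtplib
from email.MIMEText import MIMEText   # email.mime.text.MIMEText on newer Pythons

def send_alert_email(recipients, subject, body,
                     sender='splunk@mydomain.com', smtp_host='mail.mydomain.com'):
    """Send an alert flagged Importance High.  Host and addresses are placeholders."""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = sender
    # alertmon.py also puts an escalation_level=N tag in the body and copies
    # the splunk mailbox on the To list so the next run can track escalation.
    msg['To'] = ', '.join(recipients)
    msg['Importance'] = 'High'   # what the BlackBerry Importance filter keys on
    msg['X-Priority'] = '1'      # same hint for other mail clients
    server = smtplib.SMTP(smtp_host)
    server.sendmail(sender, recipients, msg.as_string())
    server.quit()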

 

Every hour a separate script, amonhourly.py, sends an email to everyone on the subscribers (amonsubscribers.txt) and escalation (amonescalation.txt) lists with a summary of all alerts triggered and reset in the last hour.  This email is not flagged as Level 1/Importance High.
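
I won't reproduce amonhourly.py here; presumably it runs the same kind of search over a 60-minute window, picking up both the MonitoringAlert and MonitoringReset tags. The search strings below are my guess, modeled on the five-minute search earlier, and the formatting helper is only a sketch.

# Guesses at the hourly searches -- the real strings are in amonhourly.py.
HOURLY_ALERTS = ('host=solarwinds monitoringalert startminutesago=60 '
                 '| dedup NodeID | sort "Severity", "AlertGUID", "NodeID"')
HOURLY_RESETS = ('host=solarwinds monitoringreset startminutesago=60 '
                 '| dedup NodeID | sort "AlertGUID", "NodeID"')

def build_summary(alert_lines, reset_lines):
    """Assemble the plain-text body of the hourly summary email."""
    parts = ['Alerts triggered in the last hour:']
    parts.extend(alert_lines or ['  (none)'])
    parts.append('')
    parts.append('Alerts reset in the last hour:')
    parts.extend(reset_lines or ['  (none)'])
    return '\n'.join(parts)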

 

Instructions to End Users

 

Users can subscribe or unsubscribe to the Solarwinds monitoring alerts by sending an email to splunk@mydomain.com with the following in the subject:

To subscribe:  "sub SM"

To unsubscribe:  "unsub SM"
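
On the receiving side, handling those subjects is just a matter of adding or removing the sender's address in amonsubscribers.txt. The sketch below shows that bookkeeping under my own assumptions; the real logic is in alertmon.py.

SUBSCRIBERS_FILE = '/root/amonsubscribers.txt'

def handle_subscription(sender, subject, path=SUBSCRIBERS_FILE):
    """Add or remove the sender based on a 'sub SM' / 'unsub SM' subject."""
    subject = subject.strip().lower()
    subscribers = [line.strip() for line in open(path) if line.strip()]
    if subject.startswith('unsub sm'):
        subscribers = [s for s in subscribers if s.lower() != sender.lower()]
    elif subject.startswith('sub sm'):
        if sender.lower() not in [s.lower() for s in subscribers]:
            subscribers.append(sender)
    else:
        return
    f = open(path, 'w')
    f.write('\n'.join(subscribers) + '\n')
    f.close()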

 

You can acknowledge alerts by replying to splunk with "ack" (upper or lower case) in the body.

 

After hours, the on-call person gets the first message. If it is not acknowledged within 15 minutes, subsequent messages go to everyone, and after 30 minutes the managers are added to the alerts.  The on-call notification rotates to the next person on call at 8:00 AM every Monday.

 

You can set up your BB to beep on Severity 1 messages in two steps:

  1. Set up your profile to alert for Level 1 messages inside your BB
  2. Using the BB Desktop Manager, create a filter for Important messages
    1. Click on Email Settings/Redirector Settings (depending on version of BB Desktop)
    2. Go to Filters tab
    3. Create a new filter
      1. Call the filter Important or Severity 1
      2. Check the Importance box and set it to High
      3. Check the box for Forwarding as Level 1
    4. Save your new filter

NOTE: Consider also filtering on emails from splunk@mydomain.com if you don't want all emails set with Importance High to generate an alert.

ackByEmail.tgz
  • Note that one more change needs to be made to the alertmon.py script. On line 228, the following text needs to be taken off the end of the line:
    and time.strftime('%A')!='Wednesday'

    I had it in there to test alert escalation and didn't get it out before I tar'd it up. I don't see any way to replace or delete my uploads.