Over the last few weeks we have had Solarwinds/Orion installed (NPM,SAM,vMAN).
Now we are setting up alert notifications, however my SMTP Relay team are asking if the Orion email client supports the following:
"Does the application platform in question (Solarwinds) have an internal email process which has the capability to queue and retry messages if the message is not relayed to our gateways?"
I do not see any settings in 'Advanced Alert Manager' to manage send retries, in the event that a message fails to be relayed by our SMTP gateways.
Am I missing something, or is there no support to manage this type of requirement?
We have not had that question come up and I don't recall seeing any place to configure that kind of setting. I would be very interested if there is a solution for this because it would help us ensure that email notifications (and other email related actions) are actually taking place.
A suggestion from Steven Carlson posted here: Re: Alert emails not being sent when SMTP server is temporarily unavailable - feature request or bug... was to install SMTP services on the Orion server, send email locally, and then have the SMTP service relay email to the Exchange system. This would handle retries, as you can configure them in the local SMTP service. I have not done this though, so I don't know all the possibilities. Your email admin might not like allowing email relay.
Other than that I have not found a solution that's internal to Orion/AlertCentral/etc.
On a side note:
The next best thing would be to have Orion send to a system that can perform retries. No suggestions from me, but any of the mass notification system vendors can probably do it for only a few trainloads of gold, and then you'd just have the gap if the secondary system was unavailable, which would be unlikely if you could install it on the Orion system.
The only thing I have used to handle retries (and to band-aid any other needed functionality that the vendor does not supply yet) is to write two scripts to handle alerting. The first script, as quickly and simply as possible, dumps each alert in a standard, easily re-parseable/readable format like XML to individual files (one alert per file, each uniquely named) in a directory. This script does not write to a log, and if it has issues it crashes/exits with a non-zero error code so the app executing it can detect this... hopefully. The second script does all the heavy lifting, status logging, etc. and processes the files (first in, first out, based on file modified times) to forward them. The second script runs from cron/Task Scheduler (at intervals of 1 minute or greater) or is built into a Windows service/daemon if it needs to run more often than once a minute. If the second script successfully forwards the alert (to an email system, a database, another app, etc.) it moves the alert file to an archive location or just deletes it. If it fails, it simply leaves the file there to be reprocessed on the next run. It's not "perfect", as some things screw up due to sending things on the command line, reparsing the file, files with identical modified times not being sent in perfect order, some apps not executing the script 100x per minute if there's a flood, etc., but it works great most of the time.
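To make the two-script idea concrete, here is a minimal Python sketch of it, not the scripts I actually run. It assumes JSON instead of XML for the alert files, and the names spool_alert, forward_spooled and the send callback are made up for illustration; send stands in for whatever actually delivers the alert (email, database, another app).

```python
import json
import os
import shutil
import sys
import tempfile
import uuid
from pathlib import Path


def spool_alert(spool_dir, fields):
    """Script 1: write one alert to its own uniquely named file and nothing else."""
    spool = Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    final_path = spool / f"{uuid.uuid4().hex}.json"
    # Write under a temporary name first, then rename, so the forwarder
    # never picks up a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=str(spool), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(fields, handle)
    os.replace(tmp_name, final_path)
    return final_path


def forward_spooled(spool_dir, archive_dir, send):
    """Script 2: forward spooled alerts oldest first; failures stay for the next run."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    # First in, first out by modified time; the file name breaks ties.
    pending = sorted(Path(spool_dir).glob("*.json"),
                     key=lambda p: (p.stat().st_mtime, p.name))
    sent = 0
    for path in pending:
        try:
            alert = json.loads(path.read_text(encoding="utf-8"))
            send(alert)  # hypothetical delivery callback
        except Exception as exc:
            # Leave the file in place so the next scheduled run retries it.
            print(f"left {path.name} for retry: {exc}", file=sys.stderr)
            continue
        shutil.move(str(path), str(archive / path.name))
        sent += 1
    return sent
```

The monitoring app's "run a command" action would call something like spool_alert with the alert variables, and cron/Task Scheduler would call forward_spooled every minute. Replaying an alert is then just copying a file from the archive directory back into the spool directory.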
A similar solution is to write a script that queries Orion/x monitoring system for new alerts, forwards them, then flags each successfully forwarded event in Orion/x monitoring system so it doesn't process it again on the next run. It's just about as complicated to implement as the solution above, but there's no writing/converting to files, so it should be less error prone. The catch is that you have to hunt down how to connect to 3rd party apps and (at times) reverse engineer how to query them and update their data.
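A rough sketch of that query/forward/flag loop, for illustration only. It uses a local SQLite table named alerts with id, message and forwarded columns as a stand-in; that table, the function name forward_new_alerts and the forward callback are all assumptions and not Orion's real schema or API, which you would have to look up for your own system.

```python
import sqlite3


def forward_new_alerts(conn, forward):
    """Forward every unflagged alert, flagging only the ones that went through."""
    rows = conn.execute(
        "SELECT id, message FROM alerts WHERE forwarded = 0 ORDER BY id"
    ).fetchall()
    done = 0
    for alert_id, message in rows:
        try:
            forward(message)  # hypothetical delivery callback
        except Exception:
            # Not flagged, so the next run will pick this alert up again.
            continue
        conn.execute("UPDATE alerts SET forwarded = 1 WHERE id = ?", (alert_id,))
        conn.commit()
        done += 1
    return done
```

Flagging after a successful forward (rather than before) means a crash in between can cause a duplicate notification but never a lost one, which is usually the better trade for alerting.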
I like the file caching because I can give the scripts to app owners (and handle non-Windows OSes relatively easily if everything is written in a cross-platform language) and can also see the contents of the alert files and easily "replay" them by copying a file from the archive. Almost all monitoring apps have a "run a command line" option, so they support the "write alerts to script" option. The downside is that you now have all this to maintain, and making it as bulletproof and flexible as possible for multiple vendors can take a lot of work.
If you check out the v12 beta (Re: NPM 12.0 BETA4 + QOE NOW AVAILABLE), we surface the ability to set a back-up SMTP server for alerts. If the primary SMTP server doesn't respond, we'll retry on the secondary. This may help you achieve a similar goal of resiliency.
This new failover setting is a nice thing to have, but it probably won't help with the main issue that I have been needing retries for: when I see Orion try to send a *lot* of emails (maybe 10? 20? 100?) in a short period of time to many different people/groups because of a large outage, the Exchange system stops allowing me to send them because (taken from the "Message" field in [SolarWindsOrion].[dbo].[AlertLog]):
"Failed - Was unable to send email message. The message could not be sent to the SMTP server. The transport error code was 0x800ccc67. The server response was 421 4.3.2 The maximum number of concurrent connections has exceeded a limit, closing transmission channel"
The Exchange administrator isn't going to budge on the Exchange settings. The above alert indicates that Orion didn't open just 1 connection to send all emails but many (probably one per sent email, or one per rule that triggered), but I bet if it did open just one connection and could send more emails through it, there would be some other limit hit on the Exchange side indicating "too many emails sent at once".
Maybe a possible solution/option other than a retry would be to enable sending all emails through a single SMTP connection, esp. if the Exchange side/admin could bump up that threshold (how many emails can be sent/time) on a per-sender basis?
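For anyone who ends up doing this in an external forwarder script rather than waiting for Orion to support it, here is a minimal sketch of the "one connection for the whole batch" idea combined with a retry on temporary (4xx) responses like the 421 above. It uses Python's standard smtplib; the function name send_batch, the relay host you pass in, and the attempts/delay defaults are all assumptions, and I can't say whether a single connection would actually stay under your Exchange limits.

```python
import smtplib
import time
from email.message import EmailMessage


def send_batch(messages, host, port=25, attempts=3, delay=30.0,
               smtp_factory=smtplib.SMTP, sleep=time.sleep):
    """Send all messages over one SMTP connection; retry the leftovers on 4xx.

    Returns the messages that were still unsent after the last attempt.
    """
    pending = list(messages)
    for attempt in range(1, attempts + 1):
        try:
            # One connection per attempt, shared by every pending message.
            with smtp_factory(host, port) as smtp:
                while pending:
                    smtp.send_message(pending[0])
                    pending.pop(0)  # drop it only after the server accepted it
            return []
        except smtplib.SMTPResponseException as exc:
            if not 400 <= exc.smtp_code < 500:
                raise  # 5xx is a permanent failure; retrying will not help
        except (smtplib.SMTPServerDisconnected, ConnectionError, TimeoutError):
            pass  # treat dropped or refused connections as temporary
        if attempt < attempts:
            sleep(delay * attempt)  # simple linear back-off between attempts
    return pending
```

Each message would be an EmailMessage with From, To and Subject set. Anything returned by send_batch could be written back to a spool directory or logged so it is not silently lost.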
(edit: yes I know the post is a year old. Just doing it as I ran across it again. Maybe it will give someone ideas)
You may find your environment is more friendly to fewer notifications than to many notifications. In that case, create Dependencies to reduce the number of alert messages, and then create a Network Notifications group to receive them. That may be all it takes to reduce the unhappiness of your mail server.
One of my sites has twenty network rooms and a data center. Orion monitors every switch, UPS, server, Access Point, etc. and it will try to send a message to many people if the WAN connection to that remote site fails, or if the site's router fails.
Solution: Create a Dependency Group to contain all the nodes except the router. Make the router the Parent of that Dependency Group. Set up the Alert to send one e-mail to a group address: NetworkNotifications@yoursite.com.
The next time the WAN fails or the router is unavailable, all the nodes behind the router won't send alerts--only the router will. And it'll only send to one address--the group address you define.
The beauty is how easy this is to do. More beauty comes from building sub-dependencies to limit the messages in case the router remains up but one or more network rooms and their nodes go down.
It's sweet--and maps out nicely with NTM!