Hello.
We have a dozen or so Orion servers monitoring various customers' networks and have had no problems with any of them using basic alerts for many years.
On some of the servers we needed advanced alerts, so we set them up a few months back, but we have found them to be unreliable.
On some occasions the SolarWinds Alerting Engine service was found to have stopped. It was running under the "local system" account, and although the service was set to restart in the event of a failure, the restart wasn't happening.
On other occasions the Alerting Engine service was still running but our email notifications weren't being triggered. Stopping and restarting the service cleared the problem for a while and email notifications started up again.
As is often the way, this happened on different servers and was fixed by different people, and it took a while before we realised that we were seeing the same fault on several servers, seemingly independently of each other.
Some of these machines are in a domain, while a couple are standalone systems in a workgroup, based on customer sites that have no connection to the other machines.
As part of my investigation I changed the Alerting Engine service to run under a domain account that has administrative privileges on those servers. That has made the service much more stable, though we did have an instance of the service stopping yesterday.
Not all of the servers were running current versions of Orion - some were 9.1.0 SP4, with a couple on 9.0.0 and one on 8.50 - but they have now all been upgraded to 9.1.0 SP4. I can't be 100% certain that the original installations were done using a local administrator account and not a domain administrator account, but I made sure that this was the case when they were upgraded. I'm aware that using a domain account to do the install can sometimes cause issues, but we have the same problem on machines that were definitely installed using a local account, so I don't believe that's the cause here.
In the SolarWinds Windows event log I've seen a number of errors like the one below, which *I think* correspond to the service stopping, but I can't be certain of that as nobody thought to record the dates and times of the problems occurring.
2009-04-22 09:54:27,576 [MainTaskThread] ERROR All - Error in TaskManager loop -
EXCEPTION STACK: --> Thread was being aborted. : mscorlib : at System.Threading.Thread.SleepInternal(Int32 millisecondsTimeout)
at AlertingEngine.SWAlertingEngine.TaskManager();;;
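Since nobody recorded when the failures happened, one way to recover the timings after the fact might be to pull the error timestamps out of the log and compare them with when the service was seen to be down. The following is only a minimal Python sketch, assuming every entry follows the same "date time,milliseconds [thread] LEVEL message" layout as the excerpt above. The function name find_errors and the sample lines are made up for illustration and are not part of Orion.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
# 2009-04-22 09:54:27,576 [MainTaskThread] ERROR All - Error in TaskManager loop -
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+)\s+(?P<rest>.*)$"
)


def find_errors(lines, needle="Error in TaskManager loop"):
    """Return (timestamp, thread, message) for each ERROR line containing needle."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # stack-trace continuation lines carry no timestamp
        if match.group("level") != "ERROR" or needle not in match.group("rest"):
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(match.group("ms")) * 1000)
        hits.append((ts, match.group("thread"), match.group("rest").strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical sample; in practice the lines would be read from the log file.
    sample = [
        "2009-04-22 09:54:27,576 [MainTaskThread] ERROR All - Error in TaskManager loop -",
        "EXCEPTION STACK: --> Thread was being aborted. : mscorlib : at System.Threading.Thread.SleepInternal(Int32 millisecondsTimeout)",
        "at AlertingEngine.SWAlertingEngine.TaskManager();;;",
    ]
    for ts, thread, message in find_errors(sample):
        print(ts.isoformat(sep=" "), thread, message)
```

The resulting list of times could then be lined up against the Windows System event log entries for the service stopping, which would show whether this error really does coincide with the failures.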
Has anyone seen this happen, and do you know how to resolve it? I'd flatten the servers and rebuild them from scratch if I were sure that would stop this recurring.