
What do you do to protect against the "alerts not resetting" condition?

Hi All,

Having worked with SolarWinds for a while, I've come across a few flavours of the "alerts not resetting" issue. They include:

The alert triggered, the object has since changed, and the reset condition no longer applies because the object now falls out of scope. This is annoying, but a logical outcome, I suppose.

The alert should reset under a specified condition (say, a time-based 30-minute reset), but the server was dodgy during the reset window, so the reset action hasn't occurred and doesn't get rescheduled. This is the one I'm primarily thinking of at the moment.

Others exist (or have existed) too, of course.

What do you do to protect yourself from issues of this sort?

  • Unfortunately for us, it's a reboot.

    I normally see a large uptick in our node rebooted alert not automatically clearing (it is supposed to clear itself after 10 minutes). When I see that going on, I normally need to bring things up quickly, so a reboot of our primary does a good job of cleaning it up. I know that is more of a workaround than a legit solution.

  • I have a heartbeat alert that triggers to our ticketing tool. It alerts on a good condition (so it always fires), and resets on a timer.

    My ticketing tool alerts me if it has not received x (2) alerts in the last y period of time (30 minutes). 

    The root cause is usually the alert service hanging (in some instances only for some alert types, more below). 90% of the time a service restart fixes it; the other 10% we have to reboot the server.


    The most common cause of the hang is SQL cluster failover.

    Interestingly enough, alerts with the reset condition set to a timer have the most issues (confirmed with support over several support tickets; it sounds like a known defect). So this heartbeat seems to detect it the best.

  • This is off-topic, so if you can point me in a direction that would be great. 

    I noticed that you said this can be caused by SQL cluster failover.  I've noticed that when we have a cluster failover SolarWinds rarely handles this well.  I don't know if this is a SolarWinds or OS issue.  Has anyone found out why this is and a way to mitigate the problem?  The only solution I've found is restarting the SolarWinds server to force it to re-drive all of its connections.

  • Assuming you're having the same issue I am, it's a SW issue. I've asked a few times via support tickets and was told the only solution is to restart all servers (starting with MPE). I have asked for defects to be pushed to development, but I am not optimistic, and I suspect enterprise customers using clusters and HA are a minority of their customer base.

    The only solution I have found is to restart servers.

  • I wonder: if you had the SAM module monitoring the SQL cluster with AppInsight, could you trigger a service restart when the cluster fails over?

  • While this is clever, I've found that asking SolarWinds to fix itself on failure usually doesn't work, because of the failure!

    Worth a shot though. I seem to have had fewer issues with SQL failover than some of the others here.

  • This is a very good idea. I'll be thinking over how to implement something like this.

  • Have you looked at this article?

    https://support.solarwinds.com/SuccessCenter/s/article/The-Alert-me-when-a-node-reboots-alert-does-not-reset?language=en_US

    While it really doesn't explain much, it does offer a solution that is slightly better than restarting all the servers. Of course, you still have to stop all the services, and we all know how long it can take for the SolarWinds platform to recover after a restart. I've been having to run through this process every couple of months, it seems.

    I'm intrigued by the reference about the database failover. While this wouldn't have been the case in our first instance, we have since moved the database to a SQL cluster. I'll have to look into that to see if that may be what is triggering it.
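
For anyone wanting to try the heartbeat approach described earlier in the thread, here is a rough sketch of what the receiving side might look like. It is not how any particular ticketing tool does it; the class name `HeartbeatWatchdog` and its methods are made up for illustration, and the defaults (2 alerts in 30 minutes) are simply the numbers quoted in that reply. It assumes heartbeat arrivals are recorded in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class HeartbeatWatchdog:
    """Hypothetical helper: flags when too few heartbeat alerts arrive in a window."""

    def __init__(self, min_count=2, window=timedelta(minutes=30)):
        self.min_count = min_count      # "x" alerts expected
        self.window = window            # within the last "y" period of time
        self._arrivals = deque()        # arrival timestamps, oldest first

    def record(self, when):
        """Call this each time a heartbeat alert is received."""
        self._arrivals.append(when)

    def is_healthy(self, now):
        """Return True if at least min_count heartbeats arrived within the window."""
        cutoff = now - self.window
        # Drop arrivals that have aged out of the window.
        while self._arrivals and self._arrivals[0] < cutoff:
            self._arrivals.popleft()
        return len(self._arrivals) >= self.min_count


start = datetime(2024, 1, 1, 12, 0)
dog = HeartbeatWatchdog()
dog.record(start)
dog.record(start + timedelta(minutes=10))
print(dog.is_healthy(start + timedelta(minutes=20)))  # True: both heartbeats are recent
print(dog.is_healthy(start + timedelta(minutes=45)))  # False: heartbeats have stopped
```

The point of alerting on a good condition is that silence becomes the signal: if the alerting service hangs, the heartbeats stop and the check on the receiving side fails, without relying on SolarWinds to report its own failure.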