What do you do to protect against the "alerts not resetting" condition?

adam.beedell over 1 year ago

Hi All,

Working with SolarWinds for a while, I've come across a few flavours of the "alerts not resetting" issue. They include:

The alert triggered, the object has since changed, and the reset condition no longer applies because the object now falls out of scope. This is annoying, but a logical outcome, I suppose.

The alert should reset under a specified condition (say, a time-based 30-minute reset), but the server was dodgy during the reset window, so the reset action hasn't occurred and doesn't get rescheduled. This is the one I'm primarily thinking of at the moment.

Others exist (or have existed) too, of course.

What do you do to protect yourself from issues of this sort?

Top Replies

0 SteveK over 1 year ago

Unfortunately for us, it's a reboot.

I normally see a large uptick in our node-rebooted alert not clearing automatically (it is supposed to clear itself after 10 minutes). When I see that going on, I normally need to bring things back up quickly, so a reboot of our primary does a good job of cleaning it up. I know that is more of a workaround than a legit solution.
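
For anyone who wants to spot that uptick before it gets noticed by eye, here is a minimal sketch of the check in Python. It assumes you can already export the currently active alerts as (name, trigger time) pairs by whatever means you have; the function name, the 5-minute grace period, and the sample data are all made up for illustration and are not part of any SolarWinds API.

```python
from datetime import datetime, timedelta


def overdue_alerts(active_alerts, now, reset_after=timedelta(minutes=10),
                   grace=timedelta(minutes=5)):
    """Return names of alerts that are still active past their timer reset.

    active_alerts is an iterable of (name, triggered_at) pairs; how that list
    is obtained (report, API query, database view) is left open here.
    """
    deadline = reset_after + grace
    return [name for name, triggered_at in active_alerts
            if now - triggered_at > deadline]


# Hypothetical sample data, not taken from a real system.
now = datetime(2024, 1, 1, 12, 0)
active = [
    ("node-a rebooted", now - timedelta(minutes=8)),
    ("node-b rebooted", now - timedelta(minutes=45)),
]
print(overdue_alerts(active, now))
```

Anything this returns is an alert that should have reset on its timer and has not, which is a hint that the reset was missed rather than that the node is still rebooting.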

+1 monitoringlife over 1 year ago

I have a heartbeat alert that triggers to our ticketing tool. It alerts on a good condition (so it always fires) and resets on a timer.

My ticketing tool alerts me if it has not received x (2) alerts in the last y period of time (30 minutes).
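
The receiving side of that heartbeat can be sketched roughly as follows. This is only an illustration in Python of the "at least x alerts in the last y minutes" rule described above, not how any particular ticketing tool implements it; the function name and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta


def heartbeat_is_healthy(received, now, min_count=2,
                         window=timedelta(minutes=30)):
    """Return True if at least min_count heartbeats arrived in the window ending at now."""
    cutoff = now - window
    recent = [ts for ts in received if cutoff <= ts <= now]
    return len(recent) >= min_count


# Hypothetical arrival times of heartbeat alerts.
now = datetime(2024, 1, 1, 12, 0)
healthy = [now - timedelta(minutes=25), now - timedelta(minutes=10)]
stalled = [now - timedelta(minutes=55), now - timedelta(minutes=40)]

print(heartbeat_is_healthy(healthy, now))  # True: two heartbeats inside 30 minutes
print(heartbeat_is_healthy(stalled, now))  # False: alerting has gone quiet
```

The point of the design is that the check runs outside the monitoring platform, so it still works when the alerting service itself is the thing that has hung.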

The root cause is usually the alerting service hanging (in some instances only for some alert types; more below). 90% of the time a service restart fixes it; the other 10% of the time we have to reboot the server.

The most common cause of the hang is a SQL cluster failover.

Interestingly enough, alerts whose reset condition is set to a timer have the most issues (confirmed with support over several tickets; it sounds like a known defect). So this heartbeat seems to detect it best.

0 sguido over 1 year ago in reply to monitoringlife

This is off-topic, so if you can point me in a direction that would be great.

I noticed that you said this can be caused by SQL cluster failover. I've noticed that when we have a cluster failover SolarWinds rarely handles this well. I don't know if this is a SolarWinds or OS issue. Has anyone found out why this is and a way to mitigate the problem? The only solution I've found is restarting the SolarWinds server to force it to re-drive all of its connections.

0 monitoringlife over 1 year ago in reply to sguido

Assuming you're having the same issue I am, it's a SolarWinds issue. I've asked a few times via support tickets and was told the only solution is to restart all servers (starting with the MPE). I have asked for defects to be pushed to development, but I am not optimistic, and I suspect enterprise customers using clusters and HA are a minority of their customer base.

The only solution I have found is to restart servers.

0 dodo123 over 1 year ago in reply to monitoringlife

I'm wondering: if you had the SAM module monitoring the SQL cluster with AppInsight, could you trigger a service restart when it changes?

0 adam.beedell over 1 year ago in reply to dodo123

While this is clever, I've found that asking SolarWinds to fix itself on failure usually doesn't work, because of the failure!

Worth a shot though. I seem to have had fewer issues with SQL failover than some of the others here.
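
If someone does want to try it, one option is to run the check from a separate host so it doesn't depend on the platform being healthy. A rough Python sketch of just the decision logic is below; how you find out which node currently owns the SQL cluster, and what the restart step actually runs, are deliberately left as placeholders, and every name here is hypothetical.

```python
def check_failover(previous_owner, current_owner, restart):
    """Call restart() when the active cluster node has changed.

    Returns the owner to remember for the next poll. The first poll
    (previous_owner is None) only records the owner and never restarts.
    """
    if previous_owner is not None and current_owner != previous_owner:
        restart()
    return current_owner


# Simulated polls; in practice each value would come from your own cluster check.
restarts = []
owner = None
for observed in ["sql-node-1", "sql-node-1", "sql-node-2"]:
    owner = check_failover(owner, observed, lambda: restarts.append(observed))

print(restarts)
```

Here the restart is only recorded in a list; the real action (a service restart, a ticket, a page to whoever is on call) would replace the lambda.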

0 adam.beedell over 1 year ago in reply to monitoringlife

This is a very good idea. I'll be thinking over how to implement something like this.

0 StillGoing over 1 year ago

Have you looked at this article?

https://support.solarwinds.com/SuccessCenter/s/article/The-Alert-me-when-a-node-reboots-alert-does-not-reset?language=en_US

While it really doesn't explain much, it does offer a solution that is slightly better than restarting all the servers. Of course, you still have to stop all the services, and we all know how long it can take for the SolarWinds platform to recover after a restart. I've been having to run through this process every couple of months, it seems.

I'm intrigued by the reference to the database failover. While this wouldn't have been the case in our first instance, we have since moved the database to a SQL cluster. I'll have to look into whether that may be what is triggering it.
