Since the new HA module came out, I've occasionally run into cases where my servers bounce back and forth between pool members, causing me some headaches. This thread is mostly a place to write down what I saw and my thoughts, in case it helps others who are struggling with their HA environments.
I dug through the database and the logs and was able to resolve the issues in my specific cases, but I never came across a central "reason for failover" kind of value that was actually useful. Have I just missed it, or does it not currently exist?
The situation I ran into most recently: my primary poller was pretty overloaded while we shuffled some nodes around fixing an issue with an APE. Nothing was crashing, but I suspected the system's slow response times were what triggered the failovers. The failover itself was WAY more disruptive to my work than it would have been to just let me finish what I was doing and move the nodes back off the primary engine to their home APE. Eventually, after several failovers interrupted me, I went ahead and killed the HA pool so I could finish; the system stayed up just fine while overloaded, just ran a bit slow until we were done. When I went digging for a conclusive indicator that we had definitely breached some specific threshold, I couldn't find one, which prompted this thread.
I can get a few minimally useful metrics from HA_PoolsView, such as when the pool last changed over, but nothing pointing at why it triggered.
In most cases, looking at HA_PoolMembersView, you would think StatusMessage might have something useful, but it is typically null.
HA_PoolMembers has columns called ReasonofFail and StatusMessage, but they are also null in most cases.
HA_Audit has an entry for each failover event letting me know that server xx is up or down, but still no indication of why.
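Since neither the views nor HA_Audit carry a reason, the best I could manage was lining up the HA_Audit up/down timestamps against the WARN lines in the service log by hand. Here's a small sketch of that correlation step; the helper name and the 60-second window are my own choices, and you'd feed it timestamps pulled from your own audit rows and log lines:

```python
from datetime import datetime, timedelta

def nearest_warn(audit_time, warn_times, window=timedelta(seconds=60)):
    """Return the log WARN timestamp closest to an audit event, if any
    falls within `window` of it; otherwise None."""
    candidates = [t for t in warn_times if abs(t - audit_time) <= window]
    return min(candidates, key=lambda t: abs(t - audit_time)) if candidates else None

# Example with made-up timestamps: one WARN line lands 25s before the
# audit event, so it is returned as the likely related log entry.
audit_event = datetime(2020, 1, 1, 12, 0, 30)
warn_lines = [datetime(2020, 1, 1, 11, 50, 0), datetime(2020, 1, 1, 12, 0, 5)]
match = nearest_warn(audit_event, warn_lines)
```

Nothing fancy, but it at least narrows down which WARN burst belongs to which failover when you have a string of them.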
Then we get into parsing the log files themselves.
\ProgramData\Application Data\SolarWinds\Logs\HighAvailability has two files, HighAvailability.Service and HighAvailability.KeepAlive. The KeepAlive file has generally been empty when I've looked at it, and the Service file is a little dense. By searching for "WARN" lines I was able to spot several occurrences of "Call to Monitor() timed out.", which told me my initial hunch was probably correct.
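To make that log hunting repeatable, here's roughly the grep I ended up doing by hand, as a small script. The "Call to Monitor() timed out." marker is the string I actually saw; the assumption that "WARN" and the marker appear on the same line is mine, so adjust for your log layout:

```python
import re

# Marker string observed in HighAvailability.Service; the surrounding
# line layout is assumed, not documented.
TIMEOUT_MARKER = "Call to Monitor() timed out."
WARN_RE = re.compile(r"\bWARN\b")

def find_warn_lines(log_text, marker=TIMEOUT_MARKER):
    """Return (line_number, line) pairs for WARN lines containing the marker."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if WARN_RE.search(line) and marker in line:
            hits.append((lineno, line.strip()))
    return hits

if __name__ == "__main__":
    # In a real run you'd point this at the service log, e.g.:
    # with open(r"\ProgramData\Application Data\SolarWinds\Logs"
    #           r"\HighAvailability\HighAvailability.Service") as f:
    #     print(find_warn_lines(f.read()))
    pass
```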
I can see in the LOCAL POOL SNAPSHOT some intervals defined like so:
IntervalMemberDown: '00:00:32', IntervalPoolTask: '00:00:08', IntervalSuicideRule: 00:00:29
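Those snapshot values are easy to pull apart programmatically if you want to compare them across pool members or over time. A minimal sketch, assuming the `Name: 'HH:MM:SS'` layout shown above (the quoting is inconsistent in the snapshot, so the regex tolerates both forms):

```python
import re
from datetime import timedelta

# Matches entries like IntervalMemberDown: '00:00:32' or IntervalSuicideRule: 00:00:29
INTERVAL_RE = re.compile(r"(Interval\w+):\s*'?(\d{2}):(\d{2}):(\d{2})'?")

def parse_intervals(snapshot_text):
    """Return {name: timedelta} for each Interval* entry in a pool snapshot."""
    out = {}
    for name, h, m, s in INTERVAL_RE.findall(snapshot_text):
        out[name] = timedelta(hours=int(h), minutes=int(m), seconds=int(s))
    return out
```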
Is there anywhere you can tweak those thresholds to make the system a bit more tolerant of situations where the primary poller is CPU-bottlenecked for a while? Like I mentioned earlier, a string of back-and-forth failovers is much more disruptive to operations than just having a slow console in the middle of a big change.
I did come across an executable in the main Orion folder called HAEnableDisable, which probably could have been helpful. You need to launch it from the command line, and it has flags for /info, /disablepool, /enablepool, /disableha, and /enableha.
Anyway, hope you guys find something useful in all that for future reference.
Loop1 Systems: SolarWinds Training and Professional Services