The only thing we have done so far with support is pull diagnostic files. All our servers are up to date with current Microsoft patches now so we have eliminated that as a possibility.
I'm still working with support on the case. They did give me a cool alert to set up. Basically it does a SWIS query against the HA pool members table. I'll upload the XML file they gave me as well.
Inner join HA_PoolMembers
on Nodes.Sysname = HA_PoolMembers.Hostname
where HA_PoolMembers.status != '2'
I changed the query a bit. It looks like Status = 1 is the healthy state in the HA.PoolMembers table, so we wrote an alert that fires on any member whose status is something other than 1.
SELECT HostName, Status, LastHeartbeatTimestamp
WHERE Status != '1'
Update on this case. I am still working with support. We really have not found any evidence pointing to a root cause, so I cannot say what the origin of the problem is at this point, but I will update the post. I did work with my contact at SolarWinds to write a query that shows us when the LastHeartbeatTimestamp is more than 1 hour behind the current UTC time. This alerts us when the condition occurs. We do see some cases where Status = 1 and HA is still in a critical state, so the status check alone does not catch everything.
Mine had been quiet for a few weeks, and then today I get in and bang, all but two pools are red again...
Support's answer was to recreate the pools due to an erroneous record in the database, but it still happens.
Have you tried restarting services on one of those pollers? The lack of a heartbeat for that long sounds like connectivity to the database was lost or a service stopped.
Restarting the services is the only way to get them back to green. Even then it is 50/50 whether the services will actually stop, and it generally requires killing the process on each node in a pair to bring the pool back to green.
Next time it occurs I'll try and grab some logs to see if that gives any further clues.