We started having an issue about a month ago: our servers go into a down state in the High Availability Deployment Summary, which disables the HA pools.
What I am seeing is that the "LastHeartbeatTimestamp" column in dbo.HA_PoolMembers stops updating, causing this issue. The workaround is to force-stop the HA service on the standby servers.
I opened a case with support, but so far they have not been able to find the problem.
The NPM version is 12.3 and the Orion Platform is 2018.2 HF6.
Here are some screenshots of HA deployment summary:
Update on our case. We need to run validation for a few more days, but I think we found the root cause. We are running SQL HA across two separate sites. Our DBAs had mistakenly set the replication to synchronous; since the database VMs are in separate sites, we need asynchronous replication. After configuring asynchronous replication in SQL, we have not seen the HA nodes change to critical state or the HA table stop updating. We also set Configuration Manager to point to the FQDN of the SQL listener. I think the replication configuration may be the culprit. I will update the case once we are sure.
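For anyone who hits the same thing: switching an Always On availability group replica from synchronous to asynchronous commit is done in T-SQL roughly like below. The availability group and replica names here are placeholders, not from our environment; run it on the primary and repeat the MODIFY REPLICA for each secondary.

```sql
-- Placeholder names [OrionAG] / 'SQLNODE2': substitute your own.
-- An asynchronous-commit replica cannot use automatic failover,
-- so set the failover mode to manual first.
ALTER AVAILABILITY GROUP [OrionAG]
MODIFY REPLICA ON 'SQLNODE2'
WITH (FAILOVER_MODE = MANUAL);

ALTER AVAILABILITY GROUP [OrionAG]
MODIFY REPLICA ON 'SQLNODE2'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```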
The key error we saw, and that support pointed out, was in the SolarWinds application event log:
Log Name: SolarWinds.Net
Date: 4/23/2019 3:51:49 AM
Event ID: 4001
Task Category: None
Service was unable to open new database connection when requested.
SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)
Connection string - Data Source=OUR_SQL_Instance ;Initial Catalog=SolarWindsOrion;Persist Security Info=False;User ID=SQL-dummy-user ;X;Max Pool Size=1000;Connect Timeout=20;Load Balance Timeout=120;Packet Size=4096;Application Name="Orion Syslog Service@SyslogService.exe";Workstation ID=Ourhost;MultiSubnetFailover=True
We did finally close our case. Our HA environment has now stayed in a healthy state for over 10 days. So yes, if your HA pool goes to a critical state, look for any SQL errors in the SolarWinds event log and make sure SQL connectivity is not the root cause.
Update on this case. I am still working with support. We really have not found any evidence pointing to a root cause, so I cannot say what the origin of the problem is at this point, but I will update the post. I did work with my contact at SolarWinds to write a query that shows us when the last-heartbeat timestamp is more than 1 hour behind the current UTC time; this alerts us when the condition occurs. We do see some cases where Status = 1 yet HA is in a critical state.
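The staleness check we ended up with looks something like this (column names taken from dbo.HA_PoolMembers as shown in this thread; verify them against your schema version before using it in an alert):

```sql
-- Flag pool members whose last heartbeat is more than
-- 1 hour behind the current UTC time.
SELECT HostName, Status, LastHeartbeatTimestamp
FROM dbo.HA_PoolMembers
WHERE LastHeartbeatTimestamp < DATEADD(HOUR, -1, GETUTCDATE());
```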
Mine had been quiet for a few weeks, and then today I get in and, bang, all but two pools are red again...
Support's answer was to recreate the pools due to an erroneous record in the database, but it still happens.
Have you tried restarting services on one of those pollers? The lack of a heartbeat for that long sounds like connectivity to the database was lost or a service stopped.
Restarting the services is the only way to get them back to green, although when restarting the services it is 50/50 whether the service will actually stop; it generally requires the process to be killed on each node in a pair to bring the pool back to green.
Next time it occurs I'll try to grab some logs to see if that gives any further clues.
I just need my maintenance sorted (today, hopefully) and then I can download the next RC to take a look.
Was there anything in particular that was addressed that you can share details on? It would be good to understand the cause and whether our environment was a contributing factor.
It's a little too in the details to explain, and quite honestly without a lot of Orion development context, probably even more difficult to understand. If you're uncomfortable upgrading without validation that the issue you're experiencing is the same as the ones addressed in Orion Platform 2019.2, then I recommend opening a case with support. They will be able to investigate the issue and work with engineering to determine if the issue you're encountering is related.
Not a problem, to be honest hearing the issue should be resolved is enough.
If it still occurs, I will raise a case to see if our issue is a different one.
Upgraded to 2019.2 HF2 a couple of weeks back and so far so good; I haven't had a single issue with HA.
Email alerts have been a little noisier on the main poller, but I can live with that. Thanks for the work on this.
I'm still working with support on the case. They did give me a useful alert to set up. Basically, it runs a SWIS query against the HA pool members table. I'll upload the XML file they gave me as well.
-- SELECT/FROM completed for context; adjust the field list to taste.
SELECT Nodes.SysName, HA_PoolMembers.Status, HA_PoolMembers.LastHeartbeatTimestamp
FROM Nodes
INNER JOIN HA_PoolMembers
    ON Nodes.SysName = HA_PoolMembers.HostName
WHERE HA_PoolMembers.Status != '2'
I changed the query a bit. It looks like Status = 1 is healthy in the HA_PoolMembers table, so we basically wrote an alert looking for anything other than that condition.
SELECT HostName, Status, LastHeartbeatTimestamp
FROM HA_PoolMembers
WHERE Status != '1'
The only thing we have done so far with support is pull diagnostic files. All our servers are now up to date with current Microsoft patches, so we have eliminated that as a possibility.
Also, here is a log message I found in the HighAvailability.service log in C:\ProgramData\SolarWinds\Logs\HighAvailability
It references skipping synchronization because of a pending sync task. The timestamp of the log entry is 3:09 PM, but you can see the last sync completed at 6:38 AM the same day, hence the "down" state.
SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2
POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/] - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1
2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO SolarWinds.HighAvailability.Kernel.PoolManagement.PoolSyncCoordinator - >WPC6BBD-5-MainPoller: Skipping SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode synchronization, because of pending sync task (PoolId: 7)
Can't offer a solution, but I just want to point out that you are not alone; I've had a few clients lately whose HA pools just randomly die until a service restart brings them back. As an interim measure we scripted a stop and start of the services every night, and that seems to keep them healthy for the rest of the day until support can come up with an actual solution.