HA host going into down(red)state

About a month ago, our servers started going into a down state in the High Availability Deployment Summary, which disables the HA pools.

What I am seeing is that the LastHeartbeatTimestamp column in dbo.HA_PoolMembers stops updating, which causes this issue.  The workaround is to force stop the HA service on the standby servers.

I opened a case with support, but so far they have not been able to find a problem.

The NPM version is 12.3 and the Orion Platform version is 2018.2 HF6.

Here are some screenshots of HA deployment summary:

pastedImage_0.png

pastedImage_1.png

0 Kudos
25 Replies
Level 10

Update on our case.  We need to run validation for a few more days, but I think we found the root cause.  We are running SQL HA across 2 separate sites.  Our DBAs had mistakenly set the replication to synchronous.  Since the database VMs are in separate sites, we need asynchronous replication.  After configuring asynchronous replication in SQL, we have not seen the HA nodes change to a critical state or the HA table stop updating.  We also set up the Configuration Wizard connection to point to the FQDN of the SQL listener.  I think the replication configuration may be the culprit.  Will update the case once we are sure.

The key error we saw, and that support pointed out, was in the SolarWinds application event log:

Log Name: SolarWinds.Net

Source: SolarWinds.SyslogTraps.SyslogService

Date: 4/23/2019 3:51:49 AM

Event ID: 4001

Task Category: None

Level: Error

Keywords: Classic

User: N/A

Computer: HOST-our-host

Description:

Service was unable to open new database connection when requested.

SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)

Connection string - Data Source=OUR_SQL_Instance ;Initial Catalog=SolarWindsOrion;Persist Security Info=False;User ID=SQL-dummy-user ;X;Max Pool Size=1000;Connect Timeout=20;Load Balance Timeout=120;Packet Size=4096;Application Name="Orion Syslog Service@SyslogService.exe";Workstation ID=Ourhost;MultiSubnetFailover=True

Event Xml:

We did finally close our case.  Our HA environment has stayed in a healthy state for over 10 days now.  So yes, if your HA pool goes to a critical state, look for SQL errors in the SolarWinds event log and make sure SQL connectivity is not the root cause.
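For anyone who wants a quick first check of SQL connectivity from a poller, here is a minimal sketch. It only tests whether a TCP connection to the SQL listener can be opened, which is the layer the "remote computer refused the network connection" error above points at. The default port 1433 is an assumption (a named instance or listener may use a different one), and the function name is made up for illustration.

```python
import socket


def sql_port_reachable(host, port=1433, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    This checks network reachability only. It does not log in to SQL Server
    or verify the state of the availability group.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running something like sql_port_reachable("your-listener-fqdn") on a schedule from each HA member, and logging the result with a timestamp, may help correlate connection drops with the moment the heartbeat stops updating.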

Level 10

Update on this case. I am still working with support.  We have not found any evidence pointing to a root cause, so I cannot say what the origin of the problem is at this point, but I will update the post.  I did work with my contact at SolarWinds to write a query that shows when the LastHeartbeatTimestamp is more than 1 hour behind the current UTC time.  This alerts us when the condition occurs.  We do see some cases where Status = 1 while HA is in a critical state.
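The query itself is only in the screenshot, so as a rough sketch of the same idea, the staleness test can be written like this. It assumes the rows have already been fetched from the pool members table and that LastHeartbeatTimestamp is a timezone-aware UTC datetime. The function name and the sample host names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_members(rows, now=None, max_age=timedelta(hours=1)):
    """Return host names whose last heartbeat is older than max_age."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for row in rows:
        age = now - row["LastHeartbeatTimestamp"]
        if age > max_age:
            stale.append(row["HostName"])
    return stale


# Made-up sample rows; both members report Status 1, but one heartbeat is old.
now = datetime(2019, 3, 29, 15, 9, tzinfo=timezone.utc)
rows = [
    {"HostName": "poller-a", "Status": 1,
     "LastHeartbeatTimestamp": datetime(2019, 3, 29, 15, 8, tzinfo=timezone.utc)},
    {"HostName": "poller-b", "Status": 1,
     "LastHeartbeatTimestamp": datetime(2019, 3, 29, 6, 38, tzinfo=timezone.utc)},
]
print(stale_members(rows, now=now))
```

Checking the heartbeat age rather than the Status value is deliberate here, since Status can still read 1 while the pool shows as critical.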

pastedImage_0.png

0 Kudos

Mine had been quiet for a few weeks, and then today I get in and bang, all but two pools are red again...

pastedImage_0.png

Support's answer was to recreate the pools due to an erroneous record in the database, but it still happens.

0 Kudos

Have you tried restarting services on one of those pollers? The lack of a heartbeat for that long sounds like connectivity to the database was lost or a service stopped.

0 Kudos

Restarting the services is the only way to get them back to green. Even then, it is 50/50 whether a service will actually stop, and it generally requires killing the process on each node in a pair to bring the pool back to green.

Next time it occurs, I'll try to grab some logs to see if they give any further clues.

0 Kudos

There are a couple of similar issues reported by other customers which were addressed in Orion Platform 2019.2, included in NPM 12.5 which can be downloaded now from your Customer Portal.

0 Kudos

Nice

Just need my maintenance sorting (today hopefully) and I can get the next RC downloaded to take a look.

Was there anything in particular that was addressed that you can share details on? It would be good to understand what the cause was and whether our environment was a contributing factor.

0 Kudos

It's a little too in the details to explain, and quite honestly without a lot of Orion development context, probably even more difficult to understand. If you're uncomfortable upgrading without validation that the issue you're experiencing is the same as the ones addressed in Orion Platform 2019.2, then I recommend opening a case with support. They will be able to investigate the issue and work with engineering to determine if the issue you're encountering is related.

0 Kudos

Not a problem, to be honest hearing the issue should be resolved is enough.

If it still occurs then will raise a case to see if our issue is a different one.

Thank you

0 Kudos

Hi aLTeReGo​,

Upgraded to 2019.2 HF2 a couple of weeks back and so far so good, haven't had a single issue with HA.

Email alerts have been a little bit noisier on the main poller, but I can live with that. Thanks for the work on this.

Level 10

I'm still working with support on the case.  They did give me a cool alert to set up.  Basically, it runs a custom query condition against the HA pool members table.  I'll upload the XML file they gave me as well.

Inner join HA_PoolMembers

on Nodes.Sysname = HA_PoolMembers.Hostname

where HA_PoolMembers.status != '2'

pastedImage_0.png

0 Kudos

I changed the query a bit.  It looks like Status = 1 means healthy in Orion.HA.PoolMembers, so we wrote an alert that fires on any other value.

SELECT HostName, Status, LastHeartbeatTimestamp

FROM Orion.HA.PoolMembers

WHERE Status != '1'

0 Kudos
Level 10

The only thing we have done so far with support is pull diagnostic files.  All our servers are up to date with current Microsoft patches now so we have eliminated that as a possibility.

0 Kudos
Level 10

Also, here is a log message I found in the HighAvailability.service log in C:\ProgramData\SolarWinds\Logs\HighAvailability

It references skipping synchronization because of a pending sync task.  The log entry is timestamped 3:09 PM, but the last completed sync was at 6:38 AM the same day, which lines up with the "down" state.

PoolSyncCoordinator:

PoolNodes:

  SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2

  POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/] - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1

ShouldSynchronize:True

NoContactWithPoolMaster:True

2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO  SolarWinds.HighAvailability.Kernel.PoolManagement.PoolSyncCoordinator - >WPC6BBD-5-MainPoller: Skipping SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode synchronization, because of pending sync task (PoolId: 7)
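To make the gap in that excerpt easier to spot across a whole log file, here is a small sketch that pairs each LastSyncCompletedTimestamp with the next timestamped log line and reports the difference. The regular expressions are inferred from the lines pasted above only (day-month-year in the dump, year-month-day with a comma before the milliseconds on the log line), so treat the formats as assumptions. The function name is made up.

```python
import re
from datetime import datetime, timedelta

# Formats inferred from the excerpt above; other log versions may differ.
SYNC_RE = re.compile(
    r"LastSyncCompletedTimestamp=(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3})"
)
LOG_TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def sync_lags(lines):
    """Return the time between each reported last sync and the next log line."""
    pending, lags = [], []
    for line in lines:
        match = SYNC_RE.search(line)
        if match:
            pending.append(
                datetime.strptime(match.group(1), "%d-%m-%Y %H:%M:%S.%f")
            )
            continue
        match = LOG_TS_RE.match(line)
        if match:
            logged = datetime.strptime(
                match.group(1) + "." + match.group(2), "%Y-%m-%d %H:%M:%S.%f"
            )
            lags.extend(logged - synced for synced in pending)
            pending = []
    return lags


sample = [
    "  SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode"
    " - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2",
    "  POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/]"
    " - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1",
    "2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO"
    "  PoolSyncCoordinator - Skipping synchronization (PoolId: 7)",
]
lags = sync_lags(sample)
stalled = [lag for lag in lags if lag > timedelta(hours=1)]
print(len(stalled), "sync(s) more than an hour behind")
```

For the excerpt above, both entries come out roughly eight hours behind the log line, which matches the 6:38 AM versus 3:09 PM observation.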

0 Kudos

wlouisharris​, do you by chance have a support case open for this issue? If so, would you be willing to share your case number?

0 Kudos

aLTeReGo​, I had a ticket open a while back for mine but it still reoccurs

Case # 00237473 - HA Service on multiple pools reports failed

0 Kudos

Hey alterego - here is the support case # -

CASE # 00276075 - HA POOLS SWITCHING TO UNKOWN STATE

0 Kudos

I have reached out to support management and requested that your case be escalated.

0 Kudos

Can't offer a solution, but just want to point out that you are not alone. I have had a few clients lately whose HA pools just randomly die until a service restart brings them back.  As an interim solution, we scripted a stop and start of the services every night, and that seems to keep them healthy for the rest of the day until support can come up with an actual solution.

- Marc Netterfield, Github
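The script itself was not posted, so the following is only a guess at what such a nightly stop and start could look like. It uses the standard Windows "net stop" and "net start" commands. The service name in the comment is a placeholder, not a real SolarWinds service name, so substitute the names shown in your own Services console. By default the function only returns the commands it would run.

```python
import subprocess


def restart_services(names, dry_run=True):
    """Stop, then start, each named Windows service.

    Returns the list of commands. With dry_run=True nothing is executed.
    """
    commands = []
    for action in ("stop", "start"):
        for name in names:
            commands.append(["net", action, name])
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
    return commands


# "ExampleHAService" is a hypothetical placeholder name.
for cmd in restart_services(["ExampleHAService"]):
    print(" ".join(cmd))
```

As noted earlier in the thread, a service does not always stop cleanly, so a scheduled job like this may still need a fallback that kills the process if the stop command fails.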
0 Kudos