HA host going into down(red)state

We started having an issue about a month ago. Our servers go into a down state in the High Availability Deployment Summary, and the HA pools are then disabled.

What I am seeing is that the LastHeartBeatTimestamp column in dbo.HA_PoolMembers stops updating, which causes this. The workaround is to force-stop the HA service on the standby servers.

I opened a case with support, but they have not been able to find a problem.

The NPM version is 12.3 and the Orion Platform is 2018.2 HF6.

Here are some screenshots of HA deployment summary:

[attachment: pastedImage_0.png]

[attachment: pastedImage_1.png]

25 Replies

Re: HA host going into down(red)state

Had a similar issue at a client; they found this:

Verify that RabbitMQ is not disabled.

  1. Go to <OrionServerName>/Orion/Admin/AdvancedConfiguration/Global.aspx
  2. On the Global tab, search for PubSubOverMessageBusEnabled and make sure the check box is selected. Click to select it, if necessary. Save your changes.
  3. Restart all services.

Re: HA host going into down(red)state

Thanks - yes, we do have the checkbox ticked for this setting:

[attachment: pastedImage_0.png]


Re: HA host going into down(red)state

What do the "server specific" settings look like? Check the thread here: Reccommendations engine kept stopping [Solved]


Re: HA host going into down(red)state

Can't offer a solution, but I just want to point out that you are not alone; I've had a few clients lately whose HA pools just randomly die until a service restart brings them back. As an interim solution we scripted a stop and start of the services every night, and that seems to keep them healthy during the rest of the day until support can come up with an actual solution.
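If anyone wants to try the same interim workaround, here is a rough sketch of the nightly restart we schedule. The server list and the service name are placeholders for this example, not taken from any real environment - verify the exact service name with Get-Service on your pollers first:

```powershell
# Nightly interim workaround: restart the HA service on each standby server.
# $servers and the service name are placeholders - adjust for your environment.
$servers = @('STANDBY01', 'STANDBY02')

foreach ($server in $servers) {
    Invoke-Command -ComputerName $server -ScriptBlock {
        # Confirm the exact name with: Get-Service *HighAvailability*
        Restart-Service -Name 'SolarWinds High Availability Service' -Force
    }
}
```

Run it from a scheduled task overnight so the pools are healthy again before business hours.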

- Marc Netterfield, Github

Re: HA host going into down(red)state

Phew... I'm not the only one.

It all seemed to start for me when I upgraded to 12.4, and around 3am the pools fail. I now have two custom query widgets on my dashboard to keep an eye on the pools (queries below).

When I get a failed pool I have to restart the HA service on the standby, and then on the primary 20 seconds later; any quicker, or in the wrong order, and it tends to trigger a failover.
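The restart order above can be scripted so the 20-second gap is always respected. A minimal sketch - the host names and service name here are placeholders, and the exact service name should be checked with Get-Service:

```powershell
# Restart the HA service on the standby first, then the primary ~20 seconds
# later; doing it faster or in the wrong order tends to trigger a failover.
$standby = 'HA-STANDBY01'                           # placeholder host names
$primary = 'HA-PRIMARY01'
$service = 'SolarWinds High Availability Service'   # verify with Get-Service

Invoke-Command -ComputerName $standby -ScriptBlock {
    param($name) Restart-Service -Name $name -Force
} -ArgumentList $service

Start-Sleep -Seconds 20

Invoke-Command -ComputerName $primary -ScriptBlock {
    param($name) Restart-Service -Name $name -Force
} -ArgumentList $service
```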

HA Pools

SELECT
    HAP.PoolId AS [Pool ID],
    HAP.DisplayName AS [Pool Name],
    CASE
        WHEN HAP.CurrentStatus = '0' THEN 'Failed'
        WHEN HAP.CurrentStatus = '1' THEN 'Online'
        WHEN HAP.CurrentStatus = '3' THEN 'Degraded'
        WHEN HAP.CurrentStatus = '4' THEN 'Disabled'
        ELSE TOSTRING(HAP.CurrentStatus)
    END AS [Status],
    CASE
        WHEN HAP.CurrentStatus = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAP.CurrentStatus = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
        WHEN HAP.CurrentStatus = '3' THEN '/Orion/images/StatusIcons/small-warning.gif'
        WHEN HAP.CurrentStatus = '4' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_Status],
    CASE
        WHEN HAP.PoolType = '0' THEN 'Main Poller'
        WHEN HAP.PoolType = '1' THEN 'Additional Poller'
        ELSE TOSTRING(HAP.PoolType)
    END AS [Pool Type],
    CASE
        WHEN HAP.Enabled = 'false' THEN 'False'
        WHEN HAP.Enabled = 'true' THEN 'True'
    END AS [Enabled],
    CASE
        WHEN HAP.Enabled = 'True' THEN '/Orion/Admin/Accounts/images/icons/ok_enabled.png'
        WHEN HAP.Enabled = 'False' THEN '/Orion/Admin/Accounts/images/icons/disable.png'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_Enabled],
    CASE
        WHEN HAP.VirtualIpAddress IS NULL THEN 'DNS'
        ELSE HAP.VirtualIpAddress
    END AS [HA Method]
FROM Orion.HA.Pools HAP
ORDER BY HAP.PoolType, HAP.DisplayName

HA Pool Members

SELECT
    HAP.DisplayName AS [Pool Name],
    HAPM.HostName AS [Server],
    N.DetailsUrl AS [_LinkFor_Server],
    '/Orion/images/StatusIcons/Small-' + N.StatusIcon AS [_IconFor_Server],
    CASE
        WHEN HAPM.Status = '0' THEN ' Failed'
        WHEN HAPM.Status = '1' THEN ' Online'
        WHEN HAPM.Status = '2' THEN ' Down'
        WHEN HAPM.Status = '14' THEN ' Critical'
        WHEN HAPM.Status = '27' THEN ' Disabled'
        ELSE TOSTRING(HAPM.Status)
    END AS [HA Status],
    CASE
        WHEN HAPM.Status = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAPM.Status = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
        WHEN HAPM.Status = '2' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAPM.Status = '14' THEN '/Orion/images/StatusIcons/small-critical.gif'
        WHEN HAPM.Status = '27' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_HA Status],
    CASE
        WHEN HAPM.PoolMemberType = 'MainPoller' THEN 'Main Poller'
        WHEN HAPM.PoolMemberType = 'MainPollerStandby' THEN 'Main Poller - HA'
        WHEN HAPM.PoolMemberType = 'AdditionalPoller' THEN 'Additional Poller'
        WHEN HAPM.PoolMemberType = 'AdditionalPollerStandby' THEN 'Additional Poller - HA'
        ELSE TOSTRING(HAPM.PoolMemberType)
    END AS [Pool Member Type],
    HAPM.LastHeartBeatTimestamp AS [Last HeartBeat],
    SECONDDIFF(HAPM.LastHeartBeatTimestamp, GETUTCDATE()) AS [Last Heartbeat (Seconds)]
FROM Orion.HA.PoolMembers HAPM
INNER JOIN Orion.HA.Pools HAP ON HAP.PoolId = HAPM.PoolId
INNER JOIN Orion.Nodes N ON N.Caption = HAPM.HostName
ORDER BY HAPM.DisplayName


Re: HA host going into down(red)state

It's frustrating for sure. I was wondering if this was fixed in NPM 12.4, but apparently not. The problem seemed to surface around the time we added additional pollers. The new pollers we added do not use HA.

I have also noticed that if you do not space out the HA restarts on the main poller pool, it will trigger a failover.

I have also noticed I have to force-stop HA. I use this PowerShell script:

ForEach ($system in Get-Content H:\tools\sec-ha-servers.txt)
{
    Invoke-Command -ComputerName $system -ScriptBlock { Stop-Process -Force -ProcessName SolarWinds.HighAvailability.Service }
}

I am going to stay persistent with support on this. We would rather not have to rely on automation to keep HA in working order.


Re: HA host going into down(red)state

Also, here is a log message I found in the HighAvailability.Service log in C:\ProgramData\SolarWinds\Logs\HighAvailability.

It references skipping synchronization because of a pending sync task. The timestamp of the log entry is 3:09pm, but the last sync completed at 6:38am the same day, hence the "down" state.

PoolSyncCoordinator:
PoolNodes:
  SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2
  POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/] - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1
ShouldSynchronize:True
NoContactWithPoolMaster:True

2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO  SolarWinds.HighAvailability.Kernel.PoolManagement.PoolSyncCoordinator - >WPC6BBD-5-MainPoller: Skipping SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode synchronization, because of pending sync task (PoolId: 7)

Product Manager

Re: HA host going into down(red)state

wlouisharris, do you by chance have a support case open for this issue? If so, would you be willing to share your case number?


Re: HA host going into down(red)state

Hey alterego - here is the support case number:

CASE # 00276075 - HA POOLS SWITCHING TO UNKOWN STATE
