
HA host going into down (red) state

We started having an issue about a month ago. Our servers go into a down state in the High Availability Deployment Summary, and as a result the HA pools are disabled.

What I am seeing is that the LastHeartBeatTimestamp column in dbo.HA_PoolMembers stops updating, which causes this issue. The workaround is to force-stop the HA service on the standby servers.

I opened a case with support, but so far they have not been able to find a problem.

The NPM version is 12.3 and the Orion Platform is 2018.2 HF6.

Here are some screenshots of HA deployment summary:

pastedImage_0.png

pastedImage_1.png


Phew.... I'm not the only one

It all seemed to start for me when I upgraded to 12.4; the pools fail at around 3 a.m. I now have two custom query widgets on my dashboard to keep an eye on the pools (queries below).

When I get a failed pool, I have to restart the HA service on the standby and then on the primary 20 seconds later. Any quicker, or in the wrong order, and it tends to trigger a failover.
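A rough sketch of that restart order, in case it helps anyone script it. Everything here is an assumption rather than a SolarWinds-provided procedure: the function name Restart-HaPoolInOrder is made up, the service name has to be supplied by you (check Get-Service on your own servers), and the 20-second default is simply the gap that worked for me. The restart step is passed in as a script block so the ordering can be dry-run without touching a server.

```powershell
# Sketch only: restart the HA service on the standby first, wait, then on the primary.
function Restart-HaPoolInOrder {
    param(
        [Parameter(Mandatory)] [string] $Standby,
        [Parameter(Mandatory)] [string] $Primary,
        # Not hard-coded on purpose: look up the real service name with Get-Service.
        [Parameter(Mandatory)] [string] $ServiceName,
        # Assumed gap between the two restarts, in seconds.
        [int] $DelaySeconds = 20,
        # Default action restarts the service remotely; replace it for a dry run.
        [scriptblock] $RestartAction = {
            param($Computer, $Service)
            Invoke-Command -ComputerName $Computer -ScriptBlock {
                param($Name)
                Restart-Service -Name $Name -Force
            } -ArgumentList $Service
        }
    )

    # Standby first, then the pause, then the primary.
    & $RestartAction $Standby $ServiceName
    Start-Sleep -Seconds $DelaySeconds
    & $RestartAction $Primary $ServiceName
}
```

Called as, for example, Restart-HaPoolInOrder -Standby 'standby-host' -Primary 'primary-host' -ServiceName '<your HA service name>', where both host names are placeholders. If the service hangs on stop, a plain restart may not be enough and the process may need to be force-stopped first.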

HA Pools

SELECT
    HAP.PoolId AS [Pool ID],
    HAP.DisplayName AS [Pool Name],
    CASE
        WHEN HAP.CurrentStatus = '0' THEN 'Failed'
        WHEN HAP.CurrentStatus = '1' THEN 'Online'
        WHEN HAP.CurrentStatus = '3' THEN 'Degraded'
        WHEN HAP.CurrentStatus = '4' THEN 'Disabled'
        ELSE TOSTRING(HAP.CurrentStatus)
    END AS [Status],
    CASE
        WHEN HAP.CurrentStatus = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAP.CurrentStatus = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
        WHEN HAP.CurrentStatus = '3' THEN '/Orion/images/StatusIcons/small-warning.gif'
        WHEN HAP.CurrentStatus = '4' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_Status],
    CASE
        WHEN HAP.PoolType = '0' THEN 'Main Poller'
        WHEN HAP.PoolType = '1' THEN 'Additional Poller'
        ELSE TOSTRING(HAP.PoolType)
    END AS [Pool Type],
    CASE
        WHEN HAP.Enabled = 'false' THEN 'False'
        WHEN HAP.Enabled = 'true' THEN 'True'
    END AS [Enabled],
    CASE
        WHEN HAP.Enabled = 'True' THEN '/Orion/Admin/Accounts/images/icons/ok_enabled.png'
        WHEN HAP.Enabled = 'False' THEN '/Orion/Admin/Accounts/images/icons/disable.png'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_Enabled],
    CASE
        WHEN HAP.VirtualIpAddress IS NULL THEN TOSTRING('DNS')
        ELSE HAP.VirtualIpAddress
    END AS [HA Method]
FROM Orion.HA.Pools HAP
ORDER BY HAP.PoolType, HAP.DisplayName

HA Pool Members

SELECT
    HAP.DisplayName AS [Pool Name],
    HAPM.HostName AS [Server],
    N.DetailsUrl AS [_LinkFor_Server],
    '/Orion/images/StatusIcons/Small-' + N.StatusIcon AS [_IconFor_Server],
    CASE
        WHEN HAPM.Status = '0' THEN ' Failed'
        WHEN HAPM.Status = '1' THEN ' Online'
        WHEN HAPM.Status = '2' THEN ' Down'
        WHEN HAPM.Status = '14' THEN ' Critical'
        WHEN HAPM.Status = '27' THEN ' Disabled'
        ELSE TOSTRING(HAPM.Status)
    END AS [HA Status],
    CASE
        WHEN HAPM.Status = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAPM.Status = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
        WHEN HAPM.Status = '2' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAPM.Status = '14' THEN '/Orion/images/StatusIcons/small-critical.gif'
        WHEN HAPM.Status = '27' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_HA Status],
    CASE
        WHEN HAPM.PoolMemberType = 'MainPoller' THEN 'Main Poller'
        WHEN HAPM.PoolMemberType = 'MainPollerStandby' THEN 'Main Poller - HA'
        WHEN HAPM.PoolMemberType = 'AdditionalPoller' THEN 'Additional Poller'
        WHEN HAPM.PoolMemberType = 'AdditionalPollerStandby' THEN 'Additional Poller - HA'
        ELSE TOSTRING(HAPM.PoolMemberType)
    END AS [Pool Member Type],
    HAPM.LastHeartBeatTimestamp AS [Last HeartBeat],
    SecondDiff(HAPM.LastHeartBeatTimestamp, GETUTCDATE()) AS [Last Heartbeat (Seconds)]
FROM Orion.HA.PoolMembers HAPM
INNER JOIN Orion.HA.Pools HAP ON HAP.PoolId = HAPM.PoolId
INNER JOIN Orion.Nodes N ON N.Caption = HAPM.HostName
ORDER BY HAPM.DisplayName
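
Since the symptom is a heartbeat timestamp that stops updating, the last two columns of that query could also feed a simple staleness check. The following is only a sketch under assumptions: the function name Get-StaleHaMember is made up, the input rows are assumed to carry Server and LastHeartBeat (UTC) properties however you export them from the query, and the 120-second threshold is an arbitrary default, not a documented HA timeout.

```powershell
# Sketch only: flag pool members whose last heartbeat is older than a threshold.
function Get-StaleHaMember {
    param(
        # Rows with Server and LastHeartBeat (UTC) properties (assumed shape).
        [Parameter(Mandatory)] [object[]] $Members,
        # Arbitrary default; tune it to what "stale" means in your environment.
        [int] $ThresholdSeconds = 120,
        # Overridable so the check can be tested against a fixed time.
        [datetime] $NowUtc = [datetime]::UtcNow
    )

    foreach ($member in $Members) {
        # Whole seconds elapsed since the member last reported a heartbeat.
        $age = [int][math]::Floor(($NowUtc - [datetime]$member.LastHeartBeat).TotalSeconds)
        if ($age -gt $ThresholdSeconds) {
            [pscustomobject]@{ Server = $member.Server; AgeSeconds = $age }
        }
    }
}
```

Any row this returns is a member whose heartbeat has gone quiet for longer than the threshold, which is the condition I watch for on the dashboard before restarting the HA service.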

It's frustrating for sure. I was wondering whether this had been fixed in NPM 12.4, but apparently not. The problem seemed to surface around the time we added additional pollers. The new pollers we added do not use HA.

I have also noticed that if you do not space out the HA restarts on the main poller pool, it will trigger a failover.

I have also noticed that I have to force-stop HA. I use this PowerShell script:

# Force-stop the HA service process on every server listed in the file (one hostname per line).
foreach ($system in Get-Content H:\tools\sec-ha-servers.txt) {
    Invoke-Command -ComputerName $system -ScriptBlock {
        Stop-Process -Force -ProcessName Solarwinds.Highavailability.Service
    }
}

I am going to stay persistent with support on this. We would rather not have to rely on automation to keep HA in working order.


I had a similar issue at a client. They found this:

Verify that RabbitMQ is not disabled.

  1. Go to <OrionServerName>/Orion/Admin/AdvancedConfiguration/Global.aspx
  2. On the Global tab, search for PubSubOverMessageBusEnabled and make sure the check box is selected. Click to select it, if necessary. Save your changes.
  3. Restart all services.

Thanks. Yes, we do have the check box selected for this setting:

pastedImage_0.png


How are the "server-specific" settings? Check this thread: Reccommendations engine kept stopping [Solved]
