25 Replies Latest reply on Sep 9, 2019 2:08 AM by dsimpkins

    HA host going into down(red)state

    wlouisharris

      We started having an issue about a month ago.  Our servers go into a down state in the High Availability Deployment Summary, and as a result the HA pools are disabled.

       

      What I am seeing is that the "LastHeartBeatTimestamp" column in the dbo.HA_PoolMembers table stops updating, which causes this issue.  The workaround is to force stop the HA service on the standby servers.

       

      I opened a case with support, but so far they have not been able to find a problem.

       

      The NPM version is 12.3 and the Orion Platform is 2018.2 HF6.

       

      Here are some screenshots of HA deployment summary:

       

       

        • Re: HA host going into down(red)state
          HerrDoktor

          Had a similar issue at a client, they found this:

           

          Verify that RabbitMQ is not disabled.

          1. Go to <OrionServerName>/Orion/Admin/AdvancedConfiguration/Global.aspx
          2. On the Global tab, search for PubSubOverMessageBusEnabled and make sure the check box is selected. Click to select it, if necessary. Save your changes.
          3. Restart all services.
          • Re: HA host going into down(red)state
            mesverrum

            I can't offer a solution, but I want to point out that you are not alone. I have had a few clients lately whose HA pools randomly die until a service restart brings them back.  As an interim solution we scripted a stop and start of the services every night, and that seems to keep them healthy for the rest of the day until support can come up with an actual solution.

              • Re: HA host going into down(red)state
                dsimpkins

                Phew.... I'm not the only one

                 

                It all seemed to start for me when I upgraded to 12.4, and the pools fail at around 3am. I now have two custom query widgets on my dashboard to keep an eye on the pools (queries below).

                When I get a failed pool I have to restart the HA service on the standby and then on the primary 20 seconds later. Any quicker, or in the wrong order, and it tends to trigger a failover.

                 

                HA Pools

                 

                -- One row per HA pool, with readable status and icon columns for a custom query widget
                SELECT
                HAP.PoolId AS [Pool ID],
                HAP.DisplayName AS [Pool Name],
                CASE
                    WHEN HAP.CurrentStatus = '0' THEN 'Failed'
                    WHEN HAP.CurrentStatus = '1' THEN 'Online'
                    WHEN HAP.CurrentStatus = '3' THEN 'Degraded'
                    WHEN HAP.CurrentStatus = '4' THEN 'Disabled'
                    ELSE TOSTRING(HAP.CurrentStatus)
                    END AS [Status],
                CASE
                    WHEN HAP.CurrentStatus = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
                    WHEN HAP.CurrentStatus = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
                    WHEN HAP.CurrentStatus = '3' THEN '/Orion/images/StatusIcons/small-warning.gif'
                    WHEN HAP.CurrentStatus = '4' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
                    ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
                    END AS [_IconFor_Status],
                CASE
                    WHEN HAP.PoolType = '0' THEN 'Main Poller'
                    WHEN HAP.PoolType = '1' THEN 'Additional Poller'
                    ELSE TOSTRING(HAP.PoolType)
                    END AS [Pool Type],
                CASE
                    WHEN HAP.Enabled = 'false' THEN 'False'
                    WHEN HAP.Enabled = 'true' THEN 'True'
                    END AS [Enabled],
                CASE
                    WHEN HAP.Enabled = 'True' THEN '/Orion/Admin/Accounts/images/icons/ok_enabled.png'
                    WHEN HAP.Enabled = 'False' THEN '/Orion/Admin/Accounts/images/icons/disable.png'
                    ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
                    END AS [_IconFor_Enabled],
                -- A pool without a virtual IP address is shown as using DNS
                CASE
                    WHEN HAP.VirtualIpAddress IS NULL THEN TOSTRING('DNS')
                    ELSE HAP.VirtualIpAddress
                    END AS [HA Method]

                FROM Orion.HA.Pools HAP
                ORDER BY HAP.PoolType, HAP.DisplayName

                 

                 

                HA Pool Members

                 

                 

                -- One row per HA pool member, including how long ago its last heartbeat was recorded
                SELECT
                HAP.DisplayName AS [Pool Name],
                HAPM.HostName AS [Server],
                N.DetailsUrl AS [_LinkFor_Server],
                '/Orion/images/StatusIcons/Small-' + N.StatusIcon AS [_IconFor_Server],
                    CASE
                    WHEN HAPM.Status = '0' THEN ' Failed'
                    WHEN HAPM.Status = '1' THEN ' Online'
                    WHEN HAPM.Status = '2' THEN ' Down'
                    WHEN HAPM.Status = '14' THEN ' Critical'
                    WHEN HAPM.Status = '27' THEN ' Disabled'
                    ELSE TOSTRING(HAPM.Status)
                    END AS [HA Status],
                    CASE
                    WHEN HAPM.Status = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
                    WHEN HAPM.Status = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
                    WHEN HAPM.Status = '2' THEN '/Orion/images/StatusIcons/small-down.gif'
                    WHEN HAPM.Status = '14' THEN '/Orion/images/StatusIcons/small-critical.gif'
                    WHEN HAPM.Status = '27' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
                    ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
                    END AS [_IconFor_HA Status],
                    CASE
                    WHEN HAPM.PoolMemberType = 'MainPoller' THEN 'Main Poller'
                    WHEN HAPM.PoolMemberType = 'MainPollerStandby' THEN 'Main Poller - HA'
                    WHEN HAPM.PoolMemberType = 'AdditionalPoller' THEN 'Additional Poller'
                    WHEN HAPM.PoolMemberType = 'AdditionalPollerStandby' THEN 'Additional Poller - HA'
                    ELSE TOSTRING(HAPM.PoolMemberType)
                    END AS [Pool Member Type],

                HAPM.LastHeartBeatTimestamp AS [Last HeartBeat],
                SecondDiff(HAPM.LastHeartBeatTimestamp, GETUTCDATE()) AS [Last Heartbeat (Seconds)]

                FROM Orion.HA.PoolMembers HAPM

                -- The node join matches on caption, so the node caption must equal the HA member host name
                INNER JOIN Orion.HA.Pools HAP ON HAP.PoolId = HAPM.PoolId
                INNER JOIN Orion.Nodes N ON N.Caption = HAPM.HostName

                ORDER BY HAPM.DisplayName

                  • Re: HA host going into down(red)state
                    wlouisharris

                    It's frustrating for sure.  I was wondering if this was a fixed issue in NPM 12.4 but apparently not.  The problem seemed to surface around the time we added additional pollers.  The new pollers we added do not utilize HA.

                     

                    I have also noticed if you do not space out the HA restart on the main poller pool it will trigger a failover.

                     

                    I have also noticed that I have to force stop HA.  I use this PowerShell script:

                     

                    # Force-stop the HA service process on each server listed in the file (one host name per line)
                    ForEach ($system in Get-Content H:\tools\sec-ha-servers.txt)
                    {
                        Invoke-Command -ComputerName $system -ScriptBlock { Stop-Process -Force -ProcessName Solarwinds.Highavailability.Service }
                    }
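
The restart ordering described in this thread (standby first, then the primary about 20 seconds later) can be sketched as below. This is only an illustration, not anything SolarWinds supplies: `restart_ha_pool`, the `restart` callable, and the host names in the usage note are hypothetical, and `restart` stands in for whatever mechanism actually stops and starts the HA service on a host.

```python
import time
from typing import Callable, List


def restart_ha_pool(
    standby: str,
    primary: str,
    restart: Callable[[str], None],
    delay_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart the HA service on the standby, wait, then on the primary.

    `restart` is a hypothetical callable supplied by the caller; it is
    expected to restart the HA service on the given host.
    """
    order: List[str] = []
    for index, host in enumerate((standby, primary)):
        if index > 0:
            # Space the restarts out; the thread reports that restarting
            # too quickly tends to trigger a failover.
            sleep(delay_seconds)
        restart(host)
        order.append(host)
    return order
```

Calling `restart_ha_pool("standby01", "primary01", my_restart)` would restart `standby01`, pause 20 seconds, and then restart `primary01`. The `sleep` parameter exists only so the delay can be replaced in a dry run.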

                     

                    I am going to stay persistent with support on this.  We would rather not have to trigger an automation in order to keep HA in working order.

                • Re: HA host going into down(red)state
                  wlouisharris

                  Also, here is a log message I found in the HighAvailability.service log in C:\ProgramData\SolarWinds\Logs\HighAvailability

                   

                  It references skipping synchronization because of a pending sync task.  The log entry is time-stamped 3:09pm, but the last sync completed at 6:38am the same day, which is what causes the "down" state.

                   

                   

                  PoolSyncCoordinator:

                  PoolNodes:

                    SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2

                    POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/] - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1

                  ShouldSynchronize:True

                  NoContactWithPoolMaster:True

                  2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO  SolarWinds.HighAvailability.Kernel.PoolManagement.PoolSyncCoordinator - >WPC6BBD-5-MainPoller: Skipping SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode synchronization, because of pending sync task (PoolId: 7)
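
The gap between the log entry time and the last completed sync can be computed from an excerpt like the one above. The sketch below assumes the `LastSyncCompletedTimestamp` values use day-month-year order (consistent with `29-03-2019`) and that both timestamps are in the same time zone; the function names are made up for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import List, Optional

# Matches e.g. "LastSyncCompletedTimestamp=29-03-2019 06:38:11.358"
SYNC_RE = re.compile(
    r"LastSyncCompletedTimestamp=(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3})"
)
# Matches the leading timestamp of a log line, e.g. "2019-03-29 15:09:28,294"
LOG_TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def last_sync_times(text: str) -> List[datetime]:
    """Extract every LastSyncCompletedTimestamp value found in the text."""
    return [
        datetime.strptime(value, "%d-%m-%Y %H:%M:%S.%f")
        for value in SYNC_RE.findall(text)
    ]


def log_line_time(line: str) -> Optional[datetime]:
    """Parse the timestamp at the start of a log line, or return None."""
    match = LOG_TS_RE.match(line.strip())
    if match is None:
        return None
    return datetime.strptime(
        f"{match.group(1)}.{match.group(2)}", "%Y-%m-%d %H:%M:%S.%f"
    )


def sync_lag(text: str, logged_at: datetime) -> List[timedelta]:
    """Return how far each last-sync timestamp lags behind the log entry."""
    return [logged_at - synced for synced in last_sync_times(text)]
```

For the DataStorageNode line in the excerpt, this gives a lag of roughly eight and a half hours between the last completed sync and the "Skipping ... synchronization" message.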

                  • Re: HA host going into down(red)state
                    wlouisharris

                    The only thing we have done so far with support is pull diagnostic files.  All our servers are up to date with current Microsoft patches now so we have eliminated that as a possibility.

                    • Re: HA host going into down(red)state
                      wlouisharris

                      I'm still working with support on the case.  They did give me a cool alert to set up.  Basically it runs a SWIS query against the HA pool members table.  I'll upload the XML file they gave me as well.

                       

                       

                      Inner join HA_PoolMembers

                      on Nodes.Sysname = HA_PoolMembers.Hostname

                      where HA_PoolMembers.status != '2'

                       

                      • Re: HA host going into down(red)state
                        wlouisharris

                        Update on this case. I am still working with support.  We have not found any evidence pointing to a root cause.  I cannot say what the origin of the problem is at this point, but I will update the post.  I did work with my contact at SolarWinds to write a query that shows when the LastHeartBeatTimestamp is more than 1 hour behind the current UTC time.  This alerts us when the condition occurs.  We do see some cases where status=1 and HA is in a critical state.
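
The stale-heartbeat condition described above (last heartbeat more than 1 hour behind current UTC) can be illustrated with a small check. This is a sketch, not the query support provided: it assumes the rows have already been fetched as `(hostname, last_heartbeat)` pairs with timezone-aware UTC timestamps, and `stale_members` and the host names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


def stale_members(
    rows: Iterable[Tuple[str, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> List[Tuple[str, timedelta]]:
    """Return (hostname, age) for members whose heartbeat is older than max_age."""
    stale = []
    for hostname, last_heartbeat in rows:
        age = now - last_heartbeat
        if age > max_age:
            stale.append((hostname, age))
    return stale


if __name__ == "__main__":
    # Hypothetical sample data, only to show the call shape.
    now = datetime(2019, 4, 23, 3, 5, tzinfo=timezone.utc)
    rows = [
        ("hostA", datetime(2019, 4, 23, 3, 0, tzinfo=timezone.utc)),
        ("hostB", datetime(2019, 4, 23, 1, 30, tzinfo=timezone.utc)),
    ]
    for hostname, age in stale_members(rows, now):
        print(f"{hostname}: last heartbeat {age} ago")
```

Checking heartbeat age rather than the status value matters here because, as noted above, a member can report status=1 while HA is in a critical state.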

                         

                        • Re: HA host going into down(red)state
                          wlouisharris

                          Update on our case.  We need to run validation for a few more days, but I think we found the root cause.  We are running SQL HA across 2 separate sites.  Our DBAs had mistakenly set the replication to synchronous.  Since the database VMs are in separate sites, we need asynchronous replication.  After configuring asynchronous replication in SQL, we have not seen the HA nodes change to a critical state or the HA table stop updating.  We also set up the Configuration Manager to point to the FQDN of the SQL listener.  I think the replication configuration may be the culprit.  I will update the case once we are sure.

                           

                          The key error we saw, and that support pointed out, was in the SolarWinds application event log:

                           

                          Log Name: SolarWinds.Net

                          Source: SolarWinds.SyslogTraps.SyslogService

                          Date: 4/23/2019 3:51:49 AM

                          Event ID: 4001

                          Task Category: None

                          Level: Error

                          Keywords: Classic

                          User: N/A

                          Computer: HOST-our-host

                          Description:

                          Service was unable to open new database connection when requested.

                          SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)

                          Connection string - Data Source=OUR_SQL_Instance ;Initial Catalog=SolarWindsOrion;Persist Security Info=False;User ID=SQL-dummy-user ;X;Max Pool Size=1000;Connect Timeout=20;Load Balance Timeout=120;Packet Size=4096;Application Name="Orion Syslog Service@SyslogService.exe";Workstation ID=Ourhost;MultiSubnetFailover=True

                          Event Xml:
