-
Re: HA host going into down(red)state
HerrDoktor Mar 29, 2019 10:04 AM (in response to wlouisharris)
Had a similar issue at a client; they found this:
Verify that RabbitMQ is not disabled.
- Go to <OrionServerName>/Orion/Admin/AdvancedConfiguration/Global.aspx
- On the Global tab, search for PubSubOverMessageBusEnabled and make sure the check box is selected. Click to select it, if necessary. Save your changes.
- Restart all services.
-
Re: HA host going into down(red)state
HerrDoktor Mar 29, 2019 5:30 PM (in response to wlouisharris)
How are the "server specific" settings? Check the thread here: Reccommendations engine kept stopping [Solved]
-
Re: HA host going into down(red)state
mesverrum Mar 29, 2019 10:04 AM (in response to wlouisharris)
Can't offer a solution, but I just want to point out that you are not alone. I have had a few clients lately whose HA pools randomly die until a service restart brings them back. As an interim solution we scripted a stop and start of the services every night, and that seems to keep them healthy for the rest of the day until support can come up with an actual solution.
-
Re: HA host going into down(red)state
dsimpkins Mar 29, 2019 10:48 AM (in response to mesverrum)
Phew... I'm not the only one.
It all seemed to start for me when I upgraded to 12.4, and the pools fail at around 3am. I now have two custom query widgets on my dashboard to keep an eye on the pools (queries below).
When I get a failed pool I have to restart the HA service on the standby and then on the primary 20 seconds later. Any quicker, or in the wrong order, and it tends to trigger a failover.
HA Pools
SELECT
    HAP.PoolId AS [Pool ID],
    HAP.DisplayName AS [Pool Name],
    CASE
        WHEN HAP.CurrentStatus = '0' THEN 'Failed'
        WHEN HAP.CurrentStatus = '1' THEN 'Online'
        WHEN HAP.CurrentStatus = '3' THEN 'Degraded'
        WHEN HAP.CurrentStatus = '4' THEN 'Disabled'
        ELSE TOSTRING(HAP.CurrentStatus)
    END AS [Status],
    CASE
        WHEN HAP.CurrentStatus = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAP.CurrentStatus = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
        WHEN HAP.CurrentStatus = '3' THEN '/Orion/images/StatusIcons/small-warning.gif'
        WHEN HAP.CurrentStatus = '4' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_Status],
    CASE
        WHEN HAP.PoolType = '0' THEN 'Main Poller'
        WHEN HAP.PoolType = '1' THEN 'Additional Poller'
        ELSE HAP.PoolType
    END AS [Pool Type],
    CASE
        WHEN HAP.Enabled = 'false' THEN 'False'
        WHEN HAP.Enabled = 'true' THEN 'True'
    END AS [Enabled],
    CASE
        WHEN HAP.Enabled = 'True' THEN '/Orion/Admin/Accounts/images/icons/ok_enabled.png'
        WHEN HAP.Enabled = 'False' THEN '/Orion/Admin/Accounts/images/icons/disable.png'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_Enabled],
    CASE
        WHEN HAP.VirtualIpAddress IS NULL THEN TOSTRING('DNS')
        ELSE HAP.VirtualIpAddress
    END AS [HA Method]
FROM Orion.HA.Pools HAP
ORDER BY HAP.PoolType, HAP.DisplayName
HA Pool Members
SELECT
    HAP.DisplayName AS [Pool Name],
    HAPM.HostName AS [Server],
    N.DetailsUrl AS [_LinkFor_Server],
    '/Orion/images/StatusIcons/Small-' + StatusIcon AS [_IconFor_Server],
    CASE
        WHEN HAPM.Status = '0' THEN ' Failed'
        WHEN HAPM.Status = '1' THEN ' Online'
        WHEN HAPM.Status = '2' THEN ' Down'
        WHEN HAPM.Status = '14' THEN ' Critical'
        WHEN HAPM.Status = '27' THEN ' Disabled'
        ELSE TOSTRING(HAPM.Status)
    END AS [HA Status],
    CASE
        WHEN HAPM.Status = '0' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAPM.Status = '1' THEN '/Orion/images/StatusIcons/small-up.gif'
        WHEN HAPM.Status = '2' THEN '/Orion/images/StatusIcons/small-down.gif'
        WHEN HAPM.Status = '14' THEN '/Orion/images/StatusIcons/small-critical.gif'
        WHEN HAPM.Status = '27' THEN '/Orion/Images/StatusIcons/small-unmanaged.gif'
        ELSE '/Orion/images/StatusIcons/Small-unknown.gif'
    END AS [_IconFor_HA Status],
    CASE
        WHEN HAPM.PoolMemberType = 'MainPoller' THEN 'Main Poller'
        WHEN HAPM.PoolMemberType = 'MainPollerStandby' THEN 'Main Poller - HA'
        WHEN HAPM.PoolMemberType = 'AdditionalPoller' THEN 'Additional Poller'
        WHEN HAPM.PoolMemberType = 'AdditionalPollerStandby' THEN 'Additional Poller - HA'
        ELSE TOSTRING(HAPM.PoolMemberType)
    END AS [Pool Member Type],
    HAPM.LastHeartBeatTimestamp AS [Last HeartBeat],
    SecondDiff(HAPM.LastHeartBeatTimestamp, GETUTCDATE()) AS [Last Heartbeat (Seconds)]
FROM Orion.HA.PoolMembers HAPM
INNER JOIN Orion.HA.Pools HAP ON HAP.PoolId = HAPM.PoolId
INNER JOIN Orion.Nodes N ON N.Caption = HAPM.Hostname
ORDER BY HAPM.DisplayName
-
Re: HA host going into down(red)state
wlouisharris Mar 29, 2019 2:52 PM (in response to dsimpkins)
It's frustrating for sure. I was wondering whether this had been fixed in NPM 12.4, but apparently not. The problem seemed to surface around the time we added additional pollers. The new pollers we added do not use HA.
I have also noticed that if you do not space out the HA restarts on the main poller pool, it will trigger a failover.
I have also noticed that I have to force-stop HA. I use this PowerShell script:
# Force-stop the HA service process on every server listed in the text file
ForEach ($system in Get-Content H:\tools\sec-ha-servers.txt)
{
    Invoke-Command -ComputerName $system -ScriptBlock { Stop-Process -Force -ProcessName Solarwinds.Highavailability.Service }
}
I am going to stay persistent with support on this. We would rather not have to trigger an automation in order to keep HA in working order.
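For anyone scripting the workaround, here is a minimal Python sketch of the restart order described in this thread: standby first, then the primary about 20 seconds later. The function name `restart_pool` and the `restart` callable are hypothetical; how a single host's HA service is actually restarted (for example, by remote PowerShell) is left to the caller, and the 20-second default simply mirrors the delay reported above.

```python
import time
from typing import Callable, List


def restart_pool(
    standby: str,
    primary: str,
    restart: Callable[[str], None],
    delay_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart the HA service on the standby, wait, then on the primary.

    `restart` is a caller-supplied (hypothetical) function that restarts
    the HA service on one host. `sleep` is injectable so the ordering can
    be checked without actually waiting.
    """
    order: List[str] = []
    for index, host in enumerate((standby, primary)):
        if index > 0:
            # Pause between members; restarting too quickly was reported
            # to trigger a failover.
            sleep(delay_seconds)
        restart(host)
        order.append(host)
    return order
```

Passing a list's `append` method as both `restart` and `sleep` is an easy way to dry-run the sequence before pointing it at real servers.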
-
Re: HA host going into down(red)state
wlouisharris Mar 29, 2019 3:43 PM (in response to wlouisharris)
Also, here is a log message I found in the HighAvailability.service log in C:\ProgramData\SolarWinds\Logs\HighAvailability
It refers to skipping synchronization because of a pending sync task. The log entry is timestamped 3:09 PM, but the last sync completed at 6:38 AM the same day, which is what causes the "down" state.
PoolSyncCoordinator:
PoolNodes:
SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2
POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/] - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1
ShouldSynchronize:True
NoContactWithPoolMaster:True
2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO SolarWinds.HighAvailability.Kernel.PoolManagement.PoolSyncCoordinator - >WPC6BBD-5-MainPoller: Skipping SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode synchronization, because of pending sync task (PoolId: 7)
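The gap between the log timestamp and the last completed sync can be worked out mechanically. Below is a rough Python sketch that does so for the excerpt above. It assumes the two timestamp layouts seen in that excerpt (`dd-MM-yyyy HH:mm:ss.fff` for `LastSyncCompletedTimestamp`, `yyyy-MM-dd HH:mm:ss,fff` at the start of a log line); the function name `sync_lag` is hypothetical, and other Orion versions may format the log differently.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Timestamp layouts as they appear in the excerpt (an assumption for other versions).
SYNC_RE = re.compile(
    r"LastSyncCompletedTimestamp=(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d+)"
)
LOG_TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+)")


def sync_lag(log_text: str) -> List[timedelta]:
    """Return how far each LastSyncCompletedTimestamp lags the last log-line timestamp."""
    log_time = None
    syncs = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = LOG_TS_RE.match(line)
        if match:
            log_time = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        for stamp in SYNC_RE.findall(line):
            syncs.append(datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S.%f"))
    if log_time is None:
        raise ValueError("no timestamped log line found")
    return [log_time - sync for sync in syncs]


# Lines copied from the excerpt above.
SAMPLE = """\
SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode - LastSyncCompletedTimestamp=29-03-2019 06:38:11.358, SyncTasksInProgressCount=2
POOL MEMBER [SlaveServiceProxy net.tcp://wpcaf02:17777/HA/] - LastSyncCompletedTimestamp=29-03-2019 07:09:12.571, SyncTasksInProgressCount=1
2019-03-29 15:09:28,294 [HighAvailabilityServiceContainerThread] INFO SolarWinds.HighAvailability.Kernel.PoolManagement.PoolSyncCoordinator - >WPC6BBD-5-MainPoller: Skipping SolarWinds.HighAvailability.Kernel.PoolManagement.DataStorageNode synchronization, because of pending sync task (PoolId: 7)
"""

for lag in sync_lag(SAMPLE):
    print(lag)
```

For the sample, the DataStorageNode lag comes out at roughly eight and a half hours, which matches the 6:38 AM versus 3:09 PM gap noted in the post.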
-
Re: HA host going into down(red)state
aLTeReGo Apr 1, 2019 12:31 PM (in response to wlouisharris)
wlouisharris, do you by chance have a support case open for this issue? If so, would you be willing to share your case number?
-
Re: HA host going into down(red)state
wlouisharris Apr 1, 2019 2:19 PM (in response to aLTeReGo)
Hey alterego - here is the support case # -
CASE # 00276075 - HA POOLS SWITCHING TO UNKOWN STATE
-
Re: HA host going into down(red)state
aLTeReGo Apr 2, 2019 10:45 AM (in response to wlouisharris)
I have reached out to support management and requested that your case be escalated.
-
Re: HA host going into down(red)state
dsimpkins Apr 2, 2019 3:58 AM (in response to aLTeReGo)
aLTeReGo, I had a ticket open a while back for mine, but the problem still recurs.
Case # 00237473 - HA Service on multiple pools reports failed
-
Re: HA host going into down(red)state
wlouisharris Apr 1, 2019 2:21 PM (in response to wlouisharris)
The only thing we have done so far with support is pull diagnostic files. All our servers are now up to date with current Microsoft patches, so we have eliminated that as a possibility.
-
Re: HA host going into down(red)state
wlouisharris Apr 5, 2019 10:26 AM (in response to wlouisharris)
I changed the query a bit. It looks like Status = 1 means healthy in the Orion.HA.PoolMembers table, so we wrote an alert that fires when a member reports any other status.
SELECT HostName, Status, LastHeartbeatTimestamp
FROM Orion.HA.PoolMembers
WHERE Status != '1'
-
Re: HA host going into down(red)state
wlouisharris May 7, 2019 3:41 PM (in response to wlouisharris)
Update on this case. I am still working with support. We have not found any evidence pointing to a root cause, so I cannot say what the origin of the problem is at this point, but I will update the post. I did work with my contact at SolarWinds to write a query that shows when the last heartbeat timestamp is more than 1 hour behind the current UTC time. This alerts us when the condition occurs. We do see some cases where status=1 and HA is in a critical state.
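The actual query was not posted, so the following is only an illustrative Python sketch of the check it describes, applied to rows already fetched from Orion.HA.PoolMembers (HostName, Status, LastHeartbeatTimestamp). The function name `unhealthy_members` and the host names in the usage note are hypothetical. It flags a member when its heartbeat is more than an hour old, even if Status is still 1, and otherwise when Status is anything other than 1.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

# (HostName, Status, LastHeartbeatTimestamp as a timezone-aware UTC datetime)
Row = Tuple[str, int, Optional[datetime]]


def unhealthy_members(
    rows: Iterable[Row],
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=1),
) -> List[Tuple[str, str]]:
    """Return (host, reason) pairs for members that look unhealthy."""
    now = now or datetime.now(timezone.utc)
    flagged: List[Tuple[str, str]] = []
    for host, status, heartbeat in rows:
        if heartbeat is None or now - heartbeat > max_age:
            # Catches the case where Status is still 1 but the heartbeat stopped.
            flagged.append((host, "stale heartbeat"))
        elif status != 1:
            flagged.append((host, "status %s" % status))
    return flagged
```

With made-up rows such as `("poller-b", 1, three_hours_ago)`, the member is reported as `"stale heartbeat"` despite its healthy status, which is the case the status-only alert misses.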
-
Re: HA host going into down(red)state
aLTeReGo May 23, 2019 8:10 PM (in response to dsimpkins)
Have you tried restarting services on one of those pollers? The lack of a heartbeat for that long sounds like connectivity to the database was lost or a service stopped.
-
Re: HA host going into down(red)state
dsimpkins May 24, 2019 10:17 AM (in response to aLTeReGo)
Restarting the services is the only way to get them back to green. Even then, it is 50/50 whether the service will actually stop, and it generally requires killing the process on each node in a pair to bring the pool back to green.
Next time it occurs, I'll try to grab some logs to see if they give any further clues.
-
Re: HA host going into down(red)state
aLTeReGo May 24, 2019 10:23 AM (in response to dsimpkins)
There are a couple of similar issues reported by other customers that were addressed in Orion Platform 2019.2, which is included in NPM 12.5 and can be downloaded now from your Customer Portal.
-
Re: HA host going into down(red)state
dsimpkins May 24, 2019 11:32 AM (in response to aLTeReGo)
Nice
Just need my maintenance sorting (today hopefully) and I can get the next RC downloaded to take a look.
Was there anything in particular that was addressed that you can share details on? It would be good to understand what the cause was and whether our environment was a contributing factor.
-
Re: HA host going into down(red)state
aLTeReGo May 24, 2019 4:34 PM (in response to dsimpkins)
It's a little too in the details to explain, and quite honestly without a lot of Orion development context, probably even more difficult to understand. If you're uncomfortable upgrading without validation that the issue you're experiencing is the same as the ones addressed in Orion Platform 2019.2, then I recommend opening a case with support. They will be able to investigate the issue and work with engineering to determine if the issue you're encountering is related.
-
Re: HA host going into down(red)state
dsimpkins May 26, 2019 10:16 AM (in response to aLTeReGo)
Not a problem. To be honest, hearing that the issue should be resolved is enough.
If it still occurs, I will raise a case to see whether our issue is a different one.
Thank you
-
Re: HA host going into down(red)state
dsimpkins Sep 9, 2019 2:08 AM (in response to dsimpkins)
Hi aLTeReGo,
Upgraded to 2019.2 HF2 a couple of weeks back and so far so good; I haven't had a single issue with HA.
Email alerts have been a little bit noisier on the main poller, but I can live with that. Thanks for the work on this.
-
Re: HA host going into down(red)state
wlouisharris Jun 3, 2019 3:49 PM (in response to wlouisharris)
Update on our case. We need to run validation for a few more days, but I think we found the root cause. We are running SQL HA across 2 separate sites, and our DBAs had mistakenly set the replication to synchronous. Since the database VMs are in separate sites, we need asynchronous replication. Since configuring asynchronous replication in SQL, we have not seen the HA nodes change to a critical state or the HA table stop updating. We also set up the configuration manager to point to the FQDN of the SQL listener. I think the replication configuration may be the culprit. Will update the case once we are sure.
The key error we saw, which support also pointed out, was in the SolarWinds application event log:
Log Name: SolarWinds.Net
Source: SolarWinds.SyslogTraps.SyslogService
Date: 4/23/2019 3:51:49 AM
Event ID: 4001
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: HOST-our-host
Description:
Service was unable to open new database connection when requested.
SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)
Connection string - Data Source=OUR_SQL_Instance ;Initial Catalog=SolarWindsOrion;Persist Security Info=False;User ID=SQL-dummy-user ;X;Max Pool Size=1000;Connect Timeout=20;Load Balance Timeout=120;Packet Size=4096;Application Name="Orion Syslog Service@SyslogService.exe";Workstation ID=Ourhost;MultiSubnetFailover=True
Event Xml:
-
Re: HA host going into down(red)state
wlouisharris Jun 10, 2019 3:16 PM (in response to wlouisharris)
We did finally close our case. I have seen our HA environment stay in a healthy state for over 10 days now. So yes, if your HA pool goes into a critical state, look for any SQL errors in the SolarWinds event log and make sure SQL connectivity is not the root cause.
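As a rough aid for that last check, here is a small Python sketch that scans event text for SQL connection errors. It assumes the events have been exported as plain text in the same "Log Name: / Source: / Date:" layout as the excerpt earlier in this thread; the function name `find_sql_errors` and the second sample event are hypothetical, and matching on the string "SqlException" is a simplification.

```python
import re
from typing import Dict, List


def find_sql_errors(event_text: str) -> List[Dict[str, str]]:
    """Return Source/Date/Event ID for each event that mentions SqlException."""
    # Split the export into events at every line starting with "Log Name:".
    events = re.split(r"(?m)^(?=Log Name:)", event_text)
    hits: List[Dict[str, str]] = []
    for event in events:
        if "SqlException" not in event:
            continue
        fields = dict(
            re.findall(r"(?m)^(Source|Date|Event ID):[ \t]*(.+)$", event)
        )
        hits.append(fields)
    return hits


# First event is abridged from the excerpt in this thread; the second is made up.
SAMPLE = """\
Log Name: SolarWinds.Net
Source: SolarWinds.SyslogTraps.SyslogService
Date: 4/23/2019 3:51:49 AM
Event ID: 4001
Level: Error
Description:
Service was unable to open new database connection when requested.
SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server.
Log Name: SolarWinds.Net
Source: Example.Hypothetical.Service
Date: 4/23/2019 3:55:00 AM
Event ID: 1000
Level: Information
Description:
Service started.
"""

for hit in find_sql_errors(SAMPLE):
    print(hit)
```

Comparing the dates of any hits with the times the pool went critical is one way to judge whether SQL connectivity is involved.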
-