Main Polling Engine and APE Server Migration to Server 2022 cause half of our 800 App Monitors to go unknown

Hello, 

Thought I would see if anyone here has any ideas, since SW support was no help.

Current setup: a Server 2016 Main Polling Engine, a 2016 APE, a 2022 Web Server, and another 2022 APE in a separate domain.

I have twice tried migrating the two 2016 servers to new 2022 servers, keeping the same hostnames and IPs.

Both times I migrated the Main Polling Engine first and then the APE. The migration itself seems to go fine and the website comes up, but then random application monitors go into an unknown state, some come back up, then go unknown again. We poll them all with WMI using fixed port 24158. Both times I had to revert because of all the false unknown alerts we got. We have around 800 app monitors, and over 400 would go unknown.
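For reference, this is roughly how I sanity-check the fixed WMI port from the polling engine (node name is a placeholder):

```powershell
# Sketch of checks run from the new polling engine; "target-node" is a
# placeholder for one of the affected WMI-polled nodes.

# 1. Is the fixed DCOM/WMI port reachable?
Test-NetConnection -ComputerName target-node -Port 24158

# 2. Does WMI itself answer over DCOM?
Get-WmiObject -Class Win32_OperatingSystem -ComputerName target-node

# For reference, the fixed port is configured on the target node with:
#   winmgmt /standalonehost
#   netsh advfirewall firewall add rule name="WMI fixed port" dir=in action=allow protocol=TCP localport=24158
```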

The first time, I tried working with support and all they could come up with was to upgrade Orion to 2024.1, which I did. Then I tried again and the same thing happened, except this time only around 130 app monitors were affected. It doesn't matter which polling engine the node and monitor are on, but it's only the two engines I migrated. Nodes with the agent installed work fine.

Also, ALL the nodes on those two polling engines, except the ones with agents, show the Ethernet Adapter in an unknown state.

I am just not sure where to go from here, as it's hard to troubleshoot while all hell is breaking loose.

 

Any ideas or help is greatly appreciated. All SW Support wants to do is collect the logs, but I can't keep the environment down for the days or weeks it takes them to respond.

  • Did you get a cause?

    Feels nextpollinthepast-ey to me

    Would guess that first, then a corrupt jobs DB, then maybe the legacy PowerShell setting if these were all PowerShell monitors, though this looks node-layer.

  • I am not sure what nextpollinthepast-ey means exactly; can you elaborate a bit if possible? We only have a handful of nodes using PowerShell, and ever since they finally let us fix the legacy PowerShell setting instead of updating the config file, those have been fine.

  • There were several problems that appear across the two different sets of logs. I'll try to summarize some of the symptoms/problems here, but we will want to open a new case to really get to the bottom of all of these issues.

    1. SAM logs are flooded with WinRM errors. You'll want to refer to the relevant documentation for the proper steps to trust the proper domains. This is likely why some of the SAM components utilizing WinRM are not working.
    2. The Configuration Wizard logs show signs it was not able to determine the proper SSL certificate. This is likely part of the reason the certificate was not bound properly and multiple steps were needed to fix it. It was hard to tell from the logs, but I can only assume the hostname/FQDN did not resolve to the bound IP address. Ideally, we need to figure out why the Configuration Wizard could not detect the cert in the first place.
    3. There is an underlying problem where several SAM applications report "Due to licensing constraints, application polling is disabled. To obtain a larger license, call SolarWinds Customer Service." It seems clear from the logs that there are enough licenses, so this appears to be a software bug. I believe it was stated that the rebuild fixed the licensing problem, which in turn may have helped here. The question we are still trying to root-cause is why the license check fails in certain scenarios.
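    As a rough sketch of the WinRM trust fix in item 1 (domain and host names are placeholders, not your actual values), the client-side TrustedHosts list on the polling engine can be checked and extended like this:

    ```powershell
    # Run on the polling engine; 'otherdomain.local' is a placeholder.
    Get-Item WSMan:\localhost\Client\TrustedHosts

    # -Concatenate appends to the existing list rather than overwriting it.
    Set-Item WSMan:\localhost\Client\TrustedHosts -Value '*.otherdomain.local' -Concatenate

    # End-to-end WinRM test against an affected node:
    Test-WSMan -ComputerName monitored.otherdomain.local
    ```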
  • There's a value called NextPoll in the Nodes table; when it ends up in the past somehow, it's a sign the polling scheduler has gotten messed up. This usually causes greys from the node level down.
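    If you want to check for it directly, a query along these lines against the Orion database will show any nodes scheduled in the past (column names are assumed from the standard dbo.Nodes schema; adjust for whether your install stores UTC or local time):

    ```sql
    -- Assumed standard Orion schema; swap GETUTCDATE() for GETDATE() if your
    -- install stores local times in these columns.
    SELECT NodeID, Caption, NextPoll, NextRediscovery
    FROM dbo.Nodes
    WHERE NextPoll < GETUTCDATE()
    ORDER BY NextPoll;
    ```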

  • Thanks for the reply. The website issue was an odd one: no matter how I set the IIS bindings, I always got an HTTP 500 error on the Main PE. The same cert and bindings worked on the new server until I failed back. The first time I tried the migration the website was not an issue; that started after upgrading to 2023.4.

    The licensing store issue was also fixed a while ago, before I tried the latest migration. We had switched to HCO.

    The WinRM thing is interesting, though. As I went through a few nodes I remember having issues with, the WinRM box is not checked, but anything added over the last couple of years seems to have it checked.

    Maybe that would explain the SAM polling issues, as only 25% of our 800 app monitors had the issue, and none of the nodes using the agent, like our DCs, had it.

    That still would not explain why every single node being polled with WMI had the Ethernet Adapter issue.

  • I am interested in the certificate issue. We did make some changes in later versions to make sure certificate bindings work properly. In a past release, we had an issue where the binding would fail and cause a blank certificate bind scenario, which caused other problems. This could be related, but we need to investigate more.

    If you are still interested, I can engage the support team to open a new case and help move this forward.

  • For the Ethernet Adapter issue: check whether you have charted CPU data for the last hour for the parent node.
