Main Polling Engine and APE Server Migration to Server 2022 causes half of our 800 App Monitors to go unknown

Hello, 

Thought I would see if anyone here maybe has any idea since SW support was no help. 

Current set up is, Server 2016 Main Polling Engine, 2016 APE, 2022 Web Server and another 2022 APE in another domain. 

I have tried twice to migrate the two 2016 servers to new 2022 servers using the same hostname and IP.

Both times I migrated the Main Polling Engine first and then the APE. The migration seems to go fine and the website comes up, but then random application monitors go to an unknown state, some come back up, and then they go unknown again. We poll them all with WMI using fixed port 24158. I had to revert both times because of all the false unknown alerts we got. We have around 800 app monitors and over 400 would go unknown.

The first time, I tried working with support and all they could come up with was to upgrade Orion to 2024.1, which I did. Then I tried again and the same thing happened, except this time it was only around 130 app monitors. It doesn't matter which polling engine the node and the monitor are on, but it's only the two I migrated. Nodes with the agent installed work fine.

Also ALL the nodes except ones with agents on those two polling engines show Ethernet Adapter is in an unknown state. 

I am just not sure where to go from here, as it's hard to troubleshoot since all hell is breaking loose while trying to do so.

 

Any ideas or help is greatly appreciated. All SW Support wants to do is collect the logs, but I can't keep it down for the days and weeks it takes them to respond.

  • Please ask for an escalation and emphasize that your system is down. You can also reach out to your customer success rep/account manager and ask the same of them, but they may need your ticket number.

    If I understand you correctly, you are trying to migrate your existing Windows 2016 servers to new Windows 2022 based servers? If so, have you applied the port configurations for WMI to the new servers? I assume yes, but wanted to verify. You mentioned nodes that use the agent are reporting better. Are you able to switch some of the app templates away from WMI and use agents (if the agents are on those servers)? That will help isolate WMI and WMI port usage. There is also Wireshark: you can run it on the affected polling engines and try to catch WMI traffic and see whether it is failing to get out or to get a response from a server. Wireshark is useful but brings A LOT of extra detail when it collects data.

    Have you tried running the diagnostics from the server itself? That will eliminate running it from the webconsole and allow you to babysit it. 

    You have tried the migration twice. Going to assume the following was checked over also. If not, there may be some tidbits that could help you. I don't see anything magically revealing in this page, but it has links to all the migration variations.

    Migrate SolarWinds Platform products to a new server using the same IP and hostname
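
Before reaching for Wireshark, a plain TCP connect test from the new polling engine to a handful of affected nodes can narrow down the WMI port question raised above. The sketch below is only an illustration: it assumes WMI is reached over the RPC endpoint mapper on TCP 135 plus the fixed port 24158 mentioned in the original post, and the function names are made up for this example. A successful connect only shows that the port is reachable; it does not prove that WMI authentication or queries work.

```python
import socket

# Ports assumed for this sketch: 135 is the RPC endpoint mapper,
# 24158 is the fixed WMI port mentioned in the original post.
WMI_PORTS = (135, 24158)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_nodes(hosts, ports=WMI_PORTS, timeout=3.0):
    """Map each host to a {port: reachable} dict."""
    return {h: {p: port_open(h, p, timeout) for p in ports} for h in hosts}


# Example call (host names are placeholders, not real nodes):
#   check_nodes(["node-a.example.local", "node-b.example.local"])
```

Running the same list of nodes from the old 2016 poller and from the new 2022 poller would show whether the difference is at the network level or further up the stack.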

  • Thanks for the response, yeah the first time I did have it escalated to the down team at SW and had two guys look at it which I think one was a dev engineer. I let that ride for two days before I was told to revert back. They had me upload diagnostic logs and then took two weeks to get back to me with a response of try upgrading to 2024.1.  

    Yes, I followed that migration doc to a T. The ones with agents work fine. It just makes no sense why only a certain number have issues when they are all using the same polling method and service account, but then all have issues with the vmware adapter being unknown as well. Probably something environmental, but it's pretty frustrating to go through the migration only to have to revert back.

    I am not sure if I ran the active diagnostic from the servers but SW has had the full diagnostic from both new servers and didn't seem to find anything to try even. 

    Wireshark is a good idea, I did have our network team check for drops or anything like that and they didn't see anything

  • Is there any specific reason to migrate to new servers, can you not run in-place upgrades on 2016 systems to 2022?

  • Mark, basically just for best practice purposes on a clean server. I did mention that to my manager but they scoffed at it. It's not something they like us doing, but I might have to talk them into it. I have done in-place upgrades before.

    Have people had success doing that with Orion? 

  • I'd suggest that before you even worry about upgrade/migration stuff you should get the 2022 servers up and running and make sure that they can successfully connect to some of the target servers using WBEMTest.

    https://solarwindscore.my.site.com/SuccessCenter/s/article/Testing-WMI-Connectivity?language=en_US

    That would at least get you able to verify that there isn't something completely wrong with the WMI settings on the new hosts compared to what you had in the old ones. 

    Assuming that you don't have some fundamental issue jamming you up there I wonder if somehow you are running into a port exhaustion issue with the settings you've configured on the new servers.

    https://solarwindscore.my.site.com/SuccessCenter/s/article/Ephemeral-Port-Exhaustion?language=en_US

    Basically I wouldn't think of this as an Orion issue as much as I'd be chasing down why wmi is broken/unreliable between those new servers and the existing environment. 
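
For a rough sense of whether a polling engine is getting close to ephemeral port exhaustion, the text output of `netstat -ano -p tcp` can be summarised by connection state. The sketch below is a minimal illustration, not a SolarWinds tool: it assumes the default Windows dynamic port range of 49152 to 65535 (the configured range can be checked with `netsh int ipv4 show dynamicport tcp`) and the usual Proto / Local Address / Foreign Address / State / PID column layout. The function names are hypothetical.

```python
from collections import Counter

# Default Windows dynamic (ephemeral) TCP port range; an assumption here,
# the real range on a given server may have been changed.
DYNAMIC_START = 49152
DYNAMIC_END = 65535


def ephemeral_usage(netstat_text, start=DYNAMIC_START, end=DYNAMIC_END):
    """Count TCP rows whose local port is in the dynamic range, by state."""
    states = Counter()
    for line in netstat_text.splitlines():
        parts = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State, PID
        if len(parts) < 4 or parts[0] != "TCP":
            continue
        # rsplit handles both 10.0.0.5:49712 and [::1]:49712
        port_text = parts[1].rsplit(":", 1)[-1]
        if not port_text.isdigit():
            continue
        if start <= int(port_text) <= end:
            states[parts[3]] += 1
    return states


def percent_used(states, start=DYNAMIC_START, end=DYNAMIC_END):
    """Rough share of the dynamic range in use, as a percentage."""
    return 100.0 * sum(states.values()) / (end - start + 1)
```

One way to use it would be to save the netstat output to a file while polling is active and pass the file contents to `ephemeral_usage`. A very large `TIME_WAIT` count on the new servers compared with the old ones would be a hint worth following up.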

  • Thanks for the response. I did run through the WBEMtest on the 2022 servers I have stood up and connected to a grip of systems and they all returned results.

    I do slightly remember support looking at the port exhaustion as well and I don't think that was an issue either. All the systems now have no issues on the 2016 polling engines, so it must be something on the new servers not configured properly, but I have not found anything in any articles that needs to be done before the migration. Other than building the new servers with the correct resources I have done nothing else on them. All our nodes have WMIFixedPort set as well.

    While I do agree it is likely environmental, it feels like it is going to end up being something simple and stupid. Is there anything I need to do on these new servers before the install? I couldn't see it being ports since over half of the app monitors still seem ok.

    I am not sure if this matters or these registry values get added by Orion during the install but I noticed this below.

    Top one is the new 2022 server and bottom one is the existing main polling engine, which also matches the existing APE. I highlighted the two values that are not on the 2022 servers.
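
A by-eye registry comparison like the one described above can also be repeated as a text diff. The sketch below assumes the same key was dumped on each server with `reg query "<key>"` and saved to a text file, and that value lines are indented and separated by runs of four spaces. The key path and value names used in any example input are placeholders, not the values from the post, and the function names are hypothetical. It handles a single key only, not a recursive `/s` dump.

```python
import re


def parse_reg_query(text):
    """Parse `reg query` text output into {value_name: (type, data)}."""
    values = {}
    for line in text.splitlines():
        # Value lines are indented; key path lines start at column 0.
        if not line.startswith("    "):
            continue
        parts = re.split(r"\s{4,}", line.strip(), maxsplit=2)
        if len(parts) < 2 or not parts[1].startswith("REG_"):
            continue
        data = parts[2] if len(parts) == 3 else ""
        values[parts[0]] = (parts[1], data)
    return values


def diff_values(old, new):
    """Return (missing on new, extra on new, changed) value names."""
    missing = sorted(set(old) - set(new))
    extra = sorted(set(new) - set(old))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return missing, extra, changed
```

The first list returned by `diff_values` corresponds to values present on the old polling engine but absent on the new one, which is the case described in the post.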

  • Yeah, we did an in-place upgrade from 2012 to 2019 and all was fine. Done many in-place upgrades on other systems and they have been fine.

  • Since you have a support contract I'd create a diagnostic bundle on all pollers including the original 2016 ones, then upload them to the customer portal - Orion Insights | SolarWinds Customer Portal

    After a few hours you'll be able to download a report for each poller which breaks down the resource usage etc, and will hopefully show you differences between old and new which might point fingers at what's wrong.

    As Marc says, Active Diagnostics can sometimes be your friend. Run it locally on all pollers again and look at the Advanced View for details of what it finds.

  • "but then all have issues with the vmware adapter being unknown as well" That is a loaded sentence. My first thought has to do with upgrading the vmware hardware and tools on the targeted nodes. Have you listed resources and noticed new vmware network adapters? That would at least explain that problem.

    I noted your response to Mesverrum about wbemtest and port exhaustion (hateful thing that). Generally, wbemtest rules out the service account and WMI if it worked. That leaves the SolarWinds installation on that server. If wbemtest works while the server is not polling, then yeah, I tend to think that port exhaustion might be an issue.

    Also check your SQL connectivity. Problems there will cause all sorts of random goofiness. And yeah, you can have port exhaustion when writing to the database. Look for LOTS of these in the Windows event log: ID 4001 in the SolarWinds.Net event log. Source will be SWService. The message will start like this: Service was unable to open new database connection (attempt 4) when requested. If you see lots of them during certain times, and that matches the issues with monitoring, look at SQL backup jobs, etc.
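
To line up the 4001 events with the times monitors went unknown, it can help to bucket them by hour. The sketch below is an illustration under stated assumptions: the events have been exported to CSV with columns named `TimeCreated`, `Id`, `ProviderName` and `Message`, and the timestamps are in ISO 8601 form. A real export may use different column names or a locale-specific date format, in which case the parsing line would need adjusting. The function name is hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def count_4001_per_hour(csv_text, event_id="4001", source="SWService"):
    """Count matching events per hour from CSV text (assumed columns)."""
    per_hour = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Id"] != event_id or row["ProviderName"] != source:
            continue
        # Assumes ISO 8601 timestamps such as 2024-03-05T02:15:03
        when = datetime.fromisoformat(row["TimeCreated"])
        per_hour[when.replace(minute=0, second=0, microsecond=0)] += 1
    return per_hour
```

Hours with a spike in the result can then be compared against SQL backup windows and against the times the app monitors flipped to unknown.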