We have a really strange problem. Our environment is:
Win 2019 NPM 12.5 server fully patched, newly built.
Win 2019 SQL 2017 server fully patched, newly built.
All servers running on VMware ESXi 6.5 and specced well over the requirements for our size of deployment, all running on SSDs.
Firewall in between them with all required ports open as per SolarWinds tech docs. AV software set to exclude all SW and SQL sensitive folders and files.
Our problem is that the NPM server is randomly unable to talk to the SQL server and, it seems, to other servers (Windows DCs, DNS, etc.) on the network. The typical SQL error is Event ID 4001 with text as follows:
Service was unable to open new database connection when requested.
SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
Connection string - Data Source=XXXXXXX;Initial Catalog=SolarWindsOrion;Persist Security Info=False;User ID=QQQQQQQ;Password= ;Max Pool Size=1000;Connect Timeout=20;Load Balance Timeout=120;Packet Size=4096;Application Name=SolarWinds.InformationService.ServiceV3@domain-Orion;Workstation ID=YYYYYYY
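For anyone wanting to repeat the connection check outside Orion, here is a minimal sketch in Python (standard library only). It splits a connection string like the one above into its settings and tests whether a plain TCP connection to the SQL host succeeds. Port 1433 is an assumption (the SQL Server default, and it will not apply if a named instance or custom port is in use), and the function names are illustrative only, not part of any SolarWinds tooling.

```python
import socket


def parse_connection_string(conn_str):
    """Split an ADO.NET-style connection string into a dict of key -> value."""
    settings = {}
    for part in conn_str.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        key, _, value = part.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def tcp_port_open(host, port=1433, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 1433 is assumed here; adjust it for the actual instance.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling `tcp_port_open(parse_connection_string(conn_str)["Data Source"])` at intervals would show whether raw TCP to the SQL host fails at the same moment the Event ID 4001 errors appear, independently of the Named Pipes provider mentioned in the error.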
We've run Wireshark captures, which show increasing numbers of TCP retransmissions, TCP out-of-order segments and TCP duplicate ACKs to the SQL server and to other servers. Eventually, it seems the NPM server just decides enough is enough and partially shuts down the TCP/IP stack. At that point, name resolution stops working (even if we add entries to the hosts file), pinging by IP still works, and if we run IPCONFIG, the command prompt hangs without showing the IP information. At the time it hangs, we see a single DCOM error ID 10010 "The server {4991D34B-80A1-4291-83B6-3328366B9097} did not register with DCOM within the required timeout" (relating to BITS), followed a few seconds later by a NETLOGON error ID 5783 "The session setup to the Windows Domain Controller \\XXXXXXX for the domain PING-NS is not responsive. The current RPC call from Netlogon on \\YYYYYYY to \\XXXXXXX has been cancelled." The DCOM error repeats at intervals and Group Policy processing fails as well, all of which indicates the server isn't talking to the domain or to other servers on the network. At this point the only thing to do is reboot the server, after which the cycle repeats.
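To pin down the exact moment name resolution stops working, a small probe along these lines could be left running on the NPM server. This is only a sketch: it resolves a name in a background thread so that a hung lookup (like the hanging IPCONFIG) is reported as a timeout instead of blocking the script. The function names and the 5-second timeout are assumptions chosen for illustration.

```python
import socket
import threading
from datetime import datetime


def resolve_with_timeout(hostname, timeout=5.0):
    """Resolve hostname; return {'addresses': [...]} or {'error': '...'}."""
    result = {}

    def worker():
        try:
            infos = socket.getaddrinfo(hostname, None)
            result["addresses"] = sorted({info[4][0] for info in infos})
        except OSError as exc:
            result["error"] = str(exc)

    # Daemon thread so a lookup that never returns cannot block exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return {"error": "timed out"}
    return result


def log_line(hostname, outcome):
    """Format one timestamped log line for a resolution attempt."""
    stamp = datetime.now().isoformat(timespec="seconds")
    if "addresses" in outcome:
        return f"{stamp} {hostname} OK {','.join(outcome['addresses'])}"
    return f"{stamp} {hostname} FAIL {outcome['error']}"
```

Printing `log_line(name, resolve_with_timeout(name))` every few seconds for the SQL server and a domain controller would give timestamps to line up against the Wireshark capture and the DCOM 10010 / NETLOGON 5783 events.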
We have been through everything we can think of (firewalls especially) and can find nothing wrong. SW tech support are also baffled. We have even built a second Win 2019 NPM server, and that shows the same behaviour.
Any ideas would be gratefully received!