Hi,
I am posting this issue after nothing else worked: support cases, Windows troubleshooting, log capture, and so on.
Environment -
1. NPM, SAM, and SRM on the primary poller. Configuration: 160 GB RAM and 16-core CPU. OS: Windows Server 2008 R2 SP1.
2. SQL DB on a separate server: 256 GB RAM and 16-core CPU. OS: Windows Server 2008 R2 SP1.
3. Three Additional Polling Engines (APEs) for NPM and SAM. SRM is also installed, but we are not using it so far. Configuration for each APE: 40 GB RAM, 16-core CPU. OS: Windows Server 2012 Standard.
Issue -
An APE server goes into a hung state every 3-4 days. When that happens, SolarWinds stops monitoring, we get a flood of false email alerts for past events, and DCOM failure events are logged. It only happens in the evening, around 4-6 PM EST.
Initially we had only one APE. After we moved part of the primary poller's load to it, it worked fine for one week, but then it started going into a hung state. Nothing could be done except a reboot, after which monitoring worked fine again.
After we opened a case with the SolarWinds Support Team, they eventually suggested rebuilding the APE on another machine. We built another APE machine with the same configuration as mentioned above for the APEs and moved all the nodes from the original APE to the second APE. Now the issue has started on the new APE.
We have had multiple cases in which SolarWinds said this is a system issue, not an issue with the SolarWinds product.
We were also seeing a lot of subscription errors, but they stopped after upgrading SolarWinds to 2017 SP2. The polling rate is also normal.
I have another APE (same configuration, on the same host and the same LAN) assigned to another SolarWinds instance that has only NPM, NTA, and IPAM, and that APE is running fine. It has had no hanging issue since it was built.
If the issue were with the host system or its configuration, it should also happen on the other APE in the other SolarWinds instance, because all of them were built at the same time with the same configuration.
Steps we have taken so far:
1. Applied the registry modification for TCP port exhaustion.
2. Excluded the SolarWinds folders from antivirus scanning.
3. Rebuilt the APEs.
4. Increased Resources.
5. Disabled all Down or Unknown AppInsight for SQL and IIS.
6. Unmanaged all down nodes and any nodes that are not responding to WMI or SNMP.
7. Upgraded the SolarWinds platform to the latest version, i.e. Orion 2017 SP2 with NPM 12.1, SRM 6.4, and SAM 6.4.
SolarWinds support case numbers:
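Regarding step 1, here is a minimal sketch of how TCP dynamic port usage on an APE could be sampled in the hours before a hang, to see whether port exhaustion is still in play. It only parses the text output of `netstat -ano`. The function names are made up for illustration, the column layout assumes English-language Windows output, and the 49152-65535 range is the Windows default for Server 2008 and later, which the registry change may have altered (`netsh int ipv4 show dynamicport tcp` shows the actual range). This is a diagnostic aid under those assumptions, not a confirmed cause.

```python
import subprocess
from collections import Counter


def parse_netstat(text):
    """Return (local_port, state, pid) for each TCP row of `netstat -ano` output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # TCP rows have exactly 5 columns: Proto, Local, Foreign, State, PID.
        # UDP rows (no State column) and header lines are skipped.
        if len(parts) != 5 or parts[0] != "TCP":
            continue
        _, local, _, state, pid = parts
        port = int(local.rsplit(":", 1)[1])  # also handles [::]:port
        rows.append((port, state, int(pid)))
    return rows


def summarize(rows, low=49152, high=65535):
    """Count connections whose local port falls in the assumed dynamic range."""
    dynamic = [r for r in rows if low <= r[0] <= high]
    return {
        "dynamic_ports_in_use": len({port for port, _, _ in dynamic}),
        "by_state": Counter(state for _, state, _ in dynamic),
        "by_pid": Counter(pid for _, _, pid in dynamic),
    }


def capture_netstat():
    """Run netstat on the local machine and return its text output."""
    result = subprocess.run(
        ["netstat", "-ano"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `summarize(parse_netstat(capture_netstat()))` from a scheduled task every few minutes and logging the result would show whether the count of dynamic ports in use, or the number of connections in `TIME_WAIT`, climbs toward the size of the range before the 4-6 PM window, and which PID holds most of them.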
118543
1189038
The Windows team has verified everything and found nothing. We tried building a new server, but the same issue occurs there. So SolarWinds says there is no issue at the application level, and the Windows team says there is no issue at the server level. We are stuck in between: we don't know what is causing the SolarWinds APE servers to hang in the same pattern, every 3-4 days in the evening between 4 and 6 PM EST.
Please help us find the root cause.