I've opened a case with support (Case #940956), but while I wait for a response on the diagnostic file I uploaded, I want to ask here as well. We're running NPM 11.5.2 on a virtual server, with the database on a separate, dedicated SQL server. We have 8 products in total, and the server has 8 CPUs at 2.00 GHz and 24 GB of RAM.
On Feb 12th we began getting inundated with latency alerts from various nodes. Investigation showed that the nodes were not actually experiencing latency. We then looked at the Orion server itself and found that its CPU was completely pegged. Stopping all SolarWinds services drops utilization to almost nothing; starting them sends it right back to 100%. We contacted support, and they had us reinstall CoreInstaller, JobEngine, JobEngine.v2, InformationService, and CollectorInstaller. Doing this fixes the problem for a couple of days, but then it comes back.
Our troubleshooting indicates that SolarWinds.JobEngineWorker.v2 is the specific service that causes the load when it starts. We tried increasing the number of CPUs from 8 to 16, and this worked for a couple of days, but now it's back to the same issue even with double the recommended number.
This was helpful for understanding how NPM works
Our problem seems similar to these other issues:
Here is the relevant information about how much we're polling:
|Network Node Elements||1086|
|SAM Application Polling Rate||5% of its maximum rate|
|Routing Polling Rate||1% of its maximum rate|
|UnDP Polling Rate||0% of its maximum rate|
|Polling Rate||30% of its maximum rate|
|SAM Windows Scheduled Tasks Polling Rate||2% of its maximum rate|
|Hardware Health Polling Rate||15% of its maximum rate|
|Fibre Channel Polling Rate||0% of its maximum rate|
|Wireless Polling Rate||0% of its maximum rate|
|Wireless Heat Map Polling Rate||0% of its maximum rate|
|Total Job Weight||1890|
|Number of HW Health Monitors||478|
|Number of HW Health Sensors||10757|
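To put the raw counts above in proportion, here is a rough Python sketch that derives two ratios from them. The figures are copied from the table; the idea that these ratios say anything about load is only my assumption, and I don't know of any documented threshold for either one.

```python
# Figures copied from the polling summary table above.
node_elements = 1086
total_job_weight = 1890
hw_health_monitors = 478
hw_health_sensors = 10757


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rounded to two decimals."""
    return round(numerator / denominator, 2)


# Assumption: these ratios are only a rough way to compare our setup
# with others; they are not an official SolarWinds metric.
sensors_per_monitor = ratio(hw_health_sensors, hw_health_monitors)
job_weight_per_element = ratio(total_job_weight, node_elements)

print(f"HW health sensors per monitor: {sensors_per_monitor}")
print(f"Job weight per node element:   {job_weight_per_element}")
```

That works out to about 22.5 hardware health sensors per monitor and a job weight of about 1.74 per node element, in case anyone can say whether that looks unusual.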
Attached is a capture of all the Job Engine workers. Is it normal to have this many?
JobEngines.PNG 50.5 KB