Good afternoon all,
I'm having an issue with my SolarWinds server and could use some help figuring out the root cause. I'm not positive this is related to SAM, so hopefully it's in the right forum to begin with, but any assistance would be appreciated. Thank you in advance!
The SolarWinds server runs on a VM inside a vSphere deployment, with a separate SQL server also running on a VM on the same host cluster. Both servers are running Windows Server 2016, are fully patched, and have no networking issues between them, including IP connectivity and SQL database authentication. The SW server has 8 CPU cores assigned to it and 32 GB of RAM. This is the only polling engine, monitoring about 4,000 nodes, give or take, including routers, switches, and virtual servers.
SolarWinds server CPU will go from ~15% up to 100% utilization (22 GHz) over the course of about three minutes. System becomes unresponsive due to lack of resources and can only be accessed following a hard reboot (guest OS restart fails). CPU stays at 100% indefinitely (current record is about six months... don't ask...). Top talker is the LSASS.exe process, which is using 99%+ of the CPU.
The first thing we did was verify that LSASS was running in the correct location (C:\Windows\System32\) and was the digitally signed executable, to rule out malware. Next we verified the appropriate On Access Scan exclusions were configured in McAfee ES Threat Prevention module. Changing the OAS settings had no effect, including disabling it entirely. The system will be stable for several hours following a reboot, then shoots to 100%. It looks like it happens at generally the same times (about 4pm or about 11pm), but not always, suggesting a scheduled task or event of some kind. Over the past two weeks we have systematically enabled and disabled various SolarWinds-specific services to isolate the problem, including leaving all SW services disabled over a weekend to see if a GPO or security scan might be the cause. At this point I have verified with complete certainty that the Job Engine v2 service is the culprit. I have already tried uninstalling and reinstalling the JobEngine component with no luck.
My initial theory was that one of the modules (NCM or SAM maybe) is trying to poll a bunch of devices with invalid credentials. But unless I'm just not understanding how the JobEngine service relies on LSASS, I don't see why that would stress the polling engine. I would also expect a lot more event logs for failed logins, and while there are some, I haven't seen enough to prove this. I have disabled all scheduled jobs in NCM and updated all of the credentials I could find. Unfortunately I inherited this deployment from the last guy without any turnover, so I'm doing a lot of "exploratory learning." Please ask any questions that come to mind; I most likely won't know the answer up front but I'll do what I can to find out. Thanks!
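For anyone wanting to test the invalid-credentials theory more rigorously, here is a rough sketch of how failed logons could be bucketed by hour of day to see whether they cluster around the ~4pm and ~11pm spikes. It assumes the Security log has been exported to CSV with columns named `EventID` and `TimeCreated` (ISO 8601); those column names and the sample rows are hypothetical, so adjust them to whatever your export actually produces. Event ID 4625 is the standard Windows "An account failed to log on" event.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Windows Security log event: "An account failed to log on".
FAILED_LOGON_EVENT_ID = "4625"


def failed_logons_by_hour(csv_text):
    """Count failed-logon events per hour of day.

    Assumes (hypothetically) that the export has 'EventID' and
    'TimeCreated' columns, with ISO 8601 timestamps.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["EventID"].strip() == FAILED_LOGON_EVENT_ID:
            hour = datetime.fromisoformat(row["TimeCreated"].strip()).hour
            counts[hour] += 1
    return counts


# Made-up sample rows, only to show the expected input shape.
SAMPLE = """EventID,TimeCreated
4625,2020-01-06T15:58:12
4624,2020-01-06T15:59:01
4625,2020-01-06T16:01:44
4625,2020-01-06T16:02:10
4625,2020-01-06T23:05:30
"""

if __name__ == "__main__":
    for hour, n in sorted(failed_logons_by_hour(SAMPLE).items()):
        print(f"{hour:02d}:00  {n}")
```

If the counts stay flat across the day, that would argue against a burst of bad-credential polls being what drives LSASS.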
That would make a lot of sense actually; we did have some database connectivity issues due to a credential mismatch a while back. We suck. I'll try this out today and see how it goes. Fingers crossed.
Update: I had ~140 .MQ files on each of my SW servers (two separate, isolated networks; one server each). I followed the guide, cleared out the MQ files, and restarted the message queuing service. I started the JobEngine service as well and will monitor for a few days. Thanks! More to follow.
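In case it helps anyone checking for the same buildup, this is a minimal, read-only sketch for counting the .MQ files before deciding whether to clear them. The storage path is an assumption on my part (the usual default MSMQ location on Windows), and the file names in the demo are illustrative only; the script lists files and never deletes anything.

```python
import tempfile
from pathlib import Path

# Assumption: default MSMQ storage location on Windows; adjust if yours differs.
DEFAULT_MSMQ_STORAGE = Path(r"C:\Windows\System32\msmq\storage")


def list_mq_files(storage_dir):
    """Return the .mq files directly inside storage_dir, sorted by name.

    Read-only: nothing is modified or deleted. A missing directory
    yields an empty list.
    """
    storage = Path(storage_dir)
    if not storage.is_dir():
        return []
    return sorted(
        p for p in storage.iterdir()
        if p.is_file() and p.suffix.lower() == ".mq"
    )


if __name__ == "__main__":
    # Demo against a throwaway directory, not the real queue store.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("r0000001.mq", "P0000002.MQ", "notes.txt"):
            (Path(tmp) / name).write_bytes(b"")
        print(len(list_mq_files(tmp)))  # prints 2
```

On the server itself you would call `list_mq_files(DEFAULT_MSMQ_STORAGE)` from an elevated prompt and compare the count before and after the cleanup.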