I've been experiencing some problems with the Job Engine v2 on one of my poller servers. Every so often, the number of Jobs Running stalls for up to 30 seconds, during which no other jobs begin execution. Usually, the scheduler spawns a new worker process to compensate. But the stall repeats and another process is spawned. This can happen many times in succession and double the number of worker processes. The number of Jobs Queued finally starts to drop after a minute or two, and the extra worker processes end about 10 to 15 minutes later.
This would seem to be appropriate behavior, but it causes memory consumption on the poller to creep upward, eventually requiring a server restart about every three weeks. The attached image shows some of the Job Engine metrics in Windows Performance Monitor during a relatively calm spike. I have seen spikes where the entire 15-minute window shows running jobs stalling.
I need to find out which job is causing the stalling. I suspect the stalling is due to a network timeout, so the 30 seconds is my clue. I'd like to get a report of jobs and their durations. The jobs that take about 30 seconds should be easy to pick out and should help me find the true problem.
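To illustrate the kind of filtering I have in mind, here is a minimal sketch. It assumes the job durations could somehow be exported as CSV with hypothetical columns `job_id`, `description`, and `duration_seconds`; I have not found any such export in the Job Engine, so the format, the sample rows, and the function name `find_timeout_jobs` are made up for illustration.

```python
import csv
import io


def find_timeout_jobs(rows, target_seconds=30.0, tolerance=2.0):
    """Return the rows whose duration is within `tolerance` of `target_seconds`.

    `rows` is any iterable of dicts with a "duration_seconds" key
    (a hypothetical column name, not a real Job Engine field).
    Rows with a missing or non-numeric duration are skipped.
    """
    suspects = []
    for row in rows:
        try:
            duration = float(row["duration_seconds"])
        except (KeyError, ValueError):
            continue
        if abs(duration - target_seconds) <= tolerance:
            suspects.append(row)
    # Longest durations first, so the worst offenders are at the top.
    return sorted(
        suspects, key=lambda r: float(r["duration_seconds"]), reverse=True
    )


# Made-up sample data in the assumed CSV layout.
SAMPLE = """job_id,description,duration_seconds
1,poll node A,0.8
2,poll node B,30.1
3,poll node C,4.2
4,poll node D,29.7
"""

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE))
    for job in find_timeout_jobs(reader):
        print(job["job_id"], job["description"], job["duration_seconds"])
```

The tolerance is there because a job that hits a 30-second timeout would probably report slightly more or less than exactly 30 seconds once scheduling overhead is included.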
Many thanks in advance.
That is what I cannot tell. We have Orion 2013.1.0, NCM 7.1.1, SAM 5.5.0, NPM 10.5, IPAM 4.0, VNQM 4.0.1, NTA 3.11.0, and IVIM 1.6.0 installed in the environment. I have an upgrade planned in the near future to bring us up to the latest version of each component. But this issue with the Job Engine has persisted through the last two upgrades.
As far as I've been able to find, the Job Engine is a black box. I can see metrics on the number of workers and jobs going through it, but no details about the job breakdowns or even how the jobs correlate to the configuration within the application. I can make some guesses about the correlation, but that doesn't help much to find any specific details on the culprit.
I can see log files piling up with what appear to be process ID numbers appended to their names, but they are all zero length, with no content to be found. Since I haven't seen anyone else report this problem, I suspect it lies in something custom that was written or in a particular configuration issue, not necessarily in the Job Engine itself.
Please let me know of any knowledge you have.