Anyone else having job engine issues affecting polling?

We are on version 2023.2.0

What we have noticed since upgrading to this version is an issue where interfaces suddenly start to drop out of polling. Then devices randomly bounce in and out of polling. Then agents start to disconnect even though the monitored servers were not touched. And now we have unreliable polling because of this.

When I check the job engine logs, I see this:

2023-05-25 09:38:45,014 [29] WARN  SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  Error during 'Clear scm' call. SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. Response to message 'Clear' was not received in '00:02:00'.

This repeats from top to bottom of the logs, with a few other random warnings in between. This is not typical. Usually when I look at the job engine it's all normal operations, so seeing several log entries that all say the exact same thing on all pollers indicates the job engine is the problem.

Now, I've done a repair on one engine I've been working on. It didn't work.

I cleared the SDF files and had them regenerate. It didn't work. 

I cleared out a whole bunch of agents no longer in use, cleaned up credential issues, etc. I made a good effort at housekeeping. This didn't help.

I'm at the point where I'm considering a complete clean uninstall and reinstall on that engine, but I'm skeptical that it will help if the above did not. The Agent messaging service was also having issues, and its log has a bunch of warnings and errors too.

Same with the collector. It almost seems as if the core polling components are all not working properly. Has anyone had similar issues?

  • Thanks Tony. Are you aware of a workaround or patch becoming available?

  • What can be considered a "Large environment" for this to happen?

  • Thanks Tony. I have a ticket with support about this that is currently being investigated by application engineers and your development teams. Per support, since this was just recently uncovered, they don't have answers yet.

    And mostly because, yes, the high-level problem is polling issues, but different environments are seeing different issues, according to support.

    In our case specifically, the logs aren't even catching and logging an issue, even with logging adjusted to show everything; they only show correct functions. Then again, some log files weren't being written to at all. The issue is very strange: for example, a device may poll device information like CPU and memory but not the interfaces, then later the interfaces poll but device information doesn't, and then some pollers just stop polling altogether.

    It's like there are random poll cycles that partially work for some things but then fail. We did notice in our environment that there are a lot of events in the Event Viewer that seem to indicate the job engine's worker process is crashing. But again, no logs back this up.

    It's an intermittent, partially working error with no pattern to it, and the Event Viewer and system logs have no leads in them. Support has had to set up manual troubleshooting steps to see if we can catch the error as it happens and record it.

    Support has found several active bug reports internally, all related to job engine problems, and has passed them all to the developers to see if we can find out what the heck is happening.

    Support stated that when something fails it should be all or nothing, not some things failing while the rest works, and then later the parts that didn't work start working while the parts that were working stop.

    It's a head scratcher because we don't have any leads, so we're having to find creative ways to try to figure this out.

  • Tony,

    I've been at this with support for a long while now. They have application engineers and developers both trying to figure this out, because the root cause appears to differ between environments, according to the folks working on my ticket.

    For example, they explained that the outcome in our environment is similar to that internal ticket you mentioned, but the cause is different. It's a similar issue caused by different factors rather than just one.

    It's agreed that the issues have similarities, but it's also apparent that there is more than one cause. They are working with me to gather more information to try to figure it out.

    The general agreement, though, seems to be that yes, the software does indeed have a polling problem; it's just that no one knows exactly what the root cause is, at least from what I've heard. They do seem to be getting closer to figuring it out, though.

  • How about a hotfix, as it's causing outages of critical monitoring infrastructure...

  • Support provided me with a potential workaround from their developers which may help in reducing the frequency of this issue occurring.

    It involves modifying the SWJobEngineSvc2.exe.config file... (use at your own risk!)

    1. Back up the SWJobEngineSvc2.exe.config file in {SolarWindsInstallDirectory}\SolarWinds\Orion.
    2. Edit SWJobEngineSvc2.exe.config and add the parameter jobExecutionCount="10000000" to the jobSchedulerSettings_Full tag, so it looks like <jobSchedulerSettings_Full jobExecutionCount="10000000" …additional parameters>.
    3. Restart the JobEngine service.
    4. If you run into any issues, revert the file and restart the service again.
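    For context, here is a rough sketch of what the edited tag might look like inside SWJobEngineSvc2.exe.config. Only jobSchedulerSettings_Full and jobExecutionCount come from support's instructions; the surrounding structure is an illustrative assumption, and any other attributes already present in your file should be left exactly as they are:

    ```xml
    <configuration>
      <!-- ...other existing sections, left untouched... -->

      <!-- Add jobExecutionCount to the existing jobSchedulerSettings_Full tag;
           keep every attribute that is already there. -->
      <jobSchedulerSettings_Full jobExecutionCount="10000000"
                                 ...existing attributes unchanged... />
    </configuration>
    ```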

    If that doesn't work, they also suggested setting up a Windows Task Scheduler job to restart the JobEngine service every few hours.
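    If you go the scheduled-restart route, something along these lines should work from an elevated prompt on the poller. This is my own sketch, not support's exact command: the task name and the 4-hour interval are arbitrary choices, and "SWJobEngineSvc2" is assumed to be the JobEngine v2 service name (verify yours with `sc query`):

    ```batch
    :: Create a scheduled task (runs as SYSTEM) that restarts the JobEngine
    :: service every 4 hours. Task name and interval are illustrative.
    schtasks /Create /TN "Restart-JobEngine" /SC HOURLY /MO 4 /RU SYSTEM /TR "cmd /c net stop SWJobEngineSvc2 & net start SWJobEngineSvc2"

    :: To remove the task later:
    schtasks /Delete /TN "Restart-JobEngine" /F
    ```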

    I applied the suggested SWJobEngineSvc2.exe.config modification to all my pollers a week ago and the polling issue hasn't recurred (yet!).

  • This is what support also recommended we do to the SWJobEngineSvc2.exe.config file. I made the change yesterday. Now it's just wait and see.

  • We have a 2023.2.1 upgrade scheduled for tomorrow.  Is this issue occurring for all large environments?

  • We have a moderately sized environment and this issue has been ongoing since we upgraded to 2023.2. I can't speak for anyone else, but I'm pretty sure they can chime in if they are experiencing this too.

    It is known to SolarWinds, so it is a known issue that they are actively trying to solve. If I were you, I'd wait just to err on the side of caution. But if you decide to move forward, I'd say hang on to those backups. If the problem does happen, you'd want them to revert back to the version you have now.

    Just my opinion.