Anyone else having job engine issues affecting polling?

We are on version 2023.2.0

What we have noticed since upgrading to this version is that interfaces suddenly start to drop out of polling. Then devices randomly bounce in and out of polling. Then agents start to disconnect, even though the monitored servers were not touched. As a result, we now have unreliable polling.

When I check the job engine logs, I see this:

2023-05-25 09:38:45,014 [29] WARN  SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  Error during 'Clear scm' call. SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. Response to message 'Clear' was not received in '00:02:00'.

This warning repeats from top to bottom of the logs, with a few other random warnings in between. This is not typical; normally the job engine logs show nothing but routine operations. Seeing several entries that all say the same exact thing, on all pollers, indicates the job engine is the problem.

I've run a repair on the engine I've been working on. It didn't work.

I cleared the SDF files and had them regenerate. It didn't work. 

I removed a whole bunch of agents no longer in use, cleaned up credential issues, and generally made a good effort at housekeeping. This didn't help either.

I'm at the point of considering a complete clean uninstall and reinstall of that engine, but I'm skeptical that will help if the steps above did not. The Agent messaging service was also having issues, and its log contains a bunch of warnings and errors too.

Same with the collector. It almost seems as if none of the core polling components are working properly. Has anyone had similar issues?

  • I am experiencing similar polling frustrations on 2023.2.

    Since upgrading, I often experience issues where polling randomly stops for all nodes on a specific poller.

    When this occurs, the SolarWinds poller health stats give no indication of a problem. We are therefore having to manually monitor the LastSystemUpTimePollUtc values for our nodes in the SQL DB to spot the issue.

    Restarting the services on the affected poller resumes polling for nodes on that poller.

    I have a case open with Support and they have advised it is a known issue in version 2023.2 and their developers are currently working on a fix for this. I assume it's also still applicable on 2023.2.1.


    For anyone looking to upgrade to 2023.2 (or 2023.2.1), I would hold off!!
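
    The manual staleness check described above can be sketched as a small script. This is a minimal sketch, not the exact query we run: the table and column names (`Nodes`, `Caption`, `LastSystemUpTimePollUtc`) and the 15-minute threshold are assumptions about a typical Orion database schema, so adjust them to match yours.

    ```python
    # Minimal sketch of a "stale poll" check, assuming direct read access to the
    # Orion SQL database. Table/column names below are assumptions; verify them
    # against your own schema before use.
    from datetime import datetime, timedelta, timezone

    # Example T-SQL you might run (e.g. via pyodbc) to fetch candidate rows:
    STALE_QUERY = """
    SELECT Caption, LastSystemUpTimePollUtc
    FROM Nodes
    WHERE LastSystemUpTimePollUtc < DATEADD(minute, -15, GETUTCDATE())
    """

    def stale_nodes(rows, threshold=timedelta(minutes=15), now=None):
        """Return captions of nodes whose last uptime poll is older than threshold.

        rows: iterable of (caption, last_poll_utc) tuples, as fetched from the DB.
        """
        now = now or datetime.now(timezone.utc)
        return [caption for caption, last_poll in rows
                if now - last_poll > threshold]
    ```

    Filtering in Python (rather than only in SQL) makes it easy to alert on the result, for example by emailing the list of stale captions whenever it is non-empty.
    
    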

  • It's good to know we aren't alone. Please spread the word, because it's a bug. We have two cases open with support and are even engaged with their upper support management team, yet they are refusing to escalate this to their developers.

    I'm done butting heads with support.  
