Is anyone else having job engine issues affecting polling?

We are on version 2023.2.0.

What we have noticed since upgrading to this version is that interfaces suddenly start to drop out of polling. Then devices randomly bounce in and out of polling. Then Agents start to disconnect, even though the monitored servers were not touched. As a result, we now have unreliable polling.

When I check the job engine logs, I see this:

2023-05-25 09:38:45,014 [29] WARN  SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  Error during 'Clear scm' call. SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. Response to message 'Clear' was not received in '00:02:00'.

This is from top to bottom of the logs, with a few other random warnings in between. This is not typical. Usually when I look in the job engine log it shows normal operations, so seeing many entries that all say the exact same thing on all pollers indicates the job engine is the problem.
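For anyone who wants to compare how widespread this is across pollers, here is a rough sketch of how the timeout warnings could be tallied per agent endpoint. It assumes every entry follows the same layout as the line quoted above; the regex, the `count_timeouts` helper, and the `SAMPLE` constant are my own illustration and not anything shipped with the product.

```python
import re
from collections import Counter

# Pattern modeled on the single log line quoted in this post (an assumption:
# other entries may be laid out differently and will simply be skipped).
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[\d+\] (?P<level>\w+)\s+"
    r".*?on endpoint (?P<endpoint>[0-9a-fA-F-]{36})\. "
    r"Response to message '(?P<message>[^']+)' was not received in '(?P<timeout>[^']+)'"
)


def count_timeouts(lines):
    """Count timeout entries per (endpoint, message) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("endpoint"), match.group("message"))] += 1
    return counts


# The log line from this post, used as sample input.
SAMPLE = (
    "2023-05-25 09:38:45,014 [29] WARN  "
    "SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  "
    "Error during 'Clear scm' call. "
    "SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate "
    "with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. "
    "Response to message 'Clear' was not received in '00:02:00'."
)

if __name__ == "__main__":
    for (endpoint, message), n in count_timeouts([SAMPLE]).most_common():
        print(endpoint, message, n)
```

To use it on real data, pass the lines of each poller's job engine log file to `count_timeouts` instead of `SAMPLE`. If a handful of endpoints account for most of the timeouts, that points at specific agents; if the counts are spread evenly, it points back at the engine.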

I've done a repair on the engine I've been working on. It didn't work.

I cleared the SDF files and had them regenerate. It didn't work. 

I cleaned up a whole bunch of agents no longer in use, cleared up credential issues, etc. I made a good effort at housecleaning. This didn't help.

I'm at the point where I'm considering a complete clean uninstall and reinstall on that engine, but I'm skeptical that it will help if the above did not. The Agent messaging service was also having issues, and its log has a bunch of warnings and errors too.

Same with the collector. It almost seems as if none of the core polling components are working properly. Has anyone had similar issues?

  • No, I have not, because we have manually checked everything. We have multiple tickets with support, many escalated all the way to the top, and support doesn't believe it to be environmental. We share that view, or it would have shown itself more widely. This is in the mechanism of the software itself, where there is no log that explains why it happens, but it does.

    The biggest thing I've found is that this wasn't an issue at all on the previous version. Had I known this new version has polling issues, I would not have upgraded, or had I caught this sooner, I would have reverted everything back to the previous version. But I only figured it out after my server team deleted the backups we had taken before upgrading.

    Digging around thwack, I'm finding the issue isn't only me; there are others with very similar, if not the same, problems. They are all different symptoms that keep further pointing to a bug of sorts in this version. If it were environmental, the issue would have been just us.

    That's why I'm going everywhere I can and trying all the KB articles, etc., to try to fix this, or to get the attention of developers so they can look into it.