Is anyone else having job engine issues affecting polling?

We are on version 2023.2.0

What we have noticed since upgrading to this version is that interfaces suddenly start to drop out of polling. Then devices randomly bounce in and out of polling. Then Agents start to disconnect, even though the monitored servers were not touched. As a result, we now have unreliable polling.

When I check the job engine logs I see this:

2023-05-25 09:38:45,014 [29] WARN  SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  Error during 'Clear scm' call. SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. Response to message 'Clear' was not received in '00:02:00'.

The logs look like this from top to bottom, with a few other random warnings in between. This is not typical. Usually when I look in the job engine log it shows normal operations, so seeing many entries that all say the exact same thing, on all pollers, suggests the job engine is the problem.
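To put a number on "top to bottom", a rough Python sketch like the one below could count these timeout warnings per agent endpoint. The pattern is built only from the single warning quoted above, and the function names are made up for illustration, so treat it as a starting point rather than a complete parser for the job engine log.

```python
import re
from collections import Counter

# Assumption: every warning of interest contains this phrase followed by the
# endpoint GUID, exactly as in the one log line quoted in this post.
TIMEOUT_RE = re.compile(
    r"Unable to communicate with agent JobEngine on endpoint "
    r"(?P<endpoint>[0-9a-fA-F-]{36})"
)


def count_agent_timeouts(lines):
    """Return a Counter mapping endpoint GUID -> number of matching warnings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("endpoint")] += 1
    return counts


def summarize_log(path):
    """Count matching warnings in one log file (hypothetical helper)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_agent_timeouts(handle)
```

Running `summarize_log` against the log on each poller and comparing the counts would show whether a handful of endpoints account for most of the warnings or whether they are spread across all agents.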

I've done a repair on one engine I've been working on. It didn't work.

I cleared the SDF files and had them regenerate. It didn't work. 

I cleaned up a whole bunch of agents no longer in use, cleared up credential issues, etc. I made a good effort at housecleaning. This didn't help.

I'm at the point where I'm considering a complete clean uninstall and reinstall on that engine, but I'm skeptical that it will help if the above did not. The Agent messaging service was also having issues, and its log has a bunch of warnings and errors too.

Same with the collector. It almost seems as if the core polling components are all not working properly. Has anyone had similar issues?

  • I had a look in my C:\ProgramData\SolarWinds\JobEngine.v2\Logs\SolarWinds.JobEngineService_v2023_2.log and found no matching record. If that's the one you were looking at, it might be an environmental thing.

  • Hmm. That actually doesn't surprise me. I'm going to try some repairs later tonight. Hopefully it'll fix the issue. Thanks for the help.

  • I am experiencing similar polling frustrations on 2023.2.

    Since upgrading I now often experience issues where polling randomly stops for all nodes on a specific poller.

    When this occurs, nothing in the SolarWinds poller health stats indicates a problem. We are therefore having to manually monitor the LastSystemUpTimePollUtc values for our nodes in the SQL DB to spot this issue.

    Restarting the services on the affected poller resumes polling for nodes on that poller.

    I have a case open with Support, and they have advised that it is a known issue in version 2023.2 and that their developers are currently working on a fix. I assume it also still applies to 2023.2.1.

    For anyone looking to upgrade to 2023.2 (or 2023.2.1) I would hold off!!

  • It's good to know we aren't alone. Please spread the word, because it's a bug. We have 2 cases with support, and we are even involved with their upper support management team, and they are refusing to send this up to their developers.

    I'm done butting heads with support.  

  • I just realized you mentioned that they told you it's a known issue. They never told me that on the 2 tickets I have with them. What's your ticket number, if you don't mind me asking? I'd like to mention it on my ticket so they can correlate this.

  • Would you have any further information on this, by chance? Are there any known issues with the current NPM? Apparently I'm not the only one with unreliable polling. I was curious whether anything is known and whether development is currently working on anything. It looks like someone on my thread mentioned that they were told development is working on a fix.

    I just wanted to ask, as this is extremely impactful and currently leaves us with no trust in the polled data. Polling just keeps stopping on its own. A long list of things has been tried, and every attempt to fix it has failed.

  • Have you tried gathering the logs and then submitting Orion Insights? Just to check whether there are potentially environmental issues that you can eliminate to improve the polling.

  • No, I have not, because we have manually checked everything. We have multiple tickets with support, many escalated all the way to the top, and support doesn't believe it to be environmental. We share that view, or it would have shown itself more widely. This is in the mechanism of the software itself: there is no log that explains why it happens, but it does.

    The biggest thing I've found is that this wasn't an issue at all on the previous version. Had I known this new version has polling issues, I would not have upgraded, and if I had caught this sooner I would have reverted everything to the previous version. But I only figured it out after my server team deleted the backups we had taken before upgrading.

    Digging around THWACK, I'm finding the issue isn't only me; there are others with very similar, if not the same, problems. All these different reports keep pointing to a bug of some sort in this version. If it were environmental, the issue would have been just us.

    That's why I'm going everywhere I can and trying all the KB articles, etc., to try to fix this or get the attention of developers so they can look into it.
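For anyone doing the manual LastSystemUpTimePollUtc check mentioned earlier in the thread, here is a hedged Python sketch of how that check could be automated. It assumes you have already pulled one (engine id, LastSystemUpTimePollUtc) pair per node from the database with whatever query you use today; the function name, the row shape, and the 30-minute threshold are all assumptions, not anything SolarWinds documents.

```python
from datetime import datetime, timedelta


def find_stalled_engines(rows, now, max_age=timedelta(minutes=30)):
    """Return sorted engine ids where no node has polled within max_age.

    rows: iterable of (engine_id, last_poll_utc) pairs, one per node, where
    last_poll_utc is a datetime or None (assumed shape, not a real API).
    """
    newest = {}  # engine_id -> most recent poll time seen for that engine
    for engine_id, last_poll in rows:
        current = newest.get(engine_id)
        if last_poll is not None and (current is None or last_poll > current):
            newest[engine_id] = last_poll
        else:
            # Make sure the engine is listed even if all its values are None.
            newest.setdefault(engine_id, current)
    return sorted(
        engine_id
        for engine_id, ts in newest.items()
        if ts is None or now - ts > max_age
    )


if __name__ == "__main__":
    now = datetime(2023, 6, 1, 12, 0, 0)
    rows = [
        (1, now - timedelta(minutes=5)),  # engine 1 is still polling
        (1, now - timedelta(hours=2)),
        (2, now - timedelta(hours=2)),    # every node on engine 2 is stale
        (2, now - timedelta(hours=3)),
    ]
    print(find_stalled_engines(rows, now))
```

An engine is only flagged when every one of its nodes is stale, which matches the "polling stops for all nodes on a specific poller" symptom and avoids alerting on a single unreachable device. The threshold would need tuning to your own polling intervals.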