Anyone else having job engine issues affecting polling?

We are on version 2023.2.0

What we have noticed since upgrading to this version is that interfaces suddenly start to drop out of polling. Then devices randomly bounce in and out of polling. Then agents start to disconnect, even though the monitored servers were not touched. As a result, we now have unreliable polling.

When I check the Job Engine logs, I see this:

2023-05-25 09:38:45,014 [29] WARN  SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  Error during 'Clear scm' call. SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. Response to message 'Clear' was not received in '00:02:00'.

This repeats from top to bottom of the logs, with a few other random warnings in between. This is not typical. Usually when I look at the Job Engine it's all normal operations, so seeing several logs that all say the same exact thing on all pollers indicates the Job Engine is the problem.
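Since the same warning shows up over and over across pollers, a quick way to quantify it is to scan the Job Engine log for that timeout line and tally hits per agent endpoint. This is just a minimal sketch: the regex is an assumption based on the one warning line quoted above, not an official log format, so adjust it if your log lines differ.

```python
import re
from collections import Counter

# Pattern assumed from the single warning line shown above; the endpoint
# GUID and message name are captured so repeats can be tallied.
WARN_RE = re.compile(
    r"Unable to communicate with agent JobEngine on endpoint "
    r"(?P<endpoint>[0-9a-fA-F-]{36})\. Response to message "
    r"'(?P<msg>\w+)' was not received in '(?P<timeout>[\d:]+)'"
)

def tally_agent_timeouts(log_lines):
    """Count communication-timeout warnings per agent endpoint."""
    counts = Counter()
    for line in log_lines:
        m = WARN_RE.search(line)
        if m:
            counts[m.group("endpoint")] += 1
    return counts
```

Running this over each poller's Job Engine log and comparing the endpoint counts can at least show whether it's one misbehaving agent or (as in our case) everything timing out at once.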

Now, I've done a repair on the engine I've been working on. It didn't work.

I cleared the SDF files and had them regenerate. It didn't work. 

I cleaned up a whole bunch of agents no longer in use, cleared up credential issues, etc. I made a good effort at housecleaning. This didn't help.

I'm at the point of considering a complete clean uninstall and reinstall on that engine, but I'm skeptical that will help if the above did not. The Agent messaging service was also having issues, and its log has a bunch of warnings and errors too.

Same with the collector. It almost seems as if none of the core polling components are working properly. Has anyone had similar issues?

  • We are having similar problems. I have a support case open. So far we've made it past the basic questions and are now waiting on a response. Reading the release notes for SolarWinds Platform 2023.2.1, there is mention of a fix: "Alerting issues that occurred after the upgrade to 2023.2 were addressed." This is vague, since if SNMP is not polling, some alerts are not working.

    All the SolarWinds services are running, and when I restart them, polling normally comes back up. This is a pain, since we had a few major events happen on campus and SolarWinds did not pick them up as designed because SNMP polling stops randomly.


  • Mention this thread to support. I started trying to consolidate similar instances of everyone's polling issues so they could investigate on their end.

    They have identified that there is a polling issue, since everyone appears to be reporting the same thing: once the service is restarted it works for a bit, then it stops. But they haven't figured out the root cause yet. In different environments, different things are causing the same or similar errors, which makes it hard to figure out exactly what the root of the issue is, but they appear to be honing in on a possible solution. At least that's my understanding based on what they have told me.

    But feel free to mention this thread to them so they can start putting all of our cases together under these polling issues. They have been grabbing performance counter logs manually on our servers to try and catch the error, since in our case the logs aren't catching much of anything, yet it's clear as day that polling randomly keeps stopping regardless of whether the logs report anything or not.

    Hopefully they can catch it with these manual logs and understand what's going on. I've been manually keeping an eye on the next poll times on the pollers; if they all start to fall behind, that's usually when I restart the Job Engine on that particular poller, which gets things going again. I haven't figured out how to automate this in a way that works correctly, so I've been doing it manually.

    I just hope SolarWinds figures this out quickly, as it's becoming a rather heavy effort to keep things stable.
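    The manual check described above (watch the next poll times, restart the Job Engine when they all fall behind) can be sketched in code. This is only a rough illustration of the logic, not SolarWinds guidance: the 5-minute grace period is an arbitrary assumption, and the next-poll timestamps would have to come from somewhere like an SWQL query against Orion.Nodes.

    ```python
    from datetime import datetime, timedelta

    def job_engine_stalled(next_polls, now, grace=timedelta(minutes=5)):
        """Return True when every scheduled next-poll time is overdue.

        next_polls: datetimes of each node's scheduled next poll.
        Requiring ALL of them to be overdue (not just one) is meant to
        distinguish a stalled Job Engine from a single slow node.
        """
        if not next_polls:
            return False  # nothing scheduled, nothing to conclude
        return all(now - t > grace for t in next_polls)
    ```

    If this returns True, the remediation would be restarting the Job Engine service on that poller (on our boxes that would be something like `sc stop` / `sc start` against the Job Engine v2 service, but verify the service name in your own environment before automating a restart).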