Is anyone else having Job Engine issues that affect polling?

We are on version 2023.2.0.

Since upgrading to this version, we have noticed that interfaces suddenly start to drop out of polling. Then devices randomly bounce in and out of polling. Then Agents start to disconnect, even though the monitored servers were not touched. As a result, polling is now unreliable.

When I check the Job Engine logs, I see this:

2023-05-25 09:38:45,014 [29] WARN  SolarWinds.JobEngine.AgentSupport.Routing.JobRouterAgentProxy - (null)  Error during 'Clear scm' call. SolarWinds.JobEngine.JobEngineCommunicationException: Unable to communicate with agent JobEngine on endpoint 6fb9d5fd-4531-4c6a-96a4-55e919fc12cc. Response to message 'Clear' was not received in '00:02:00'.

The logs are filled with this from top to bottom, with a few other random warnings in between. This is not typical. Usually when I look at the Job Engine logs they show normal operations, so seeing the same warning repeated throughout the logs on all pollers suggests the Job Engine is the problem.
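For anyone who wants to compare against their own pollers, here is a rough Python sketch that counts these timeout warnings per agent endpoint. It assumes every relevant line has the same shape as the single warning quoted above; the function names are made up for illustration, and the log file path is whatever applies in your environment.

```python
import re
from collections import Counter

# Pattern derived only from the one warning line quoted above (assumption:
# other timeout warnings in the log follow the same wording).
TIMEOUT_RE = re.compile(
    r"Unable to communicate with agent JobEngine on endpoint "
    r"(?P<endpoint>[0-9a-fA-F-]{36})\. "
    r"Response to message '(?P<message>[^']+)' was not received in "
    r"'(?P<timeout>[^']+)'"
)


def count_timeouts(lines):
    """Return a Counter of timeout warnings keyed by agent endpoint GUID."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("endpoint")] += 1
    return counts


def count_timeouts_in_file(path):
    """Hypothetical helper: run count_timeouts over a log file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_timeouts(handle)
```

Calling `count_timeouts_in_file(...)` with the path to a Job Engine log and then `.most_common()` on the result would show whether the timeouts cluster on a few agents or are spread across all of them.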

I've run a repair on the engine I've been working on. It didn't work.

I cleared the SDF files and had them regenerate. It didn't work. 

I cleaned up a whole bunch of agents that were no longer in use, cleared up credential issues, and so on. I made a good effort at housecleaning. This didn't help.

I'm at the point where I'm considering a complete clean uninstall and reinstall on that engine, but I'm skeptical that it will help if the above did not. The Agent messaging service was also having issues, and its log has a bunch of warnings and errors too.

Same with the collector. It almost seems as if none of the core polling components are working properly. Has anyone had similar issues?

  • I thought this was just my issue, since we recently migrated to Google Cloud Virtual Environment (GCVE). I know there were firewall issues with this migration, and I thought I had everything working, but then I noticed the reports I created had no data. WPMs were going into an Unknown status. There may be other issues that I haven't uncovered yet. Anyway, I was very happy with the latest upgrade process, just not so happy with the aftereffects.

  • Open a ticket with support. I believe the more tickets we open with support, the better the chances they realize they need to get this fixed sooner.

    I've tried everything already, to the point of a complete uninstall and reinstall of the entire environment, and it didn't help. I modified many things and it didn't help. I repaired specific components and it didn't help. We ran through the entire list of things we could think of, and polling continued to fail.

    That's when I came to THWACK and found out I wasn't alone. Apparently there are many environments out there that have been affected since upgrading. But I think SolarWinds won't move on this quickly enough unless they have enough tickets. Once they can see that this is a larger issue, they'll move on it a bit quicker.
