Level 9

"Connection timeout. Job Canceled by scheduler" - error

Hi,

We are getting a "Connection timeout. Job canceled by scheduler" error for all the monitors in an application template (for example, the SNMP monitors and script monitors within a single template). Could someone shed some light on this? What is this error and why is it happening? How can we rectify it?

1 Solution
Level 21

In large installations this can indicate that your scheduler is overloaded.  I have also seen this happen as a result of one or two component monitors not working properly causing a chain reaction since the APM polling is done in serial, not parallel.

If you don't think that your scheduler should be overloaded, try going into the WebUI and clicking the "Unmanage" button for the application; once that has completed, click "Re-manage". This will update the job on the scheduler, and I have had cases where this resolved the problem.

If that doesn't work, you don't think any of your components are problematic, and your scheduler shouldn't be overloaded then you should definitely open a support ticket with SolarWinds.

Hope this helps!

13 Replies
Level 13

I would try the following:

1) increase the timeout on the application

2) enable debug logging and inspect the log to see whether the polling process is stuck in some specific operation (component)

3) isolate the long-polling component(s) by running the test from the application edit page on each one in turn

How many components in application are we talking about?


I see "increase the timeout on the application" given as an answer fairly often. How is the default of 5 minutes not enough of a timeout period? If a component takes more than 5 minutes to complete polling, then there's something wrong.

Byron,

How can you check to see if the scheduler is overloaded? Are there metrics that can be collected or specific log entries that indicate problems?

This is because the timeout value applies to all component monitors in the template together, not to each individual component monitor. In some situations the 5-minute timeout for the application as a whole is insufficient: there may be too many component monitors in the template, the connection to the monitored server may be highly latent, or a complex script monitor or other component may take a very long time to return results. Another possibility is an overloaded poller, but this is not normally the case.

Increasing the timeout is also a good troubleshooting step. I would recommend doubling the value and seeing whether that reduces or eliminates the timeouts you're seeing. At that point we'll be able to narrow down why the timeouts are occurring.
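To make the arithmetic concrete, here is a minimal Python sketch of an application-wide timeout applied to components that are polled one after another. It does not call any SolarWinds API; the function names and the per-component durations are made up for illustration, and only the 300-second default comes from this thread.

```python
# Illustrative model only: durations are hypothetical, not real measurements.
APP_TIMEOUT_S = 300  # the 5-minute application default mentioned in this thread


def serial_poll_time(durations):
    """Total wall time when components are polled one after another."""
    return sum(durations)


def first_component_past_timeout(durations, timeout_s=APP_TIMEOUT_S):
    """Index of the component that is running when the shared budget runs out.

    Returns None if the whole application finishes within the timeout.
    """
    elapsed = 0.0
    for index, duration in enumerate(durations):
        elapsed += duration
        if elapsed > timeout_s:
            return index
    return None


healthy = [5, 8, 3, 12, 6]      # every component answers quickly
one_stuck = [5, 8, 290, 12, 6]  # one slow script eats most of the budget

print(serial_poll_time(healthy), first_component_past_timeout(healthy))
print(serial_poll_time(one_stuck), first_component_past_timeout(one_stuck))
print(first_component_past_timeout(one_stuck, timeout_s=2 * APP_TIMEOUT_S))
```

In this model a single slow component pushes the whole application past the limit even though the remaining components are fast, and doubling the timeout lets the same run finish. That is why doubling is a useful first test before hunting for the slow component.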




OK, that makes sense. If you've got large templates with a ton of components or very complex scripts, then I could see where that might be an issue. Are they executed serially or in parallel? It seems like you'd need a huge template to tie things up for a full 5 minutes, even if they run sequentially. Most of my issues are with fairly small templates, usually fewer than 5 components and sometimes just one or two, but it makes sense in the context you provide.

Thanks for the reply.

I have just tried unmanaging and re-managing it. It didn't work.

I have found that the issue is with many of the script monitors in the template. There are around 10 script monitors and 3 SNMP monitors in it, and all of the script monitors are failing the test with the same message: "Testing on node xxxxx: failed with 'Undefined' status Error setting prompt (PS1). Invalid state. Connect first."

Could someone let me know what this error means?

I was able to find the issue. The SolarWinds credential had expired on the node, so the node was prompting "Enter new Password:". Unaware of the prompt, the application tried to run the script and threw the error: "Testing on node xxx: failed with 'Undefined' status Error setting prompt (PS1). Invalid state. Connect first."

It was rectified after resetting the password and supplying the right credential. Thanks, everyone, for the quick response.
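One way to catch this sooner is to log in manually with the same account the monitor uses and check what the node prints. The Python sketch below only scans text you have already captured from such a login; it does not open an SSH session or call any SolarWinds API. The pattern list is an assumption: "Enter new Password:" and "password expired" are the strings reported in this thread, and the other phrasings are guesses at common variants that may differ on your systems.

```python
import re

# Assumed prompt phrasings. Only "Enter new Password:" and "password expired"
# come from this thread; adjust the list to whatever your nodes actually print.
EXPIRY_PATTERNS = [
    r"enter new password",
    r"password (has )?expired",
    r"you are required to change your password",
]


def looks_like_expired_password(session_output: str) -> bool:
    """Return True if captured login output contains a password-expiry prompt."""
    text = session_output.lower()
    return any(re.search(pattern, text) for pattern in EXPIRY_PATTERNS)


print(looks_like_expired_password("Enter new Password:"))
print(looks_like_expired_password("user@host:~$ "))
```

If the check reports a match, the script monitor never reached a normal shell prompt, which is consistent with the "Error setting prompt (PS1)" message above.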

You nailed it! My Unix guys kept telling me it's Orion, so when I logged in to try the script manually I got "password expired".

THANKS!!



The "Error setting prompt (PS1). Invalid state. Connect first." error probably means that the SSH connection was closed during the sequence of commands sent to the target machine to connect, upload the script, execute it, and collect the results. Debug logs could tell us more. I recommend opening a support ticket.

I am now getting the same error on 22 applications (out of 326). Unmanage/re-manage has no effect. I am rebooting the servers tonight to see if that helps, and I am also opening a ticket.

An even easier way to force a job update (recreation) should be "Poll Now" from the Application Details resource.

I just had this problem with an application yesterday, "Poll Now" for some reason didn't fix it; however "Unmanage"/"Re-Manage" did fix it.  I am not completely clear on the specific differences regarding the scheduler but there definitely seems to be something different.

When you click Poll Now, the job is rescheduled to run "now" in the job scheduler. If the job scheduler is already full, the job may not execute immediately, but it will execute sooner than the next scheduled poll cycle.

Depending on how many components are in a given template, and on the types of component monitors that make up the template, it is not unusual for polling to take several minutes to complete. To see this behavior, click Edit Application Monitor on the Application Details page and then click Test All.