"Connection timeout. Job Canceled by scheduler" - error

Hi,

We are getting a "Connection timeout. Job canceled by scheduler" error for all the monitors in an application template (for example, the SNMP monitors and script monitors within a single template). Could someone shed some light on this? What does this error mean, and why is it happening? How can we fix it?

  • In large installations this can indicate that your scheduler is overloaded. I have also seen it happen when one or two component monitors are not working properly, which causes a chain reaction because APM polling is done serially, not in parallel.

    If you don't think your scheduler should be overloaded, try going into the web UI and clicking the "Unmanage" button for the application; once that completes, click "Re-manage". This updates the job on the scheduler, and I have had cases where it resolved the problem.

    If that doesn't work, you don't think any of your components are problematic, and your scheduler shouldn't be overloaded, then you should definitely open a support ticket with SolarWinds.

    Hope this helps!

  • An even easier way to force a job update (recreation) should be "Poll Now" from the application details resource.



  • I just had this problem with an application yesterday. "Poll Now" for some reason didn't fix it; however, "Unmanage"/"Re-Manage" did. I am not completely clear on the specific differences as far as the scheduler is concerned, but the two actions definitely seem to behave differently.

  • When you click Poll Now, the job is rescheduled to run "now" in the job scheduler. If the job scheduler is already full, the job may not execute immediately, but it will execute sooner than the next scheduled poll cycle.

    Depending on how many components are in a given template, and on the types of component monitors that make up the template, it is not unusual for polling to take several minutes to complete. To see this behavior, click Edit Application Monitor on the Application Details page and then click Test All.
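
To illustrate why "Poll Now" is sooner but not necessarily immediate, here is a toy single-worker queue in Python. It is only a sketch under assumptions: `start_times`, the job tuples, and the single-worker model are hypothetical and do not reflect how the SolarWinds job scheduler is actually implemented.

```python
import heapq


def start_times(jobs, poll_now=None):
    """Simulate one worker running jobs in due-time order.

    jobs: list of (due_seconds, duration_seconds, name) tuples (hypothetical format).
    poll_now: optional job name whose due time is moved to 0 ("now").
    Returns a dict mapping each job name to the time it actually starts.
    """
    queue = []
    for seq, (due, duration, name) in enumerate(jobs):
        if name == poll_now:
            due = 0  # "Poll Now": the job becomes due immediately
        # seq breaks ties so jobs with the same due time keep their original order
        queue.append((due, seq, duration, name))
    heapq.heapify(queue)

    clock = 0
    starts = {}
    while queue:
        due, _seq, duration, name = heapq.heappop(queue)
        clock = max(clock, due)  # wait if the job is not due yet
        starts[name] = clock
        clock += duration        # the worker is busy until the job finishes
    return starts


jobs = [(0, 120, "app-A"), (0, 90, "app-B"), (300, 30, "app-C")]
print(start_times(jobs)["app-C"])                    # scheduled poll
print(start_times(jobs, poll_now="app-C")["app-C"])  # after "Poll Now"
```

In this toy model, the job that would normally start at 300 seconds starts at 210 seconds after "Poll Now": earlier than its scheduled cycle, but still behind the jobs that were already waiting.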

  • I would try the following:

    1) Increase the timeout on the application.

    2) Enable debug logging and inspect the log to see whether the polling process is stuck in some specific operation (component).

    3) Isolate the slow-polling component(s) by testing them one by one from the application edit page.

    How many components in the application are we talking about?
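
If you can reproduce the individual checks outside the product (for example, by running each script by hand), step 3 amounts to timing each one separately. A minimal Python sketch, assuming each check can be wrapped in a zero-argument callable; `time_components` and the check names are hypothetical and not part of any SolarWinds API:

```python
import time


def time_components(checks):
    """Run each check once and return (name, seconds) pairs, slowest first.

    checks: dict mapping a component name to a zero-argument callable.
    """
    timings = []
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
        except Exception as exc:
            # A failing check is still timed; the failure is reported, not raised
            print(f"{name}: failed with {exc!r}")
        timings.append((name, time.perf_counter() - started))
    return sorted(timings, key=lambda item: item[1], reverse=True)


# Stand-in checks; replace the lambdas with whatever runs your real check
checks = {
    "snmp_uptime": lambda: None,
    "disk_script": lambda: time.sleep(0.05),
}
for name, seconds in time_components(checks):
    print(f"{name}: {seconds:.3f}s")
```

The slowest entries at the top of the list are the first candidates to test individually from the application edit page.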



  • In reply to "1) increase timeout on application":

    I see this given as an answer fairly often. How is the default of 5 minutes not enough of a timeout period? If a component takes more than 5 minutes to complete polling, then there's something wrong.

    Byron,

    How can you check whether the scheduler is overloaded? Are there metrics that can be collected, or specific log entries that indicate problems?

  • The default of 5 minutes can fall short because the timeout value applies to all component monitors in the template together, not to individual component monitors. The 5-minute timeout for the application as a whole can be insufficient when there are too many component monitors in the template, when the connection to the monitored server is highly latent, or when a complex script monitor or other component takes a very long time to return results. Another possibility is an overloaded poller, but this is not normally the case.

    Increasing the timeout is also a good troubleshooting step. I would recommend doubling the value and seeing whether that reduces or eliminates the timeouts you're seeing. At that point we'll be able to narrow down why the timeouts are occurring.
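
As a rough back-of-the-envelope check of the point above, the per-component times add up against a single application-wide budget when components are polled one after another. A minimal Python sketch, assuming serial polling as described earlier in this thread; `exceeds_timeout` and the 25-second figure are illustrative assumptions, not product values:

```python
def exceeds_timeout(component_seconds, app_timeout_seconds=300):
    """Return (total_seconds, timed_out) for components polled one after another.

    component_seconds: measured or estimated duration of each component.
    app_timeout_seconds: timeout for the application as a whole (300 = 5 minutes).
    """
    total = sum(component_seconds)
    return total, total > app_timeout_seconds


# Thirteen components averaging 25 seconds each (hypothetical numbers)
print(exceeds_timeout([25] * 13))       # against the 5-minute default
print(exceeds_timeout([25] * 13, 600))  # after doubling the timeout
```

With these numbers the total is 325 seconds, which exceeds the 5-minute default but fits comfortably once the timeout is doubled, even though no single component is anywhere near 5 minutes.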

  • Thanks for the reply.

    I have just tried unmanaging and re-managing it. It didn't work.

    I have found that the issue is with many of the script monitors in it. There are around 10 script monitors and 3 SNMP monitors in the template. All script monitors are failing the test with the same error message:

    Testing on node xxxxx: failed with 'Undefined' status
    Error setting prompt (PS1). Invalid state. Connect first.

    Could someone let me know what this error means?



  • Ok, that makes sense. If you've got large templates with a ton of components or very complex scripts, then I could see where that might be an issue. Are they executed serially or in parallel? It seems like you'd need a huge template to tie things up for a full 5 minutes, even if they run sequentially. Most of my issues are with fairly small templates, usually fewer than 5 components and sometimes just one or two, but it makes sense in the context you provide.



  • In reply to "Error setting prompt (PS1). Invalid state. Connect first.":

    This error probably means that the SSH connection was closed during the sequence of commands sent to the target machine while attempting to connect, upload the script, execute it, and collect the results. Debug logs could tell us more. I recommend opening a support ticket.
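
While waiting on debug logs, one basic sanity check is whether the polling server can reach an SSH service on the target at all. The Python sketch below only opens a TCP connection and looks for the SSH identification banner; it does not log in or reproduce what the script monitor does, and `ssh_port_reachable`, the default port 22, and the 5-second timeout are assumptions for illustration.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection succeeds and the peer sends an SSH banner.

    SSH servers normally announce themselves with a line starting "SSH-"
    right after the connection is accepted.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(64)
            return banner.startswith(b"SSH-")
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```

For example, `ssh_port_reachable("xxxxx")` with your node's hostname in place of the placeholder returns False if the port is blocked or the service is down. A True result only rules out basic connectivity problems; the connection could still be dropped later in the session.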