
NCM Firmware Upgrade Timers and Reboot Detection

Does anyone have a 100% grasp on the timers and reboot detection for the firmware upgrade module?

I have tried a few things with SolarWinds support, and we are still working together to get consistent results, but I am still having reboot detection issues.

We tried changing the firmware upgrade timers at Orion/Admin/AdvancedConfiguration/Global.aspx, specifically FirmwareUpgradeRebootDeviceTimeout, but the result was not as expected.

We also tried adjusting the timer in the template itself: export the current template, edit it in Notepad, and re-import it with a new value greater than the normal limit.


In both cases, reboot detection is inconsistent. Sometimes the job hangs on the reboot phase and never completes or errors out, even though the device rebooted after SolarWinds issued the reboot command and has in fact been upgraded.

In my case, the delay exists because the install command cannot be run in a way that finishes everything except the reboot, leaving the reboot phase to be just the reload command. Some devices do have this option, and on those I find reboot phase detection works great, because the reboot starts as soon as the command is issued in that phase, which is what I think SolarWinds is expecting. In my case, the device runs the commands, starts installing, takes around 10 minutes to start the reboot, and then approximately another 5 minutes to complete it. Those times may vary somewhat.
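As a rough illustration of sizing a timeout from timings like these, here is a minimal sketch. It assumes the timeout has to cover the whole window from issuing the command to the device being back, plus some headroom for outliers. The function name and the 50% headroom are hypothetical and are not anything taken from NCM.

```python
import math


def suggest_timeout_minutes(install_min: float, reboot_min: float,
                            headroom: float = 0.5) -> int:
    """Return a timeout in whole minutes covering install + reboot.

    headroom is a fraction added on top of the expected total
    (0.5 means 50% extra) to absorb slower-than-usual runs.
    This is an illustrative rule of thumb, not an NCM formula.
    """
    expected_total = install_min + reboot_min
    return math.ceil(expected_total * (1 + headroom))


# ~10 minutes of install before the reboot starts, ~5 minutes of reboot
print(suggest_timeout_minutes(10, 5))  # 23
```

With the timings described here, that lands a little under the 25 minutes that is tried later in this thread.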

What I'm looking for is a concrete understanding of how the reboot phase executes and what it needs to detect in order to move to the verify upgrade phase. I am baffled because, sometimes, after an error in the template, a device has gone from the reboot phase to the verify upgrade phase without ever rebooting. It fails in the verify upgrade phase, but it's confusing that it moved on without a reboot or any ICMP packet loss actually occurring. I have asked support the same question and will post what I learn here, but if anyone can help me along, that would be great. Thank you!



  • I'm still getting intermittent results with reboot detection on the Catalyst 9300, where I am using the install activate prompt-level none command. It takes about 10 minutes to finish the install, followed by about a 5 minute reboot. For the most part it is working, and now I am thinking it might be a server resource issue, though I haven't been able to recreate the issue recently...

    Here is what I learned so far from support in case it might help others:

    The timers are here: Orion/Admin/AdvancedConfiguration/Global.aspx

    Setting FirmwareUpgradeRebootDeviceTimeout to 25 minutes allows all firmware upgrade operations to see a reboot anywhere within 25 minutes of issuing the reboot command. If you set this to 25 minutes and the whole upgrade takes 15-20 minutes, SolarWinds will not wait until the 25 minutes are over to finish, so I haven't seen any negative consequence from increasing this number a bit above the expected time to ensure that outliers don't error out the upgrade job.

    Setting FirmwareUpgradeRebootDeviceNodeDownTimeout to 30 minutes is supposed to keep firmware upgrade operations from erroring the job until 30 minutes have passed. That is what the description on the settings page says and what SolarWinds support told me; however, in practice I have never seen this timer work. When the reboot is not detected, the upgrade operation waits or hangs indefinitely on the first node and does not error out after 30 minutes, so in my experience this timer doesn't work - it might be an issue specific to me, but no cause has been found yet...

    The timer in the template, which as I noted previously can be increased beyond 10ms via the export and import template feature, was intended for devices like Nexus switches where an upgrade will not cause the node to go down - dual supervisors etc. I don't think it's the right fit for a delayed reboot, since the device will in fact go down and cause loss of connectivity.

    I wish I were more confident in these timers. For the most part, the advanced settings have worked for me on about 8 out of 10 days. I don't believe the devices are causing the issue on the days it doesn't work: the job would fail on the first node while I monitored the progress, and the device would reboot at the expected time and come back online at the expected time running the new version - so the only issue is the reboot detection... Rather frustrating. I hope this becomes more reliable in the future, with easier debugging options.
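To make the two timers easier to reason about, here is a minimal sketch of one possible reading of what support described: the node must be seen going down within the first timeout, and must come back within the second. This is only a mental model, not SolarWinds' actual implementation. `classify_reboot`, `RebootResult`, and the sample format are hypothetical names made up for illustration. It could be fed reachability samples collected independently (for example from a ping log) to compare against what the upgrade job reports.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class RebootResult:
    outcome: str              # "rebooted", "no_reboot_seen", "did_not_return", "inconclusive"
    down_at: Optional[float]  # minutes after the reboot command was issued
    up_at: Optional[float]    # minutes after the reboot command was issued


def classify_reboot(samples: Iterable[Tuple[float, bool]],
                    reboot_timeout_min: float,
                    node_down_timeout_min: float) -> RebootResult:
    """Classify (minute, reachable) samples under an ASSUMED two-timer model.

    reboot_timeout_min:    how long to wait for the node to go down.
    node_down_timeout_min: how long the node may stay down once it has.
    """
    down_at: Optional[float] = None
    for minute, reachable in samples:
        if down_at is None:
            # Still waiting for the node to drop off the network.
            if minute > reboot_timeout_min:
                return RebootResult("no_reboot_seen", None, None)
            if not reachable:
                down_at = minute
        else:
            # Node went down earlier; wait for it to answer again.
            if reachable:
                return RebootResult("rebooted", down_at, minute)
            if minute - down_at > node_down_timeout_min:
                return RebootResult("did_not_return", down_at, None)
    # Samples ended before either timer expired.
    return RebootResult("inconclusive", down_at, None)


# One sample per minute: up for 10 minutes, down for 5, then back up.
samples = [(m, not (10 <= m < 15)) for m in range(21)]
print(classify_reboot(samples, reboot_timeout_min=25, node_down_timeout_min=30))
```

Under this model, a device that takes 10 minutes to start rebooting is only caught if the first timeout is longer than 10 minutes, which matches the reasoning for raising FirmwareUpgradeRebootDeviceTimeout. It does not explain a job that hangs indefinitely, since both branches eventually return.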
