29 Replies Latest reply on Jun 20, 2012 8:20 AM by teqnomad

    Better Method of Calculating Uptime

    jspanitz

      So we just made the mistake of updating (adding) our SNMP community strings on our servers and restarting the SNMP agents to get the settings to take.  What we failed to realize was that Orion reads the uptime of the SNMP agent as the uptime of the servers.  BAD BAD BAD.  Lots of false reboot messages to lots of people.  Needless to say, we are now scrambling to provide a REAL uptime calculation.


       


      So we are requesting two things.  First, that server uptime be moved from OID 1.3.6.1.2.1.1.3 (sysUpTime, the uptime of the SNMP agent) to 1.3.6.1.2.1.25.1.1 (hrSystemUptime), which is actual server uptime, not the agent's.  Second, that some additional checking occur when Orion sees the counter reset.  On Windows this could be a check of the event log for a Shutdown Event / Reason (this would allow Orion to flag the reboot as intentional vs. unintentional).  There has to be some similar log on Linux.
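
To illustrate the distinction being asked for, here is a minimal sketch that compares two consecutive polls of both counters. It assumes each poll is available as a dict mapping OID to a TimeTicks value; that shape and the function name `classify_reset` are hypothetical and not anything Orion exposes.

```python
# sysUpTime (MIB-II): time since the SNMP agent was last re-initialized.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"
# hrSystemUptime (Host Resources MIB): time since the host was last initialized.
HR_SYSTEM_UPTIME_OID = "1.3.6.1.2.1.25.1.1.0"


def classify_reset(prev, curr):
    """Classify what happened between two polls.

    prev and curr are hypothetical dicts of OID -> TimeTicks (1/100 s).
    Returns "reboot", "agent_restart", "unknown", or "none".
    """
    agent_went_back = curr[SYS_UPTIME_OID] < prev[SYS_UPTIME_OID]
    prev_host = prev.get(HR_SYSTEM_UPTIME_OID)
    curr_host = curr.get(HR_SYSTEM_UPTIME_OID)

    if prev_host is not None and curr_host is not None:
        # The host counter is available, so trust it over the agent counter.
        if curr_host < prev_host:
            return "reboot"
        return "agent_restart" if agent_went_back else "none"

    # No host counter: an agent reset cannot be told apart from a reboot.
    return "unknown" if agent_went_back else "none"
```

With this approach, restarting only the SNMP agent would be reported as "agent_restart" rather than as a reboot, provided the device answers for hrSystemUptime.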


      Please chime in if you would like to see this sooner rather than later.


       


      Edit - To further support my cause, other posts on the topic are below:


      Advanced Alerts:System Uptime HP Windows 2003/2008


      Misleading SNMP Uptime information


      Re: Redline Radios showing constant rebot


        • Re: Better Method of Calculating Uptime

          Wow John--great post. Will make sure PM sees it.

          M

          • Re: Better Method of Calculating Uptime
            KwameB

            Adding my name to the petition

            • Re: Better Method of Calculating Uptime
              adauria

              Add my vote as well.

              As a potential solution / work-around, adding the conditional test "time goes backwards" to the advanced trigger options might help.

              This test would check the current value read from an OID against the previous value and if less than the previous value, the trigger fires.

               

              just an idea.
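
A rough sketch of what such a "time goes backwards" test could look like, assuming the trigger is fed one polled value at a time. The class name `BackwardsTrigger` is made up for illustration and does not correspond to an existing Orion option.

```python
class BackwardsTrigger:
    """Fires when a polled value is lower than the previously polled value."""

    def __init__(self):
        self.previous = None  # no baseline until the first poll arrives

    def update(self, value):
        """Record a new polled value and report whether it went backwards."""
        fired = self.previous is not None and value < self.previous
        self.previous = value
        return fired


trigger = BackwardsTrigger()
for polled in (1000, 7000, 13000, 250):
    if trigger.update(polled):
        print("value went backwards:", polled)
```

Note that a TimeTicks counter is 32 bits wide and wraps after roughly 497 days, so a wrap would also look like time going backwards and would need separate handling.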

              • Re: Better Method of Calculating Uptime
                julrich

                Adding my vote.

                1. Reporting SERVER Uptime based on SNMP Uptime is a LIE! I cannot accurately tell my company the uptime of ANY device we have.

                2. Multiple times we have experienced issues with the SNMP service on Windows machines, or the snmpd process on Linux machines, and have had to restart the service/process. Every time this happens, I have to ReplyAll to the alert email that gets sent out saying "I'm sorry, SolarWinds Orion is reporting a false alarm. This was not a reboot, rather this was me resolving an SNMP issue."

                • Re: Better Method of Calculating Uptime
                  kcarson

                  throw me on there as well.. this is something that needs to be done. I can't for the life of me figure out why uptime would ever be based off of a service that can be recycled. just my 2c

                  • Re: Better Method of Calculating Uptime
                    byrona

                    I agree that using a counter that can easily be reset to determine Uptime is generally a bad idea and fully support considering a better solution.

                    However, the problem with the suggested solution is that the OID 1.3.6.1.2.1.25.1.1 exists on most servers but not on other SNMP-supported devices such as network gear.  If we are going to move to a new model for uptime calculations, I would like to see it support all devices, not just servers.

                      • Re: Better Method of Calculating Uptime
                        jspanitz

                        Byron - we completely agree.  If it can be done in a way that works for everything, great.  If it needs to be done two or more ways depending on the hardware, so be it.  As long as it is true uptime!

                          • Re: Better Method of Calculating Uptime
                            byrona


                            I totally agree!

                              • Re: Better Method of Calculating Uptime

                                Hi All--

                                I've marked this for the PM to review.

                                Thx,

                                M

                                • Re: Better Method of Calculating Uptime
                                  ctopaloglu

                                  Duplicate message please remove.

                                    • Re: Better Method of Calculating Uptime
                                      ctopaloglu


                                      We had the same problem, just like everybody else, and this is something that should be fixed immediately. Will you improve (fix) the uptime calculation according to our needs in the next version?

                                      You could also add a "System Uptime" section to the node details resource.

                                        • Re: Better Method of Calculating Uptime
                                          sean.martinez

                                          Orion uses the sysUpTime MIB object as the basis of uptime because the majority of devices support sysUpTime from the SNMP agent, whereas fewer support hrSystemUptime (Net-SNMP devices usually do not report hrSystemUptime). As of Orion 10.1.2 we even added a tolerance for devices that use an NTP server: if the clock is rolled back by less than 3 minutes, we will not report a reboot.
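
The thread does not describe how this tolerance is implemented, so the following is only a guess at the idea: a drop in uptime smaller than 3 minutes is treated as a clock adjustment rather than a reboot. The function name and the use of seconds are assumptions.

```python
TOLERANCE_SECONDS = 180  # the "less than 3 minutes" threshold mentioned above


def is_reboot(prev_uptime_s, curr_uptime_s, tolerance_s=TOLERANCE_SECONDS):
    """Return True only if uptime dropped by at least tolerance_s seconds.

    Hypothetical logic: smaller drops are assumed to be clock rollbacks
    (for example after an NTP correction) and are ignored.
    """
    rollback = prev_uptime_s - curr_uptime_s
    return rollback >= tolerance_s
```

One consequence of a rule like this is that a device which reboots within its first 3 minutes of uptime would not be flagged, since the drop is smaller than the tolerance.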

                                           

                                          If you need this to be changed, I highly recommend that you Open a Support Ticket to have this officially added as a Feature Request.

                                            • Re: Better Method of Calculating Uptime
                                              ctopaloglu


                                              So you mean that to make this request official I will have to open a support ticket? I didn't know that. I thought this was the place for feature requests. As I understand it, you have two lists. Am I right?

                                              1 - Official feature requests (Submitting a ticket via customer portal)

                                              2 - Unofficial feature requests (post a thread on thwack)

                                              What's the difference between them? When I opened a support ticket for feature requests, the staff told me to post on thwack. Can you clarify this for me?

                                                • Re: Better Method of Calculating Uptime
                                                  sean.martinez

                                                  You can put a Feature Request on here, but I have not seen any Product Management reply on this thread noting that it has been recorded. Support is able to submit Feature Requests to have anything added. The majority of customers that submit Feature Requests usually also put Feature Request into the Subject line of the cases so I would know which cases to submit to our Development.

                                                  I feel that it is important to provide a background of what the Orion Product Suite polls and why. 

                                                  I believe the best solution for uptime would be an option under Settings > Polling Settings, with a radio button to poll by either hrSystemUptime or sysUpTime and to fail over to the other OID if the value is not returned or is 0. This way you would be able to poll uptime in a similar fashion to how we poll 64-bit counters on interfaces.
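
A minimal sketch of that failover idea, assuming some `snmp_get` callable that returns the TimeTicks value for an OID or `None` when the device does not answer. Both `snmp_get` and `poll_uptime` are hypothetical names, not part of any Orion API.

```python
from typing import Callable, Optional, Tuple

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"           # sysUpTime (agent uptime)
HR_SYSTEM_UPTIME_OID = "1.3.6.1.2.1.25.1.1.0"  # hrSystemUptime (host uptime)


def poll_uptime(
    snmp_get: Callable[[str], Optional[int]],
    prefer_host: bool = True,
) -> Optional[Tuple[str, int]]:
    """Try the preferred OID first, then fall back to the other one.

    A missing value (None) or a value of 0 triggers the failover.
    Returns (oid, value) for the first usable answer, or None.
    """
    if prefer_host:
        order = [HR_SYSTEM_UPTIME_OID, SYS_UPTIME_OID]
    else:
        order = [SYS_UPTIME_OID, HR_SYSTEM_UPTIME_OID]

    for oid in order:
        value = snmp_get(oid)
        if value:  # skips both None and 0
            return oid, value
    return None
```

Returning the OID along with the value lets the caller record which counter the uptime came from, which matters because the two do not measure the same thing.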

                                                    • Re: Better Method of Calculating Uptime
                                                      fcaron

                                                      Hi ctopaloglu,

                                                      You can do both. They won't necessarily be processed and entered in the internal system at the same time (opening a case may get it in the system faster), but that should not make a big difference in terms of when this request ends up in the product (if it does).

                                                      The requests posted here can also be processed in real time, in which case you'll see a product manager posting.

                                                      Or they may be scanned during the content decision phase for the next release.

                                                      No matter what, they will be looked at.

                                      • Re: Better Method of Calculating Uptime
                                        jshrestha

                                        Absolutely agreed!!!!!!!!!!!

                                        • Re: Better Method of Calculating Uptime
                                          Atamido

                                          I created a partial workaround for creating a report that can be displayed in Orion, here:

                                          Polling and reporting real uptime

                                            • Re: Better Method of Calculating Uptime
                                              rjager

                                              Add me to the list as well; the current info is not valid for our Windows boxes.

                                              I also doubt the validity of the info that comes from our ESX servers.

                                              • Re: Better Method of Calculating Uptime
                                                teqnomad

                                                Hi,

                                                Can anyone confirm the recommended method for triggering an alert for hrSystemUptime?

                                                 

                                                I'm a newbie to Solarwinds. We have NPM v9.5.1.

                                                I've sorted

                                                1. configuring a custom universal poller (called hrSystemUptime which uses the hrSystemUptime MIB), assigned it to a network device (called testSLESVM), tested that it was able to communicate with the network device using the test button inside the assign wizard in the custom universal poller program,

                                                2. I've then tried to configure an alert in the advanced alert manager program with these settings:

                                                On the trigger page setting the following simple conditions:

                                                • PollerName equals hrSystemUptime
                                                • Nodename equals testSLESVM <-- name of the test
                                                • Status not equal to 0

                                                On the action page, send an email with a message including Pollername, Status and RawStatus, but these 'status' values don't seem to look like timeticks units.
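
For reference, hrSystemUptime is a TimeTicks value, that is, hundredths of a second. Assuming the raw polled number really is TimeTicks (which is not confirmed here), a small helper like this hypothetical `format_timeticks` shows what a readable uptime would look like:

```python
def format_timeticks(ticks: int) -> str:
    """Convert TimeTicks (1/100 s) into a 'Nd HH:MM:SS' string."""
    total_seconds = ticks // 100
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}d {hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_timeticks(8640000))  # 1d 00:00:00
print(format_timeticks(123456))   # 0d 00:20:34
```

If the values in the email do not convert to a plausible uptime this way, the field is probably not carrying the raw counter.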

                                                For the time being, I put the polling time at 1 min and set 5 secs on the length of time for condition, so it would be reasonably sensitive during my testing.

                                                 

                                                It seems that the first time I reboot after setting the triggers, it works and then it fails to respond to further reboots, until I change the triggers and then resave with the Status not equal to 0, which pops up a message saying it is clearing the history, then it seems to work again just once. I've tried acknowledging the advanced alert in case this reset the Status but this doesn't work.

                                                 

                                                Am I using the right trigger conditions? Why does it only work once and then need the history cleared?

                                                Which field is the actual 'uptime' value?

                                                 

                                                I've looked at a lot of the similar issues in the forum, but the SQL queries suggested seem overly complex.

                                                 

                                                At this point, I'd be happy just getting the value of uptime into every email sent, and then using an Outlook rule to forward only the emails with small uptimes to my work phone.

                                                 

                                                Any suggestions?