So we just made the mistake of updating (adding) our SNMP community strings on our servers and restarting the SNMP agents to get the settings to take. What we failed to realize was that Orion reads the uptime of the SNMP agent as the uptime of the server. BAD BAD BAD. Lots of false reboot messages to lots of people. Needless to say, we are now scrambling to provide a REAL uptime calculation.
So we are requesting that two things occur. First, server uptime should be moved from OID 1.3.6.1.2.1.1.3 (sysUpTime) to 1.3.6.1.2.1.25.1.1 (hrSystemUptime), which is the actual server uptime, not the agent's. Second, some additional checking should occur when Orion sees the counter reset. On Windows this could be a check of the event log for a Shutdown Event / Reason (this would allow Orion to flag the reboot as intentional vs. unintentional). There has to be some similar log on Linux.
Please chime in if you would like to see this sooner rather than later.
Edit - To further support my cause, other posts on the topic are below:
Add my vote as well.
As a potential solution / work-around, adding the conditional test "time goes backwards" to the advanced trigger options might help.
This test would check the current value read from an OID against the previous value and if less than the previous value, the trigger fires.
Just an idea.
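The "time goes backwards" test described above can be sketched roughly as follows. This is only an illustration in Python, not anything Orion provides: the function name and the sample values are made up, and the values stand in for successive polls of an uptime OID.

```python
from typing import Optional


def counter_went_backwards(previous: Optional[int], current: int) -> bool:
    """Return True when the newly polled value is lower than the previous one."""
    if previous is None:
        # First poll: there is nothing to compare against yet.
        return False
    return current < previous


# Hypothetical uptime readings from five consecutive polls.
samples = [120000, 180000, 240000, 3000, 63000]

previous = None
resets = []
for index, value in enumerate(samples):
    if counter_went_backwards(previous, value):
        resets.append(index)
    previous = value

print(resets)  # [3]
```

One caveat: a 32-bit TimeTicks counter in hundredths of a second wraps after roughly 497 days, so a trigger like this would also fire on a wrap, and it would still fire on an agent restart if it is pointed at the agent's uptime rather than the host's.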
Just an FYI - I'm using the performance (perfmon) counter System Up Time under 'System'. It returns a value in seconds that is accurate no matter if snmp restarts, event log service restarts, etc.
My alert is based on this value being less than 5 minutes. The idea being that if the server reboots, within the first polling cycle, Orion should detect this value is less than 5 minutes so it must have rebooted.
Plus there's the added bonus of converting the seconds into various periods. I have a report that shows the value in days, minutes, seconds, and the last date it rebooted (today's date minus the seconds).
Add me to the list here as well... this is something I brought up and requested a few years ago, and it still hasn't been integrated into the product.
Currently I poll my NetSNMP and any Linux/Unix based nodes every 10 minutes for hrSystemUptime. I then trigger an alert if the value of hrSystemUptime is 'less than or equal to' 60000 (the value is in hundredths of a second, so 60000 equals 10 minutes).
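For anyone who wants to reproduce that alert and report outside of Orion, here is a rough Python sketch. It assumes the uptime is already available as a number of seconds (as the perfmon counter returns it); the function names and the sample numbers are invented for the example.

```python
from datetime import datetime, timedelta

REBOOT_THRESHOLD_SECONDS = 5 * 60  # the 5-minute alert threshold mentioned above


def recently_rebooted(uptime_seconds, threshold=REBOOT_THRESHOLD_SECONDS):
    """True when the reported uptime is below the alert threshold."""
    return uptime_seconds < threshold


def split_uptime(uptime_seconds):
    """Break an uptime in seconds into (days, hours, minutes, seconds)."""
    days, remainder = divmod(int(uptime_seconds), 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    return days, hours, minutes, seconds


def last_boot_time(uptime_seconds, now):
    """Estimate the last boot time as 'now' minus the uptime."""
    return now - timedelta(seconds=uptime_seconds)


# Hypothetical reading: 200000 seconds of uptime, polled at a fixed time.
polled_at = datetime(2024, 1, 10, 12, 0, 0)
print(recently_rebooted(200000))          # False
print(split_uptime(200000))               # (2, 7, 33, 20)
print(last_boot_time(200000, polled_at))  # 2024-01-08 04:26:40
```

Note that a threshold check like this can only catch a reboot if the node is polled at least once while the uptime is still under the threshold, so the threshold probably should not be shorter than the polling interval.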
This is just too much work on my end for something that should be built correctly into the product. I also have not found an equivalent accurate OID to poll for Cisco devices so I still have this issue with the majority of my nodes.
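The hrSystemUptime check described above could look something like this outside of Orion. This is a sketch under assumptions: it supposes the value was fetched with a tool such as Net-SNMP's snmpget and arrives as a text line containing `Timeticks: (N)`; the sample line is illustrative, not captured from a real host, and the function names are hypothetical.

```python
import re

TICKS_PER_SECOND = 100            # TimeTicks are hundredths of a second
POLL_INTERVAL_SECONDS = 10 * 60   # the 10-minute polling interval from the post
THRESHOLD_TICKS = POLL_INTERVAL_SECONDS * TICKS_PER_SECOND  # 60000

# Matches the raw tick count inside "Timeticks: (45000) 0:07:30.00".
TIMETICKS_PATTERN = re.compile(r"Timeticks:\s*\((\d+)\)")


def parse_timeticks(snmp_output_line):
    """Extract the raw tick count from one line of SNMP tool output."""
    match = TIMETICKS_PATTERN.search(snmp_output_line)
    if match is None:
        raise ValueError("no Timeticks value found in: %r" % snmp_output_line)
    return int(match.group(1))


def rebooted_since_last_poll(uptime_ticks, threshold_ticks=THRESHOLD_TICKS):
    """True when the host uptime is no longer than one polling interval."""
    return uptime_ticks <= threshold_ticks


# Illustrative line for hrSystemUptime (OID 1.3.6.1.2.1.25.1.1.0).
sample_line = "HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (45000) 0:07:30.00"
ticks = parse_timeticks(sample_line)
print(ticks, rebooted_since_last_poll(ticks))  # 45000 True
```

Tying the threshold to the polling interval is what makes this work: any reboot between two polls leaves the uptime at or below one interval at the next poll.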
How do you do that for Linux devices? I've not done that before. We have them, but I don't know them very well, yet I'm supposed to monitor them. Thanks!
Adding my vote.
1. Reporting SERVER Uptime based on SNMP Uptime is a LIE! I cannot accurately tell my company the uptime of ANY device we have.
2. Multiple times we have experienced issues with the SNMP service on Windows machines, or the snmpd process on Linux machines, and have had to restart the service/process. Every time this happens, I have to ReplyAll to the alert email that gets sent out, saying "I'm sorry, SolarWinds Orion is reporting a false alarm. This was not a reboot, rather this was me resolving an SNMP issue."