Hello,
I have an issue with NPM incorrectly reporting reboots on Linux servers.
The problem is how snmpd on Linux reports uptime through the sysUpTime OID (.1.3.6.1.2.1.1.3.0). It measures how long the SNMP agent process has been running, not how long the OS has been up.
# Running snmpwalk against a Linux server that hasn't been touched in a while
snmpwalk -v 2c -c public $LINUX_SERVER .1.3.6.1.2.1.1.3.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1195874669) 138 days, 9:52:26.69
# Restart the SNMP daemon, then run the same query again
snmpwalk -v 2c -c public $LINUX_SERVER .1.3.6.1.2.1.1.3.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (373) 0:00:03.73
So although the server itself hasn't rebooted, I'm getting reboot alerts for it.
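For anyone who wants to check the raw numbers: Timeticks are hundredths of a second, so the values in parentheses can be converted by hand. Below is a rough sketch of that conversion, plus a guess at the kind of check a poller might apply. The function names are made up for illustration, and I'm only assuming NPM flags a reboot when the counter goes backwards; I haven't confirmed how it actually decides.

```shell
# Convert SNMP Timeticks (hundredths of a second) to "D days, H:MM:SS.cc".
ticks_to_human() {
  local ticks=$1
  local cs=$(( ticks % 100 ))          # leftover hundredths of a second
  local secs=$(( ticks / 100 ))        # whole seconds
  local days=$(( secs / 86400 ))
  local hours=$(( secs % 86400 / 3600 ))
  local mins=$(( secs % 3600 / 60 ))
  local s=$(( secs % 60 ))
  printf '%d days, %d:%02d:%02d.%02d\n' "$days" "$hours" "$mins" "$s" "$cs"
}

# Hypothetical reboot check (an assumption, not NPM's documented logic):
# succeeds when the current sample is lower than the previous one.
uptime_went_backwards() {
  [ "$2" -lt "$1" ]
}

ticks_to_human 1195874669   # prints: 138 days, 9:52:26.69

# The two sysUpTime samples from before and after the snmpd restart.
if uptime_went_backwards 1195874669 373; then
  echo "reboot flagged"
fi
```

With the two samples above, a check like this would fire even though the OS never went down, which matches what I'm seeing.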
There is a better OID that reports the actual uptime of the system:
snmpwalk -v 2c -c public $LINUX_SERVER host.hrSystem.hrSystemUptime.0
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (1195876954) 138 days, 9:52:49.54
But this counter seems to give wrong results for Windows servers. This one has been up for about a month, yet hrSystemUptime reports over a year:
snmpwalk -v 2c -c public $WINDOWS_SERVER host.hrSystem.hrSystemUptime.0
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (3254887187) 376 days, 17:21:11.87
snmpwalk -v 2c -c public $WINDOWS_SERVER .1.3.6.1.2.1.1.3.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (325537591) 37 days, 16:16:15.91
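As a rough sanity check on the two Windows values above, the ratio between them comes out at almost exactly 10. That would fit the Windows agent reporting hrSystemUptime in milliseconds instead of hundredths of a second, but that is only my guess from these two numbers, not something I've confirmed. The helper name below is made up for illustration.

```shell
# Values copied from the two Windows queries above.
hr_ticks=3254887187    # hrSystemUptime.0
sys_ticks=325537591    # sysUpTimeInstance

# Ratio of two counters (awk is used for the floating-point division).
uptime_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

uptime_ratio "$hr_ticks" "$sys_ticks"   # prints: 10.00

# Assumption: if the value really is in milliseconds, dividing by 10
# gives hundredths of a second. Whole days of uptime under that reading:
echo $(( hr_ticks / 10 / 100 / 86400 ))  # prints: 37
```

Under that assumption the server has been up about 37 days, which agrees with both sysUpTime and what I know about the box.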
Does anyone have any opinions about the apparent quirkiness of uptime monitoring with NPM? (Or maybe the quirky nature of SNMP installations!)
Thanks
~sm