How to approach alerting on unexpected reboots?

Hey everyone! I figured this would be a question best asked here before I go diving into my own rabbit holes. :)

We're occasionally seeing kernel panics on our OpenShift Container Platform infrastructure servers. These kernel panics result in the system automatically restarting and recovering shortly after, but our infrastructure team wants to be notified when this happens.

In the case of our latest kernel panic, we saw the following events:

12:04:51 PM - event - Node has stopped responding

12:07:02 PM - event - Responding again with response time of 2 milliseconds

12:07:56 PM - event - rebooted at 12:00:00 PM

Typically, we have an alert that triggers on a node being down if a custom property is set. This "node down" alert checks its status every 5 minutes, and the condition must be present for at least 4 minutes before the alert is sent.

Looking at the performance analyzer, the node was in an "unknown" state for some reason during this specific incident. Even then, however, it was not down long enough to trigger the 5 minute threshold.

With this in mind, I'm trying to think of the best way to inform the infrastructure team of these kernel panics. My immediate thought was to alert on an event. I'm worried that doing so might catch expected/desired reboots and not just the kernel panics they care about - especially if I go for something simple like "Boot Time Changed." I don't know what event category that "not responding" event falls under, though - I only saw options for status changes.

The node in question is running RHEL 7.6, sitting on top of a dedicated VMware ESX server. The node is monitored using an agent. We do not have SNMP monitoring set up for Linux, and I don't think we are doing anything in terms of capturing logs in Orion.

  • The best thing is to be able to monitor a log to see if there is a specific event you can trigger on.  

    Many VM servers can reboot so fast that they are never really seen as down, and the only way to know is to either track system uptime (not agent uptime, as an agent can be restarted by outside influences) or look for a specific event in a log.
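To make the "look for a specific event in a log" idea concrete, here is a rough Python sketch of a log scan. The pattern list is an assumption: "Kernel panic - not syncing" is the usual wording of the kernel's panic message, but a panicking kernel may not get the chance to write it to disk, so the line may only show up on the console, in a crash dump, or on a remote log collector. Confirm what your hosts actually record before relying on it. The function names are made up for illustration.

```python
import re

# Assumed pattern; adjust after checking what your own hosts actually log.
PANIC_PATTERNS = [re.compile(r"Kernel panic - not syncing", re.IGNORECASE)]


def find_matches(lines, patterns=PANIC_PATTERNS):
    """Return (line_number, text) for every line matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path, patterns=PANIC_PATTERNS):
    """Scan a log file such as /var/log/messages (needs read permission)."""
    with open(path, errors="replace") as handle:
        return find_matches(handle, patterns)
```

Something like `scan_file("/var/log/messages")` would then return the matching lines, which a component monitor or cron job could turn into an alert.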

  • We've not had a ton of success with event monitoring outside of Windows with SAM so far. Our network log traffic flows to a Splunk instance, and we likely are going to be expanding that in the future (though if Log Analyzer can compete, more power to it).

    Any suggestions for where to poke around?

  • First you need to identify a related message in a log file such as /var/log/messages.

    Then you can work on ways to get the event out of the log file. I believe the agent has the ability to watch logs. Also, in RHEL you can run the uptime command, which tells you how long the system has been up since the last reboot. An uptime of less than 5 minutes, outside of your scheduled maintenance windows, may be criteria for investigation.
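The uptime idea could be sketched roughly like this in Python. It reads `/proc/uptime` (the first field is seconds since boot on Linux) rather than parsing the output of `uptime`. The maintenance window list, the 300-second threshold, and all function names are hypothetical placeholders, not anything Orion or the agent provides.

```python
from datetime import datetime, time, timedelta

# Hypothetical maintenance windows: (weekday, start, end), with Monday = 0.
MAINTENANCE_WINDOWS = [(6, time(2, 0), time(4, 0))]  # Sundays 02:00-04:00


def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime on Linux)."""
    with open(path) as handle:
        return float(handle.read().split()[0])


def in_maintenance_window(moment, windows=MAINTENANCE_WINDOWS):
    """True if the given datetime falls inside any configured window."""
    return any(
        moment.weekday() == weekday and start <= moment.time() <= end
        for weekday, start, end in windows
    )


def is_suspicious_reboot(now, uptime_seconds,
                         windows=MAINTENANCE_WINDOWS, threshold_seconds=300):
    """True if the host booted recently and outside a maintenance window."""
    if uptime_seconds >= threshold_seconds:
        return False
    booted_at = now - timedelta(seconds=uptime_seconds)
    return not in_maintenance_window(booted_at, windows)
```

A scheduled check would call `is_suspicious_reboot(datetime.now(), read_uptime_seconds())`. The threshold needs to be at least as long as the interval between checks, otherwise a reboot can slip through between two runs.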

  • This "node down" alert checks its status every 5 minutes, and the condition must be present for at least 4 minutes before the alert is sent.

    Spit-balling some ideas here...

    If I'm reading this right, you are polling every 5 minutes.  So, your polling times would be 12:00 and 12:05.

    • 12:00 - Poll - everything is good
    • 12:02 - (between polls) - machine goes offline (down)
    • 12:05 - Poll - poll reports down or unknown.
    • 12:06 - (between polls) - machine finishes rebooting (back up)
    • 12:10 - Poll - everything is good

    So, in a timeline like that, you should always get an alert, because the problem needs to exist for only 4 minutes (less than the polling interval). If you aren't getting an alert, it's because the device has gone to something other than 'Down' (like Unknown, Critical, Warning, Unmanaged, etc.)

    I was bit by a similar scenario many years ago and decided to change up a few things.

    • Trigger Condition: Change from "Node is Down" to "Node is not Up, Critical, or Warning"
    • Trigger Timing: Must exist for 2 (or more) polling cycles.  This eliminates the 'things happen between polls' events.
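To play with these timings, here is a toy Python model of the polling timeline and the two trigger changes. It is only a sketch under a simplifying assumption: the monitor treats the last polled status as holding until the next poll, so a status first seen at one poll "exists" until a later poll shows otherwise. That is not a statement about how Orion's alert engine is actually implemented. Times are minutes after 12:00, and the function names are made up.

```python
def observe(poll_times, outage_start, outage_end, outage_status="down"):
    """Status seen at each poll; the outage covers [outage_start, outage_end)."""
    return [
        (t, outage_status if outage_start <= t < outage_end else "up")
        for t in poll_times
    ]


def alert_fires(observations, must_exist, triggering=("down",)):
    """True if a triggering status is held for at least must_exist minutes.

    Assumption: the status from one poll is held until the next poll.
    """
    bad_since = None
    for t, status in observations:
        if bad_since is not None and t - bad_since >= must_exist:
            return True  # condition held long enough by the time of this poll
        if status in triggering:
            if bad_since is None:
                bad_since = t
        else:
            bad_since = None
    return False


polls = [0, 5, 10, 15]
print(alert_fires(observe(polls, 2, 6), 4))             # timeline above
print(alert_fires(observe(polls, 6, 9), 4))             # outage between polls
print(alert_fires(observe(polls, 2, 6, "unknown"), 4))  # status was Unknown
```

Under this model the timeline above fires, an outage that fits entirely between two polls is never seen, and a node that reports Unknown does not trip a Down-only trigger unless "unknown" is added to `triggering`. Setting `must_exist` to two polling cycles (10) filters out the single-poll blip.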

    As said, if you need to know now-now then you'll have to rely on some type of 'push' message from the server (syslog, trap, something else). It can be done (and without too much difficulty with rsyslog or similar on Linux), but you'll need to take your time setting up your filters/alerts/tags from that side of the house.
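As one possible shape for a 'push' message, here is a minimal Python sketch that sends a single syslog line to a collector using the standard library's `logging.handlers.SysLogHandler` (UDP by default). The collector address, the message format, and the function names are assumptions for illustration, and it would need to be hooked into startup (a systemd unit or similar) to run at boot.

```python
import logging
import logging.handlers
import socket


def build_boot_message(hostname, uptime_seconds):
    """Format the notice; this key=value layout is just an example."""
    return f"boot-notice host={hostname} uptime_seconds={int(uptime_seconds)}"


def send_boot_notice(collector_host, collector_port=514,
                     uptime_path="/proc/uptime"):
    """Send one syslog message (UDP) announcing that this host has booted."""
    with open(uptime_path) as handle:
        uptime_seconds = float(handle.read().split()[0])

    logger = logging.getLogger("boot-notice")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=(collector_host, collector_port)
    )
    logger.addHandler(handler)
    try:
        logger.warning(build_boot_message(socket.gethostname(), uptime_seconds))
    finally:
        logger.removeHandler(handler)
        handler.close()
```

Note that this announces every boot, planned or not, so the filtering for maintenance windows would still have to happen on the receiving side.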

    You could rely on the VMware events (assuming it's running on ESX), but "Powered On" does not necessarily equal "Up."

  • Oh - does the team want these things as alerts or just as a Node Reboot Report for the Last XX time?  Like - is there anything they are going to do after seeing the alert?  If not, I say don't tax your system by running a million alerts and instead work on a nice report for them.

  • This is a good point, and a report would be useful.

  • My infrastructure team member did find a specific event log that gets reported back to the VMware host, but it doesn't look like this gets to our overall vsphere instance logs that Orion collects. I'm not sure. We'll probably explore the report option below, but knowing if Orion could see this alert would be helpful.

  • ... I think I found the problem. A thread indicated that you need to tell your ESXi hosts to forward their logs to Orion's Log Manager like you would for any networking device. Because we haven't done that, Orion has no clue what logs are out there for the individual hosts, only the overall vsphere config.

    thwack.solarwinds.com/.../how-to-monitor-esxi-hosts-with-log-manager

  • Okay, now I feel like I'm doing something wrong and I'm down a completely different rabbit hole. We have the VMAN Event logging add-on in place in our instance of Orion, as far as I can tell. But it's only collecting events from our vsphere and vsphere-dr instances, not the individual ESXi hosts like I would expect. I wonder if it is a permissions issue.

    Anyways, I'll stop derailing my own thread as this is supposed to be focused on alerting and reporting.

  • Okay. It looks really easy to modify the OOTB reports. So the question becomes scheduling it.

    Is there a way to have a report only send if there is actually activity? We're really trying to cut down on spam, but if all we do is send a daily report that says "nothing rebooted," it'll get tuned out.