Hey everyone! I figured this would be a question best asked here before I go diving into my own rabbit holes. :)
We're occasionally seeing kernel panics on our OpenShift Container Platform infrastructure servers. These panics cause the system to auto-restart and recover shortly after, but our infrastructure team wants to be notified whenever this happens.
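For what it's worth, one idea I've been toying with outside of Orion: if kdump is enabled on these nodes (something I'd still need to confirm), a panic should leave a vmcore under /var/crash, so a boot-time check could spot it after the fact. A rough Python sketch of what I mean:

```python
#!/usr/bin/env python3
"""Boot-time check for evidence of a kernel panic.

A sketch only: it assumes kdump is enabled and writing dumps to the
default /var/crash path, which I haven't verified on our nodes yet.
Intended to run once per boot (systemd unit or @reboot cron)."""

import os
import time

CRASH_DIR = "/var/crash"  # default kdump target on RHEL

# Boot time = now minus uptime (first field of /proc/uptime, in seconds).
with open("/proc/uptime") as f:
    boot_time = time.time() - float(f.read().split()[0])

def recent_crash_dumps():
    """Dump directories written around the time of this boot (i.e., by the
    crash kernel that ran just before it)."""
    if not os.path.isdir(CRASH_DIR):
        return []
    return [e.path for e in os.scandir(CRASH_DIR)
            if e.is_dir() and e.stat().st_mtime > boot_time - 3600]

if __name__ == "__main__":
    dumps = recent_crash_dumps()
    if dumps:
        print("Possible kernel panic; new dump(s): %s" % ", ".join(dumps))
```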
In the case of our latest kernel panic, we saw the following events:
12:04:51 PM - event - Node has stopped responding
12:07:02 PM - event - Responding again with response time of 2 milliseconds
12:07:56 PM - event - rebooted at 12:00:00 PM
Typically, we have an alert that triggers on a node being down if a custom property is set. This "node down" alert checks the node's status every 5 minutes, and the condition must hold for at least 4 minutes before the alert is sent.
Looking at the Performance Analyzer, the node was in an "Unknown" state for some reason during this specific incident. Even then, however, it wasn't in that state long enough to get past the 5-minute check interval and the 4-minute duration requirement.
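Doing the math on the timestamps above makes it clear why the existing alert never stood a chance:

```python
from datetime import datetime

down = datetime(2024, 1, 1, 12, 4, 51)  # "stopped responding" (date is arbitrary)
up   = datetime(2024, 1, 1, 12, 7, 2)   # "responding again"

outage_min = (up - down).total_seconds() / 60
print("Outage lasted %.1f minutes" % outage_min)  # ~2.2, well under 4 minutes
```

A roughly 2-minute blip can also fall entirely between two 5-minute polls, so even seeing the status change at all was partly luck.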
With this in mind, I'm trying to think of the best way to inform the infrastructure team of these kernel panics. My immediate thought was to alert on an event, but I'm worried that would also catch expected/desired reboots rather than just the kernel panics they care about, especially if I go with something simple like "Boot Time Changed." I also don't know what event category that "not responding" event falls under; I only saw options for status changes.
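What I really want to express is "boot time changed AND the previous shutdown wasn't clean." If Orion can't make that distinction on its own, the node could make the call itself. A rough sketch, with the caveat that the journal options and shutdown messages below are assumptions I'd verify first (persistent journald storage isn't the RHEL 7 default):

```python
"""Sketch of telling a panic apart from a planned reboot, on the node itself.

Assumptions to verify: the systemd journal persists across boots
('-b -1' needs Storage=persistent in journald.conf), and the exact
shutdown messages matched below."""

import subprocess

def previous_boot_was_clean():
    """Check the tail of the previous boot's journal for a normal shutdown."""
    try:
        result = subprocess.run(
            ["journalctl", "-b", "-1", "-n", "50", "--no-pager"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return False  # no journal for the previous boot; suspicious in itself
    # A clean systemd shutdown logs messages like these; a panic cuts the
    # journal off mid-stream instead.
    tail = result.stdout
    return "Reached target Shutdown" in tail or "Powering off" in tail

if __name__ == "__main__":
    if not previous_boot_was_clean():
        print("Previous boot ended abruptly; possible kernel panic")
```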
The node in question runs RHEL 7.6, sitting on top of a dedicated VMware ESX server, and is monitored using an agent. We do not have SNMP monitoring set up for Linux, and I don't think we're capturing any logs in Orion.
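Given those gaps, the fallback I keep coming back to is having the node notify the team directly when the boot-time check above finds crash evidence, skipping Orion entirely. A minimal sketch; the relay host and addresses are placeholders, not our real values:

```python
"""Direct notification from the node, bypassing Orion entirely.

Placeholder sketch: the relay host and addresses are made up, and this
assumes the node can reach an internal SMTP relay."""

import smtplib
import socket
from email.message import EmailMessage

def notify_infra(detail):
    msg = EmailMessage()
    msg["Subject"] = "Possible kernel panic on %s" % socket.gethostname()
    msg["From"] = "node-alerts@example.com"       # placeholder sender
    msg["To"] = "infra-team@example.com"          # placeholder recipient
    msg.set_content(detail)
    with smtplib.SMTP("mailrelay.example.com") as smtp:  # placeholder relay
        smtp.send_message(msg)

# e.g., called from the boot-time check above:
# notify_infra("Found vmcore under /var/crash from the last boot")
```

That said, I'd much rather keep this in Orion so all our alerting stays in one place, hence the question. Any ideas appreciated!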