Node/VIM: Memory Safety Net Alert

Question

So we had a fun little scenario occur the other day, and I'm curious how others have addressed it.  We had a node where a service went crazy and pegged the system at 100% memory utilization.  The system was still responding to ping, but for everything else it was down.  The node status stayed up because of the ping status and never went to an unknown state.  The resource monitors kept their previous values, and I did not see them change to unknown.  I'm going to get some service monitors in place to keep an eye on it that way, and we do have alerts configured for unknown status on applications (toggleable based on some application custom properties).  I could base node status on WMI (Windows) and SNMP (Linux), but I still prefer ping in that regard.

This scenario got me thinking about ways to build in a safety-net style alert.  I want to avoid alert fatigue and not send multiple alerts for the same issue, so I was thinking of tapping into the data provided by VIM.  Does anyone else do something similar, or have a different approach for this kind of scenario?  The object selected below is set to Virtual Machine.  I am using a scope to target items based on custom properties.

We had an issue in the past where critical values were being reached but the "critical value reached" flag did not trip, hence the OR block.  We also prefer not to alert on resource utilization if a system is powered off/down.
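To make the trigger logic above concrete, here is a minimal Python sketch of the condition as described: skip powered-off VMs, then fire if either the flag is set OR the raw value is at or above the threshold. The `VmSample` type, its field names, and the "poweredOn" label are assumptions made for illustration; they are not SolarWinds or VIM identifiers, and the real alert is built in the alert condition editor rather than in code.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """Hypothetical snapshot of one VM's polled values (not a SolarWinds type)."""
    power_state: str               # assumed labels, e.g. "poweredOn" / "poweredOff"
    memory_usage_pct: float        # last polled memory utilization, 0-100
    critical_threshold_pct: float  # configured critical threshold, 0-100
    critical_flag: bool            # the platform's "critical value reached" flag


def should_alert(vm: VmSample) -> bool:
    # Do not alert on resource utilization for VMs that are powered off/down.
    if vm.power_state != "poweredOn":
        return False
    # OR block: trust the flag, but also compare the raw value against the
    # threshold in case the flag did not trip.
    return vm.critical_flag or vm.memory_usage_pct >= vm.critical_threshold_pct
```

The raw-value comparison is the safety net here: it only matters when the flag fails to trip, so it should not produce extra alerts in the normal case.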

adam.beedell · Answer

I always find people like the idea of CPU/memory alerts more than the practice lol. Long smoothing times are good, paired with checks for unresponsiveness, so I'd say you're in the right place.
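As a rough illustration of "long smoothing plus an unresponsiveness check", here is a small Python sketch. It treats smoothing as "the last N polls were all over the threshold", which is only one way to do it (in the product this would more likely be the "condition must exist for more than X minutes" style setting). The function names and parameters are made up for the example.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """True if the most recent `min_consecutive` samples are all >= threshold."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(s >= threshold for s in samples[-min_consecutive:])


def needs_attention(samples: Sequence[float], threshold: float,
                    min_consecutive: int, responsive: bool) -> bool:
    # An unresponsive system is alertable on its own; a utilization breach
    # only counts once it has been sustained across several polls.
    return (not responsive) or sustained_breach(samples, threshold, min_consecutive)
```

With a 95% threshold and three polls required, a single spike to 99% would not fire, but three high polls in a row would, and an unresponsive system would fire regardless of what the stale utilization values say.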

Usually "ping up but unresponsive" is alertable under "not responding" type alerts. There are some examples on THWACK and an OOTB one in recent versions. Those check on the node layer by default, not the VM layer.

(But if you haven't got at least one alert active for that sort of thing, I'd recommend it anyway. I put up an example of a bulk one for polling engines bugging out not that long ago.)

On the VM layer you probably also get a bad status and some alarms you can use, and probably no loss in data. Getting a bit of coverage overlap here is fine; stuff can fail three different ways, no problem at all.

If you get the chance to do as brscott mentions and ask the application either how it's doing or to do stuff, that tends to be super reliable and clean, but it also tends to be high-effort.

brscott · Answer

I have found that the memory reported from VMware is just not that useful... it's not SolarWinds' fault, it's just the way VMware reports it.  We do some alerting on the node's memory reaching a warning/critical state.  That seems to work without needing to directly compare it to the critical value.  I'm not sure if there is a delay in the state getting updated once it breaches the critical value, but there should not be; that should be handled as its value is updated.

If it's at all possible, I generally prefer to alert on the responsiveness of the application running on the server, not the memory and CPU thresholds.  I use those for reporting and troubleshooting... but there are always exceptions.  You gotta do what you gotta do.
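For anyone wanting a feel for what "alert on the responsiveness of the application" can look like outside the product, here is a minimal Python sketch of an HTTP probe using only the standard library. It assumes the application exposes some URL that is cheap to request; the function name and the `/health` path in the usage note are hypothetical. A host that answers ping but is starved for memory would typically fail a check like this by timing out.

```python
import http.client
import urllib.error
import urllib.request


def app_is_responsive(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout_s` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts, DNS failures, HTTP error statuses and
        # malformed responses are all treated as "not responsive".
        return False
```

Something like `app_is_responsive("http://myserver.example/health", timeout_s=3)` could then feed a script-style monitor, with the alert keyed on that result rather than on memory or CPU values.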