We are experiencing a unique issue with several of our monitored Windows Servers and I am hoping you guys can provide some guidance.
There are occasions when one of our Windows Servers has crashed and is sitting at the BSOD screen, but it still responds to pings (yes, these are virtualized in VMware). So as far as SolarWinds NPM is concerned, there is no gap in monitoring and the server is still in an "UP" state. As far as the VMware host is concerned, the VM is also still in a running state. The only thing I've noticed when this happens is that NPM shows no gaps in network latency, but the CPU and RAM statistics are blank.
I know the best workaround is to use WMI to monitor a service, so that if the server doesn't respond to WMI we can alert on that. But I have been forbidden to use WMI because it causes memory issues with these specific application servers, so it is not an option in this case.
Is there a way to monitor for gaps in CPU or RAM statistics gathering? In other words, can I alert when no vital statistics have been collected from a monitored node for some period of time, even though it is still responding to ping?
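To be concrete about the condition I'm after, here is a rough Python sketch of the logic. This is not SolarWinds code: the function name, the timestamp inputs, and the 5 and 15 minute thresholds are all made up for illustration, and I'm assuming the last successful ping time and the last CPU/RAM sample time are available from somewhere.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_stats_gap(
    last_ping_ok: datetime,
    last_stats_sample: Optional[datetime],
    now: datetime,
    max_ping_age: timedelta = timedelta(minutes=5),    # hypothetical threshold
    max_stats_age: timedelta = timedelta(minutes=15),  # hypothetical threshold
) -> bool:
    """Return True when the node answers ping but CPU/RAM stats have gone stale."""
    # The node still looks "UP" if it answered a ping recently.
    ping_fresh = (now - last_ping_ok) <= max_ping_age
    # Stats are stale if none were ever collected or the newest sample is too old.
    stats_stale = (
        last_stats_sample is None
        or (now - last_stats_sample) > max_stats_age
    )
    return ping_fresh and stats_stale
```

So a node that last answered ping a minute ago but hasn't reported CPU/RAM in an hour would be flagged, while a node that is reporting both normally (or is down entirely) would not.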
- Joe