Any thoughts on a report or alert I could use to get notified when WMI fails to poll information on a node?
Volume status = "Unknown" is a good one
Could try to modify the built-in "Alert me when a managed node has not been polled during the last 5 tries" alert by adding a criterion where Object Type = WMI.
aLTeReGo for the win
Select Distinct (Nodes.Caption)
From Nodes
Inner Join Volumes
On Nodes.NodeID = Volumes.NodeID
Where Volumes.Status = 0
And Nodes.ObjectSubType = 'WMI'
And Nodes.Status = 1
I've been using volume status = Unknown to detect WMI issues, per aLTeReGo's advice. It does the job; however, I get a separate alert for each volume on a node. Some servers have 6-8 volumes, which means 6-8 separate emails get generated.
Any other ideas??
Have you considered adding a condition to the trigger so that it only alerts on the 'C:\' drive?
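For anyone who wants to see what that extra condition does before touching a live alert, here is a rough sketch that runs the query from earlier in the thread against a throwaway in-memory SQLite database. The table and column names Nodes, Volumes, NodeID, Status and ObjectSubType come from that query; the Volumes.Caption column, the LIKE filter and all of the sample rows are my assumptions, so check them against your own Orion database first.

```python
import sqlite3

# Toy stand-ins for the Orion tables, with only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Nodes (NodeID INTEGER, Caption TEXT, ObjectSubType TEXT, Status INTEGER)")
conn.execute("CREATE TABLE Volumes (NodeID INTEGER, Caption TEXT, Status INTEGER)")
conn.executemany("INSERT INTO Nodes VALUES (?, ?, ?, ?)", [
    (1, "SRV-A", "WMI", 1),   # WMI node, up, all volumes unknown
    (2, "SRV-B", "WMI", 1),   # WMI node, up, volumes polling fine
    (3, "SRV-C", "SNMP", 1),  # not a WMI node, should be ignored
])
conn.executemany("INSERT INTO Volumes VALUES (?, ?, ?)", [
    (1, r"C:\ Label:OS", 0), (1, r"D:\ Label:Data", 0), (1, r"E:\ Label:Logs", 0),
    (2, r"C:\ Label:OS", 1), (2, r"D:\ Label:Data", 1),
    (3, r"C:\ Label:OS", 0),
])

# How many per-volume matches (i.e. separate emails) the plain condition gives.
per_volume_matches = conn.execute("""
    SELECT COUNT(*)
    FROM Nodes INNER JOIN Volumes ON Nodes.NodeID = Volumes.NodeID
    WHERE Volumes.Status = 0 AND Nodes.ObjectSubType = 'WMI' AND Nodes.Status = 1
""").fetchone()[0]

# Same query as in the thread, plus the assumed C:\ filter.
QUERY = r"""
SELECT DISTINCT Nodes.Caption
FROM Nodes
INNER JOIN Volumes ON Nodes.NodeID = Volumes.NodeID
WHERE Volumes.Status = 0            -- 0 = Unknown
  AND Nodes.ObjectSubType = 'WMI'
  AND Nodes.Status = 1              -- 1 = Up
  AND Volumes.Caption LIKE 'C:\%'   -- assumed column: only look at C:\
ORDER BY Nodes.Caption
"""
flagged_nodes = [row[0] for row in conn.execute(QUERY)]

print("per-volume matches:", per_volume_matches)
print("nodes flagged on C:\\ only:", flagged_nodes)
```

On the toy data the plain condition matches every unknown volume on SRV-A separately, while the C:\ version returns that node once. It still has the weakness that a node whose C:\ volume polls fine but whose other WMI data fails will not show up.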
I have now!!!
Oh, and also:
Hmm, I wonder if something similar can be used for my problem here: Alert on agents not responding
The problem is that volumes might poll OK while CPU/memory doesn't, which means the volume-unknown method won't catch it. It's a bit flaky.
aLTeReGo We can't figure out why yet, but several of our Windows servers (no consistent pattern) stop responding. No blue screen and no RDP, but the system still responds to ping tests, which suggests the problem sits at the application level rather than lower in the network stack. Obviously I can't use any WMI or PowerShell scripting, and SHUTDOWN /r from the command prompt won't work either. Until we can determine the root cause, we've just been rebooting the affected servers to restore service. To reboot, we have to use iLO or reset the VM in vSphere.
What are your thoughts on how to automate a reboot action when this occurs?
1.) Physical - using iLO
2.) Virtual - using VMAN
Because we are triggering on volume status, when I try to use "Manage VM - Power off" (BTW - I think it's odd there is no "Manage VM - Reset" action), I am required to select a specific VM. If I switch the trigger to virtual machine, I can perform the action on the offending node, but then I can't work out what to trigger on.
In that scenario, I would recommend using the Agent on those machines. When the server locks up, the node status will then change to accurately reflect a 'down' status. This will then allow you to automate the action against that node using the 'Manage VM - Reboot' action. As for the iLO, there are methods of performing this in a scripted fashion, but you would need some consistent and predictable method of referencing them to make one script suitable for all devices. In my previous environments I gave the iLOs and Dell DRACs DNS names made up of the name of the server they're associated with, prefixed with the type of out-of-band management card, e.g. 'ilo-servername.domain.ext' or 'drac-otherserver.domain.ext'. I then created a CNAME alias so I didn't need to remember which devices were HP and which were Dell. That CNAME was 'oob.servername.domain.ext'. This gives an easy-to-remember, predictable format suitable for scripting.
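To make the naming idea concrete, here is a minimal sketch of how a script could derive the out-of-band names from a node name. The function name oob_hostnames and the sample inputs are made up for illustration; the name patterns are simply the ones described above, and nothing here actually talks to an iLO or DRAC.

```python
def oob_hostnames(node_caption: str, domain: str = "domain.ext") -> dict:
    """Build out-of-band management DNS names using the convention described above."""
    # Keep only the short host name, lower-cased, in case the caption is an FQDN.
    # Assumes the caption is a host name, not an IP address.
    server = node_caption.strip().split(".")[0].lower()
    return {
        "ilo": f"ilo-{server}.{domain}",     # HP iLO record
        "drac": f"drac-{server}.{domain}",   # Dell DRAC record
        "alias": f"oob.{server}.{domain}",   # vendor-neutral CNAME
    }


names = oob_hostnames("SERVERNAME")
print(names["alias"])
```

A reboot script would then only ever need the "alias" entry, since the CNAME hides whether the card behind it is an iLO or a DRAC.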