We have a few Citrix servers that will go "off in the weeds," if you will, at random. This typically happens only once every 3-4 months, but when it does it is unpredictable and hard to monitor or alert on. Users are still able to log in, launch applications, etc., but they receive a plethora of error messages.
We originally tried monitoring services; however, 9 times out of 10 all services and processes are running properly.
The last 3 times, however, we noticed that we stopped getting any WMI data from the server at the time the issues started. Unfortunately, the WMI calls themselves still appear to be working, because the node does not go into a down/unknown state when this occurs, so we can't alert on being unable to connect to the server.
We are trying to come up with a method by which we can say "we have not received CPU data from Machine X in Y minutes" and trigger an alert based on that.
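To make the idea concrete, here is a rough sketch in Python of the kind of check we have in mind. It is only an illustration: the `find_stale_nodes` helper, the node names, and the 15-minute threshold are all made up, and how the "last CPU sample" timestamps would be pulled out of the monitoring system is left out entirely.

```python
from datetime import datetime, timedelta, timezone


def find_stale_nodes(last_cpu_sample, max_age, now=None):
    """Return the names of nodes whose newest CPU sample is older than max_age.

    last_cpu_sample: dict of node name -> datetime of the last CPU sample
                     (None if no sample has ever been received).
    max_age:         timedelta, the "Y minutes" threshold.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stale = []
    for node, last_seen in last_cpu_sample.items():
        # A node that has never reported is treated as stale as well.
        if last_seen is None or now - last_seen > max_age:
            stale.append(node)
    return sorted(stale)


# Hypothetical data: in practice these timestamps would come from
# wherever the polled CPU statistics are stored.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
samples = {
    "CTX01": now - timedelta(minutes=2),   # reporting normally
    "CTX02": now - timedelta(minutes=45),  # no CPU data for 45 minutes
    "CTX03": None,                         # never reported
}

stale_nodes = find_stale_nodes(samples, timedelta(minutes=15), now=now)
print(stale_nodes)
```

With the sample data above, CTX02 and CTX03 would be flagged and CTX01 would not. The alert itself would then just fire for any node in that list.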
We are currently building out a new Citrix environment, so this will hopefully not be a problem in the future. In the meantime, we are looking for a stopgap to catch the issue and bring it to resolution more quickly.
Any thoughts/tips would be appreciated.
Thanks!