I have a server that is locking up. Unfortunately, the only way I normally find out that it is in an unresponsive state is when I try to log in to it. While it is locked up, you can still ping it and none of the component monitors that I have monitoring the server appear down. The server's main function is a backup server, so obviously when I try to log in to it in the morning to pull tapes, that is when I find out that I don't have backups from the previous night because the server was locked up. This doesn't happen every night, but it has happened enough to become a nuisance.
Does an APM component already exist to look for whether or not you can remote desktop into a server? If not, is there some way to build a component monitor for this?
Any and all help is greatly appreciated.
This seems to be a difficult thing to check, because there are several different bits of the logon process which can freeze. Just checking port 3389 is totally insufficient.
Nagios has a Python check_x224 script to check the early parts of the login process - I've used it (with Nagios) with varying levels of success for a year or so. The only time I think it has alerted is when the firewall on a server became enabled. It has missed several instances of frozen RDP sessions.
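For anyone who wants to see what that kind of check involves, here is a rough Python sketch of the same idea. It is not the actual check_x224 script. It sends an X.224 Connection Request with an RDP negotiation request attached and checks whether the reply is a Connection Confirm. The function names, the requested-protocol flags (TLS plus CredSSP), and the timeouts are my own assumptions. Like the Nagios check, it only exercises the first step of the logon, so a server that freezes later in the process will still look healthy.

```python
import socket
import struct


def build_x224_connection_request(requested_protocols=0x00000003):
    """Build a TPKT-wrapped X.224 Connection Request with an RDP negotiation request."""
    # RDP negotiation request: type 0x01, flags 0, length 8, requested protocols
    # (0x3 = TLS + CredSSP; this value is an assumption for the sketch).
    neg_req = struct.pack("<BBHI", 0x01, 0x00, 8, requested_protocols)
    # X.224 CR TPDU: code 0xE0, destination ref 0, source ref 0, class 0.
    x224_fixed = struct.pack(">BHHB", 0xE0, 0, 0, 0)
    # The length indicator counts everything after itself.
    length_indicator = len(x224_fixed) + len(neg_req)
    x224 = bytes([length_indicator]) + x224_fixed + neg_req
    # TPKT header: version 3, reserved 0, total length including the header.
    tpkt = struct.pack(">BBH", 3, 0, 4 + len(x224))
    return tpkt + x224


def is_connection_confirm(response):
    """Return True if the bytes look like a TPKT frame carrying an X.224 Connection Confirm."""
    return (
        len(response) >= 6
        and response[0] == 3
        and (response[5] & 0xF0) == 0xD0
    )


def check_rdp_x224(host, port=3389, timeout=5.0):
    """Return True if host answers an X.224 Connection Request with a Connection Confirm."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_x224_connection_request())
            response = b""
            while len(response) < 6:
                chunk = sock.recv(64)
                if not chunk:
                    break
                response += chunk
            return is_connection_confirm(response)
    except OSError:
        # Covers refused connections, timeouts, and resets.
        return False
```

Calling check_rdp_x224("servername") from a scheduled script and alerting on False would be one way to wire it up.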
I find that memory usage (paged and non-paged pools) is easier to monitor, and it can prevent RDP sessions from freezing by prompting a preventative reboot when 32-bit servers reach about 130 MB or more of non-paged pool usage.
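If you want to script that threshold rather than eyeball it, something like the following Python sketch might do. It assumes the script runs on the Windows server itself, reads the "\Memory\Pool Nonpaged Bytes" performance counter through the typeperf command, and treats 130 MB as 130 * 1024 * 1024 bytes. The function names and the exact limit are my own choices, so adjust them to taste.

```python
import csv
import io
import subprocess

# Assumed threshold: roughly the 130 MB figure, expressed in bytes.
NONPAGED_LIMIT_BYTES = 130 * 1024 * 1024


def parse_typeperf_value(output):
    """Return the first numeric sample found in typeperf's CSV output, or None."""
    for row in csv.reader(io.StringIO(output)):
        if len(row) == 2:
            try:
                return float(row[1])
            except ValueError:
                # Header row or status text, not a sample.
                continue
    return None


def nonpaged_pool_bytes():
    """Take one sample of the non-paged pool counter on the local machine."""
    result = subprocess.run(
        ["typeperf", r"\Memory\Pool Nonpaged Bytes", "-sc", "1"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_typeperf_value(result.stdout)


def needs_reboot(value, limit=NONPAGED_LIMIT_BYTES):
    """Return True when a sample was read and it is at or above the limit."""
    return value is not None and value >= limit
```

Running needs_reboot(nonpaged_pool_bytes()) on a schedule and alerting when it returns True gives you warning before the session freezes, instead of finding out at logon.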
Well, you could try a TCP Port Monitor on 3389 to check whether the MS RDP port is open. Next time it locks up, try the following command from your desk:
telnet servername 3389
That will try to connect on port 3389. If you get a connection, the server is listening; if it fails, you can use that as a trigger.
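The same test can be scripted if you would rather not type the telnet command by hand. Here is a minimal Python sketch. The function name and the 5-second timeout are my own assumptions, and it only proves that something is listening on the port, nothing more.

```python
import socket


def rdp_port_open(host, port=3389, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False
```

Calling rdp_port_open("servername") returns True or False, which you can feed into whatever alerting you already have.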
Other than RDP, have you tried to browse to a share (Start > Run > \\servername\someshare) or remotely manage it with Computer Management or other Windows management tools? You could always create an APM monitor to watch for the existence of a particular directory or file and trigger an alert if it goes into an "Unknown" state for more than 10 minutes (provided the node is "Up").
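Outside of APM, the file or directory check could be roughed out in Python along these lines. This is only a sketch: the function name, the three state labels, and the 10-second timeout are assumptions of mine. The probe runs in a daemon thread because a locked-up server may never answer, and a call that has not returned by the timeout is reported as "unknown".

```python
import os
import threading


def path_state(path, timeout=10.0):
    """Return "up" if path exists, "down" if it does not, "unknown" if the check hangs."""
    result = {}

    def probe():
        result["exists"] = os.path.exists(path)

    # Daemon thread so a hung network call cannot keep the script alive.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return "unknown"
    return "up" if result.get("exists") else "down"
```

Pointing it at something like r"\\servername\someshare" from a scheduled task and alerting when it stays "unknown" for several runs in a row would mirror the 10-minute rule above.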
I've had servers show similar issues and generally when it locked up I also could not remotely manage it or poll it with SNMP/WMI. On one particular tape media server I would get an alert that the local disks were in an unknown state. Now there are many reasons why a node/interface/volume could go into an unknown state, but every time I got them from that server the system was locked up.
I'm not aware of an application template monitor that can simulate an RDP session and alert if it fails. However, here are a couple of other suggestions:
Hopefully, one of these is a feasible solution for you. Good luck.