The concept of server and service monitoring has been on my mind for a little while. I'd like to know what it takes for us SysAdmins and NetAdmins to consider a server, appliance, or application to be truly "up" and available.
I think each of us, at the beginning of our IT careers, probably wrote a script that pinged some important servers or appliances and reported back with either "It's up!" or "PANIC!!" Of course, we eventually noticed that sometimes a user-facing service isn't responding even though pings are being returned. That is when we realize ICMP isn't the best judge of service availability. An ICMP reply tells us that the host answers ICMP and that the network path between the pinger and pingee is functioning at a basic level, and that's about it.
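As a first step beyond ping, a check can at least confirm that the service's TCP port accepts connections. Here is a minimal Python sketch of that idea; the function name `tcp_port_open` and the default timeout are my own choices for illustration, not taken from any particular monitoring tool.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Even this only proves that something is listening on the port. It says nothing about whether the application behind it returns sane answers.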
I know a colleague who does not consider one of his database servers to be "up" until it has passed the following checks:
- It answers ping
- It answers SNMP queries for host-resource information
- It answers DB sanity queries
- It answers DB status queries
- It answers business-logic trending queries
Notice the escalating manner in which the services are checked. ICMP is a drop-dead simple check that confirms basic IP reachability. Then various important host resources are checked via SNMP. After that, checks are performed on the actual service that the server was provisioned for in the first place. Those checks increase in complexity until finally it can rightfully be said that, if the checks return correct data, the service is truly up.
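One way to picture that escalation is as an ordered list of checks that stops at the first failure, so the report says how far up the stack the server got. The sketch below is only an illustration: `first_failure` is a hypothetical helper, and the lambdas are placeholders standing in for real ping, SNMP, and database checks.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first one that fails.

    Returns None when every check passes. A check that raises an
    exception is treated as a failure.
    """
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return name
    return None


# Placeholder checks, ordered from simplest to most service-specific.
checks: List[Check] = [
    ("ping", lambda: True),
    ("snmp host resources", lambda: True),
    ("db sanity query", lambda: False),  # pretend the database is unhappy
    ("db status query", lambda: True),
    ("business-logic query", lambda: True),
]

failed = first_failure(checks)
print("up" if failed is None else f"down at: {failed}")
```

Stopping at the first failure keeps the expensive queries from running against a host that is already known to be unhealthy.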
So how does everyone else check to make sure that their vital services are truly up? I'm curious to know how in-depth people's checks are with each service that they offer.
For example, if you have an email service (notice, I didn't say "server" since the entire service can rely on multiple servers), do you simply have a script that calls telnet and banner grabs the SMTP service? Or do you have a test mailbox that you send test messages to and then retrieve them from over POP3, IMAP, and/or MAPI? Do you test the availability of your webmail front end? Do you test the login forms?
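For reference, the simple banner-grab approach might look something like the following Python sketch. The function names are hypothetical, and it reads only the first chunk of the greeting, which is enough to see the reply code. An SMTP server that is ready to talk greets the client with reply code 220.

```python
import socket


def is_ready_banner(banner: str) -> bool:
    """An SMTP server signals readiness with a 220 reply code."""
    return banner.startswith("220")


def grab_smtp_banner(host: str, port: int = 25, timeout: float = 5.0) -> str:
    """Connect to host:port and return the first chunk of the greeting."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(1024).decode("ascii", errors="replace").strip()
```

A 220 banner shows that the SMTP listener is alive, but not that mail is actually accepted, queued, and delivered, which is what the test-mailbox approach tries to prove.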
How about your websites? Are you pinging the server, checking to see if Apache or IIS is running, or actually checking for page loads and form responses? Do you consider your website up if it responds at all, or is it considered down or critical if response times exceed a set threshold?
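To make the question concrete, here is a rough Python sketch of a page-load check that looks at the status code, the page content, and the response time. Everything in it is an assumption for illustration: the names `classify` and `check_page`, the three states "up", "slow", and "down", and the two-second default threshold.

```python
import time
import urllib.error
import urllib.request


def classify(status: int, elapsed: float, body: str,
             expected_text: str, slow_after: float) -> str:
    """Turn one page-load result into "up", "slow", or "down"."""
    if status != 200 or expected_text not in body:
        return "down"
    if elapsed > slow_after:
        return "slow"
    return "up"


def check_page(url: str, expected_text: str,
               slow_after: float = 2.0, timeout: float = 10.0) -> str:
    """Fetch url, time the full download, and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # HTTP error statuses, refused connections, and timeouts land here.
        return "down"
    elapsed = time.monotonic() - start
    return classify(status, elapsed, body, expected_text, slow_after)
```

Looking for a known string in the body catches the case where the web server happily returns 200 for an error page, and the "slow" state gives you room to alert before users start calling.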