
What does it take to satisfy you when testing if a service is available or not?

The concept of server and service monitoring has been on my mind for a little while. I'd like to know what it takes for us SysAdmins and NetAdmins to consider a server, appliance, or application to be truly "up" and available.

I think each of us, in the beginning of our IT careers, probably made a script that pinged some important servers or appliances and reported back with either "It's up!" or "PANIC!!" Of course, we eventually noticed that sometimes a user-facing service isn't responding in spite of pings being returned. That's when we realized ICMP isn't the best judge of service availability. ICMP returns tell us that ICMP is working and that's about it (and, of course, that the network path between the pinger and pingee is functioning at a basic level).
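
For illustration, that early-career check usually looked something like this rough sketch (the hostnames are placeholders, and the ping flags shown are the Linux/macOS ones; Windows uses -n and -w):

```python
# A minimal sketch of the classic "ping it and panic" check.
import subprocess

HOSTS = ["dbserver01.example.com", "mailserver01.example.com"]  # placeholders

for host in HOSTS:
    # -c 1: send a single echo request; -W 2: wait up to two seconds for a reply
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    status = "It's up!" if result.returncode == 0 else "PANIC!!"
    print(f"{host}: {status}")
```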

I know a colleague who does not consider one of his database servers to be "up" until it has passed the following checks:

  1. It answers ping 
  2. It answers SNMP queries for host-resource information
  3. It answers DB sanity queries 
  4. It answers DB status queries 
  5. It answers business-logic trending queries

Notice the escalating manner in which the services are checked. ICMP is a drop-dead simple check that assures basic IP reachability. Then various important host resources are checked via SNMP. After that, checks are performed on the actual service that the server was provisioned for in the first place. Those checks increase in complexity until finally it can rightfully be said that, if the checks return correct data, the service is truly up.
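
To make the escalation concrete, here's a rough sketch of how those layers might be chained together, assuming a PostgreSQL back end, the net-snmp command-line tools, and the psycopg2 library; every hostname, credential, OID, and query below is a stand-in:

```python
# A rough sketch of the escalating checks: ICMP, then SNMP host resources,
# then increasingly deep database queries. All values are stand-ins.
import subprocess
import psycopg2

HOST = "dbserver01.example.com"

def ping_ok():
    # Layer 1: basic IP reachability (Linux-style ping flags).
    return subprocess.run(["ping", "-c", "1", "-W", "2", HOST],
                          capture_output=True).returncode == 0

def snmp_ok():
    # Layer 2: host-resource data via SNMP (hrSystemUptime.0 as an example OID).
    return subprocess.run(["snmpget", "-v2c", "-c", "public", HOST,
                           "1.3.6.1.2.1.25.1.1.0"],
                          capture_output=True).returncode == 0

def db_ok():
    # Layers 3-5: sanity, status, and business-logic trending queries, in order.
    queries = [
        "SELECT 1",                               # sanity
        "SELECT count(*) FROM pg_stat_activity",  # status
        # trending query against a hypothetical business table
        "SELECT count(*) FROM orders WHERE placed_at > now() - interval '1 hour'",
    ]
    try:
        conn = psycopg2.connect(host=HOST, dbname="prod", user="monitor",
                                password="placeholder", connect_timeout=5)
        with conn, conn.cursor() as cur:
            for query in queries:
                cur.execute(query)
                cur.fetchone()
        conn.close()
        return True
    except psycopg2.Error:
        return False

if ping_ok() and snmp_ok() and db_ok():
    print("Service is truly up")
else:
    print("Service degraded or down")
```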

So how does everyone else check to make sure that their vital services are truly up? I'm curious to know how in-depth people's checks are with each service that they offer.

For example, if you have an email service (notice, I didn't say "server", since the entire service can rely on multiple servers), do you simply have a script that telnets to the SMTP service and grabs the banner? Or do you have a test mailbox that you connect to via POP3, IMAP, and/or MAPI and send test messages from? Do you test the availability of your webmail front end? Do you test the login forms?
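
As a point of comparison, a couple of those mail checks might be sketched roughly like this; the hostnames, ports, and mailbox credentials are placeholders, and only the standard library is assumed:

```python
# A sketch of two of the mail-service checks: grab the SMTP banner, then
# log in to a test mailbox over POP3.
import smtplib
import poplib

def smtp_banner_ok(host="mail.example.com"):
    try:
        server = smtplib.SMTP(host, 25, timeout=10)  # reads the 220 banner on connect
        code, _ = server.ehlo()                      # 250 means the server is talking SMTP
        server.quit()
        return code == 250
    except (OSError, smtplib.SMTPException):
        return False

def pop3_login_ok(host="mail.example.com"):
    try:
        box = poplib.POP3_SSL(host, 995, timeout=10)
        box.user("monitor-test")                     # hypothetical test mailbox
        box.pass_("placeholder-password")
        box.stat()                                   # proves the mailbox is readable
        box.quit()
        return True
    except (OSError, poplib.error_proto):
        return False

print("SMTP banner:", smtp_banner_ok(), "| POP3 login:", pop3_login_ok())
```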

How about your websites? Are you pinging the server, checking to see if Apache or IIS is running, or are you actually checking for page loads and form responses? Do you consider your website up if it responds at all, or is it considered down or critical if response times exceed a set limit?
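
A page-load check that treats a slow response as critical rather than "up" might look roughly like this; the URL, the expected page text, and the thresholds are placeholders, and the third-party requests library is assumed:

```python
# A sketch of a page-load check that flags slow responses instead of
# calling anything that answers "up".
import requests

URL = "https://www.example.com/login"
WARN_SECONDS = 2.0
CRIT_SECONDS = 5.0

try:
    resp = requests.get(URL, timeout=CRIT_SECONDS)
    elapsed = resp.elapsed.total_seconds()
    if resp.status_code != 200 or "Log in" not in resp.text:
        print("DOWN: unexpected status code or missing page content")
    elif elapsed > WARN_SECONDS:
        print(f"WARNING: page loaded, but took {elapsed:.2f}s")
    else:
        print(f"OK: page loaded in {elapsed:.2f}s")
except requests.RequestException:
    print("CRITICAL: request failed or exceeded the timeout")
```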

  • This is just as important when talking about SLAs from service providers. Many just guarantee "it's up", but don't even look at the level of service (if any) being provided.

    I wonder, too, about the impact of virtual networks. Might we see a time when a virtual IP gateway (VXLAN) responds to pings even when the machine is down/slow/munged?

  • Ohh, that's a good point! A service provider could say "Hey, your dedicated server was up the whole time our core router was in flames!" That's a bit facetious of me, but when you look at it in terms of "service availability" rather than just "it's pinging," you realize that an application or service is much more vulnerable than previously thought.

    If you'll pardon me, I have some new SNMP traps that I need to make. 

  • Maybe I am in the minority here, but sometimes I do rely on the users in my environment to help in testing a service for uptime, especially after an outage.  Sure, there are things that need to be online (or back online) before the users notice, but they tend to be a pretty good notification as to the operation of a service (or server)... in the event that an email or SMS message is missed.

  • I think it's very important to remember that an SLA has no bearing on reality. 

    It is a contractual agreement for the financial penalties the service provider incurs when the service levels fall below a certain point. There is no guarantee, explicit or implied, that the service will be available for periods of time exceeding the SLA. 

  • But doesn't the SLA provide a perceived guarantee? That a company should expect a certain level of service and complain when these perceptions aren't met by the service provider?

  • You're not in the minority, no. I rely on that kind of "monitoring" too - but I think the ideal scenario, and the one we should all strive for, is to know about things before users do. Perhaps I've drunk the ITIL / MOF / ITUP happy-flavored-poison, but I think it's within the realm of possibility to be able to test and alert on virtually any system for virtually any positive response and then send up warning flares at the first sign of any non-positive response.

    However, the time and products necessary to make that ideal come to pass are often not attainable. So yes, we're then back to using the old method of "See if anyone hollers." =)

  • That's right according to the letter of the law - but perhaps misses the spirit? Yes, the letter states that, in essence, "Any service provided that does not meet X standards will allow Y remuneration in X fashion." The spirit is therefore that the service provider will "guarantee" uptime above the penalty level. To say that it has no bearing on reality? Is that not just a tad overstated?

    And that brings up the tertiary topic of SLAs. Ohhh my, how you've made my mind spin. More on this topic later, so keep that arrow nocked. =)

  • This is a really great question and I love seeing this type of thinking here in the forums.

    In my experience, having worked at a managed services provider for over 10 years, for a service to be "available" it needs to be available for the users to interact with, and it must be responsive enough for a good user experience.

    For this to be true, all of the systems behind the scenes need to be functioning properly.  When it comes to monitoring, there are two different perspectives, and different purposes behind those perspectives.

    Perspective 1: User Experience

    You need to monitor the user experience to make sure the service is available and performing well for users.  Data from this will help you maintain a high quality user experience.

    Perspective 2: Individual Components

    You need to monitor the individual components behind the scenes that provide the user experience so that when something fails, you know what that something is.  Data here is designed to help you improve MTTR and minimize downtime.

    Ultimately, to satisfy me when it comes to monitoring, I need both of these perspectives.  These are my thoughts on the matter, but I look forward to hearing others'!
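
    As a rough illustration of keeping both perspectives side by side, something like the sketch below tags every check as either user-experience or component, so an outage report shows both what users feel and which piece broke; the check names are invented and the probes are placeholders for real tests:

    ```python
    # A toy sketch: tag each check as user-experience or component so a
    # report shows both what users feel and which piece broke. The probes
    # are placeholders that would call real tests.
    CHECKS = [
        ("user-experience", "webmail login page loads", lambda: True),
        ("user-experience", "test message round-trips", lambda: True),
        ("component",       "SMTP banner on mail01",    lambda: True),
        ("component",       "mail store disk space",    lambda: True),
    ]

    for perspective, name, probe in CHECKS:
        status = "OK" if probe() else "FAIL"
        print(f"[{perspective:15}] {name}: {status}")
    ```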

  • In my experience, having worked at a managed services provider for over 10 years, for a service to be "available" it needs to be available for the users to interact with, and it must be responsive enough for a good user experience.

    Perspective 1: User Experience

    You need to monitor the user experience to make sure the service is available and performing well for users.  Data from this will help you maintain a high quality user experience.

    That's the approach that I finally settled on. I like to step through a service from the user's perspective and then, knowing what systems each step relies on, I peek behind the curtain and...

    Perspective 2: Individual Components

    You need to monitor the individual components behind the scenes that provide the user experience so that when something fails, you know what that something is.  Data here is designed to help you improve MTTR and minimize downtime.

    ...break down the service's availability based on the technology that supports it. It can, of course, make dependency chains nightmarishly complex, so I'm still perfecting my methods. I don't think it's good to have two styles of monitoring coexist: for example, monitoring all networking equipment as a monolithic unit while also monitoring the switches, ports, links, and routes that a specific service relies on. Then you get tons of alerts when stuff breaks (Nagios, anyone?).

    Then again, since tons of services rely on, say... a core router, it's of no use to have a hundred "Service Down!" alerts hit you in the face if one of the supervisors falls over (that's "supervisor" in the networking sense, and not in the "business management" sense - although if the former falls over, the latter tends to follow suit).
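
    For what it's worth, a dependency-aware pass over the alerts can keep that flood down to a single root-cause message; here's a rough sketch where the topology map and poll results are entirely invented:

    ```python
    # A rough sketch of dependency-aware alerting: if a parent device is
    # already down, the child "Service Down!" alerts get rolled up into a
    # single root-cause message. Topology and poll results are invented.
    DEPENDS_ON = {
        "email-service": "core-router-1",
        "crm-webapp":    "core-router-1",
        "file-shares":   "core-router-1",
        "core-router-1": None,
    }

    STATUS = {  # pretend poll results
        "core-router-1": "down",
        "email-service": "down",
        "crm-webapp":    "down",
        "file-shares":   "down",
    }

    alerts = []
    for node, state in STATUS.items():
        if state != "down":
            continue
        parent = DEPENDS_ON.get(node)
        if parent and STATUS.get(parent) == "down":
            continue  # the parent outage explains this one; stay quiet
        alerts.append(f"ALERT: {node} is down")

    print("\n".join(alerts))  # only the core router alert fires
    ```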

    I think tomes could be written on the various methodologies. I should search for a publisher... =)

  • It can, of course, make dependency chains nightmarishly complex, so I'm still perfecting my methods. I don't think it's good to have two styles of monitoring coexist: for example, monitoring all networking equipment as a monolithic unit while also monitoring the switches, ports, links, and routes that a specific service relies on. Then you get tons of alerts when stuff breaks (Nagios, anyone?).

    I completely agree on the difficulty in maintaining a dependency chain, especially in an environment like ours where things are constantly changing.  What I have found helpful is monitoring things for status and performance data but not necessarily configuring alerts for every little thing.  This helps me avoid hundreds of alerts (as much as that can be avoided; it's a constant improvement process); however, I still have the important data for troubleshooting when necessary.

    In the cases where I do get a large quantity of alerts (because it does still happen), I have learned to embrace it instead of hating it.  I take a look at all of the alerts that occurred at about the same time and try to figure out what they all have in common; normally something will jump out at me... it's the human correlation engine!
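
    That human correlation can be roughed out in code, too; the sketch below just clusters alerts that arrive within a few minutes of each other so they can be eyeballed as one incident, and the alert list and window size are made up:

    ```python
    # A sketch of the "human correlation engine" in code: cluster alerts
    # that arrive within a short window of each other so they can be
    # reviewed as one incident. The alert list and window size are made up.
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)

    alerts = [  # (timestamp, message) as they might come out of a log
        (datetime(2012, 6, 1, 3, 12), "core-router-1 unreachable"),
        (datetime(2012, 6, 1, 3, 13), "email-service down"),
        (datetime(2012, 6, 1, 3, 14), "crm-webapp down"),
        (datetime(2012, 6, 1, 9, 45), "disk space low on backup01"),
    ]

    alerts.sort(key=lambda alert: alert[0])
    incidents = []
    for timestamp, message in alerts:
        if incidents and timestamp - incidents[-1][-1][0] <= WINDOW:
            incidents[-1].append((timestamp, message))
        else:
            incidents.append([(timestamp, message)])

    for number, incident in enumerate(incidents, start=1):
        print(f"Incident {number}:")
        for timestamp, message in incident:
            print(f"  {timestamp:%H:%M}  {message}")
    ```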

    However, at the end of the day I have to agree that the User Experience perspective certainly seems to yield better results with less effort.