Network monitoring tracks the state of the network, primarily looking for faults. At the most basic level, we want to know if devices and interfaces are "up." This is a simple binary reachability test: your device is either reachable or not, either "up" or "down." However, just because a device is reachable does not mean there are no faults in the network. If a circuit is dropping packets, performance suffers, and the circuit may be unusable even though it is "up." It is time to stop thinking in terms of reachability and start thinking in terms of availability.

Availability is a service-oriented concept that asks, "Is the service this widget provides available to its users?" Is the service 100% available, or is it degraded in some way? Here are some examples of situations that simple reachability monitoring has difficulty detecting:

  • A circuit is dropping packets somewhere in your WAN provider's network. It is "up," but throughput is reduced.
  • A circuit is congested and latency has shot through the roof. The circuit is still "up." There may not be anything technically wrong with the circuit, but it isn't really usable to the end users.
  • A router is using 100% of its memory. It is processing packets slowly or perhaps it is not able to add new routes. It may still be "up."
  • An Ethernet interface in a port aggregation group is down, or perhaps one was blocked by Spanning Tree. Although an interface is down, from the packet perspective everything is still "up."

In the first two cases, you will probably hear about it from the end users. In the last two cases, you might not know about them until something else changes in the network that causes a (possibly confusing) outage. And probably a bunch of trouble tickets.
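
To make the difference concrete, here is a minimal sketch in Python of an availability check that looks past reachability. Everything in it is an assumption for illustration: the `Sample` fields, the `assess` function, and the threshold values are hypothetical, not taken from any particular NMS. Real thresholds depend on the service each circuit or device supports.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DOWN = "down"
    DEGRADED = "degraded"
    AVAILABLE = "available"


@dataclass
class Sample:
    """One polling result for a device or circuit (hypothetical fields)."""
    reachable: bool
    loss_pct: float = 0.0      # packet loss, percent
    latency_ms: float = 0.0    # round-trip latency, milliseconds
    memory_pct: float = 0.0    # memory in use, percent
    members_up: int = 1        # active links in an aggregation group
    members_total: int = 1     # configured links in that group


# Illustrative thresholds only; tune them to the service being delivered.
MAX_LOSS_PCT = 1.0
MAX_LATENCY_MS = 150.0
MAX_MEMORY_PCT = 90.0


def assess(sample: Sample) -> tuple[Status, list[str]]:
    """Return a status plus the reasons behind any degradation."""
    if not sample.reachable:
        return Status.DOWN, ["unreachable"]

    reasons = []
    if sample.loss_pct > MAX_LOSS_PCT:
        reasons.append("packet loss")
    if sample.latency_ms > MAX_LATENCY_MS:
        reasons.append("high latency")
    if sample.memory_pct > MAX_MEMORY_PCT:
        reasons.append("memory pressure")
    if sample.members_up < sample.members_total:
        reasons.append("redundancy lost")

    return (Status.DEGRADED if reasons else Status.AVAILABLE), reasons
```

A plain reachability test would report every reachable sample as "up." Here, a router that answers polls while using 95% of its memory, or an aggregation group running on one of two links, comes back as degraded along with the reason, so it can be fixed before the next change in the network turns it into an outage.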

Are you thinking in terms of availability or reachability? Is your NMS configured to match your mindset?