35 Replies Latest reply on Dec 12, 2014 9:49 AM by jkump

    Do you monitor your systems bidirectionally? Ingress and egress traffic to and from your important applications?

    Nonapeptide

      This question is a near kinsman to my last question "When monitoring service uptime and availability, do you use geographically diverse monitoring stations? Can you have too many?" The reason is that, while I had a litany of external monitors keeping a close watch on my datacenter space, it still wasn't enough! Oh, it was enough from a quantity standpoint, but not from a quality standpoint.

       

      You see, I had a little problem with downtime at my datacenter. My external monitors would show unexplained drops lasting one to three minutes a few times per month, sometimes a couple of times per week. This concerned me, and one client had the misfortune of attempting to get to their site (hosted on a server in the rack) at the exact moment one of these mysterious outages hit. It lasted for 30 seconds or so, and availability came right back.

       

      I presented my datacenter with an avalanche of traceroute information and awaited a response. The response was... unexpected. In essence I received a reply that said "Our systems aren't reporting any downtime. No other clients are complaining." and the subject was dropped rather abruptly.

       

      One thing I couldn't prove or disprove was whether or not traffic leaving my rackspace was successfully reaching any destinations. I had no monitoring station within the rackspace, using my business switches, routers, and firewalls, that was testing connections to external services. Not even a simple ping to Google.com. I had focused so heavily on ingress traffic that I ignored egress traffic entirely. If I could have shown simultaneous ingress and egress monitors with correlating downtimes, and traceroutes that starred out at the same hop, the evidence would have been irrefutable.
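
      To make that concrete, here is a minimal sketch of the kind of egress check and downtime correlation described above. It is an illustration only, not something that was running at the time: the function names, the choice of a TCP connect on port 443 instead of a ping, and the outage timestamps are all assumptions.

```python
import socket
from typing import List, Tuple

# (outage_start, outage_end) as Unix timestamps; hypothetical record format
Interval = Tuple[float, float]


def egress_ok(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection from this machine to host:port succeeds.

    Run from inside the rack, a False result means outbound traffic
    (or name resolution) failed at that moment.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, and DNS failures
        return False


def correlate_outages(ingress: List[Interval], egress: List[Interval]) -> List[Interval]:
    """Return the time windows during which both directions were down.

    Assumes the intervals within each list do not overlap one another.
    """
    ingress = sorted(ingress)
    egress = sorted(egress)
    overlaps: List[Interval] = []
    i = j = 0
    while i < len(ingress) and j < len(egress):
        start = max(ingress[i][0], egress[j][0])
        end = min(ingress[i][1], egress[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever outage ends first.
        if ingress[i][1] < egress[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Made-up outage windows recorded by the external and internal monitors.
ingress_outages = [(100.0, 160.0), (500.0, 620.0)]
egress_outages = [(90.0, 150.0), (300.0, 330.0), (510.0, 700.0)]
shared = correlate_outages(ingress_outages, egress_outages)
```

      Windows that appear in the result are the ones worth pairing with traceroutes from both sides, since they show traffic failing in both directions at the same time.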

       

      When monitoring your important systems and applications, do you perform bidirectional tests? Do you make sure that traffic can travel both into and out of your systems? Do you correlate downtimes between the two traffic directions? Furthermore, have you ever had to present this kind of evidence to a service provider to prove an outage?