8 Replies Latest reply on Mar 1, 2012 3:08 PM by mdriskell

    Do you gather evidence on a service provider to keep them accountable?

    Nonapeptide

      Last week I bought a new virtual server to hold clients that rapidly expand or have sudden bursts of traffic (CloudFlare FTW!). While reviewing the service provider, I started thinking about... you guessed it, two earlier threads: "Service level agreements - do they ever *not* give us headaches?" and "Re: What does it take to satisfy you when testing if a service is available or not?!"

      Except I'm revising my thoughts on how to blend those two concerns. Here's my question:

      How far do you go when monitoring a service provider's infrastructure?

      Currently, my uptime/service-checker-du-jour is Pingdom. They do a good job for the basics and I don't have to host the service on my own infrastructure. They're a trusted service with major brand recognition and are independent of me and my own business so a service provider would have a harder time disputing any of the results I show them. The trouble is that I pay based on the number of checks that I use. So in essence, I'm paying a fee for a service that checks to make sure I'm getting what I'm paying for out of another service provider. Even if I had my own infrastructure set up for monitoring (I might make a tricked out monitoring server of my own), I'd still be using my time and effort to check the service provider.

      The idea is that, in the case of service degradation or an outright outage, it's best to come to the table with your own data, especially if it's seriously cross-checked and thoroughly inclusive of many different data points. I've heard of SLAs that demand a certain amount of your own data before the provider will "pay out."
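
      A minimal sketch of what "your own data" could boil down to: given timestamped results from an independent probe, compute the observed availability and compare it with an SLA target. The `Check` record layout, the function names, and the 99.9% target are assumptions for illustration only, not any monitoring service's export format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Check:
    at: datetime  # when the probe ran
    ok: bool      # True if the service responded correctly


def availability(checks):
    """Fraction of checks that succeeded, from 0.0 to 1.0."""
    if not checks:
        raise ValueError("no checks recorded")
    return sum(c.ok for c in checks) / len(checks)


def meets_sla(checks, target=0.999):
    """True if observed availability is at or above the (assumed) target."""
    return availability(checks) >= target
```

      With one check per minute over a day (1,440 checks), three failed checks already put the observed availability below a 99.9% target, which is the kind of figure worth having on record before a dispute.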

      The problem is that I get a bit skeeved that I'm having to spend time and money to ensure that I'm getting what I'm paying for and to make it more likely that, in the event of an SLA infraction, my business and clients get treated properly. What do you think?

      Do you monitor your service provider's equipment? Do you keep an eye on as much of their forward-facing infrastructure as possible and keep historical data? Do you prefer to come to the table with as much of your own observations about the service provider's performance as possible? Have you ever been put in an awkward situation by a service provider whose monitoring data did not match your experience with them?

      Tell me about your perspective and experiences. Maybe I'm just paranoid. Gotta go... I think I hear the black helicopters getting closer.

        • Re: Do you gather evidence on a service provider to keep them accountable?
          mjchacko

          I wonder how many VMs you have to have in the cloud before you start thinking, there's too much at stake here and I should have my own network performance software in the cloud? Having an Orion VM in the cloud monitoring your other VM servers seems like one solution. Wondering if this has been thought of before (and dismissed before implementing) or tried before (with good/bad/ugly results).

          Either way, I would be interested to hear comments :)

          ~MC

            • Re: Do you gather evidence on a service provider to keep them accountable?
              Nonapeptide

              I wonder how many VMs you have to have in the cloud before you start thinking, there's too much at stake here and I should have my own network performance software in the cloud?

              My preference would be to have a physical monitoring node hosted in a facility completely separate from the providers that I'm testing. However, now that you mention it, a second monitoring node in the same cloud as my VMs would also be a good tool, because the metrics it sees would take the provider's internal network into account. Then again, since it's a cloud, who knows what infrastructure changes will take place underneath the VM or where the VM will live over the course of its life. It's also unclear whether any metrics it measures can mean much without more insight into the virtual infrastructure than a provider is likely to offer.

              I think it has merit though.

            • Re: Do you gather evidence on a service provider to keep them accountable?
              smartd
              We have a 550-site MPLS network that is fully managed.  We have a 4-hour SLA with them.  They will do everything they can to put a site on "customer time" to turn off the SLA clock.  We use the user-contributed "Node Down Time" and event logs to watch our routers (using the custom attribute "WAN" to filter).  Another attribute called "Monitor" is used to keep problem sites on an uptime list to keep an eye on them.  We use Netflow to dig into high-utilization sites.
                • Re: Do you gather evidence on a service provider to keep them accountable?
                  Nonapeptide

                  We have a 550-site MPLS network that is fully managed.  We have a 4-hour SLA with them.  They will do everything they can to put a site on "customer time" to turn off the SLA clock.  We use the user-contributed "Node Down Time" and event logs to watch our routers (using the custom attribute "WAN" to filter).  Another attribute called "Monitor" is used to keep problem sites on an uptime list to keep an eye on them.  We use Netflow to dig into high-utilization sites.
                  Wow, that sounds like a huge and expensive network, so it really would be a power play between service provider and you, the customer, to pass the ticking clock back and forth.

                  Does the service provider generally accept your data when handling an SLA dispute? Or do they ever bring their own service metrics to the table and compare the two, preferring their own data to yours?

                    • Re: Do you gather evidence on a service provider to keep them accountable?
                      smartd

                      Fighting over a latency SLA is tough.  We usually end up using iPerf with the carrier tech folks to prove service issues.  This is not a regular issue.  The common issue is a circuit down lasting for over 4 hours.  They will typically say that power needs to be verified and put the ticket on "customer time".  That means that we need to have a secondary contact for every site that is available 7 x 24 so that they cannot say that they couldn't reach the point of contact.  It is then our problem to determine when the site opens, and to make sure the site verifies power.  For big/important sites, one of our technicians may be dispatched to expedite resolution.  We accept "customer time" until the site is opened and the site verifies power.  Then we contact the carrier to get the ticket active again.
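
                      To make the "customer time" effect concrete, here is a rough sketch of how an SLA clock might be computed when hold intervals are excluded. The function name and the hold-interval format are assumptions for illustration, not the carrier's actual ticketing logic, and the holds are assumed not to overlap each other.

```python
from datetime import datetime, timedelta


def carrier_time(opened, closed, customer_holds):
    """Elapsed ticket time minus the intervals spent on 'customer time'.

    customer_holds: list of (start, end) datetimes, assumed non-overlapping.
    """
    total = closed - opened
    held = timedelta()
    for start, end in customer_holds:
        # Clip each hold to the ticket window before counting it.
        start = max(start, opened)
        end = min(end, closed)
        if end > start:
            held += end - start
    return total - held
```

                      For a ticket opened at 22:00 and closed at 09:00 the next day, a hold from 22:30 to 07:30 leaves only 2 hours on the carrier's clock, so an 11-hour outage stays inside a 4-hour SLA.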

                      At 8:30am every morning we have a NOC meeting, reviewing Orion for down sites using Gob's "How Long a Node was Down".  His resource shows us if a site is bouncing... using the "Count" value and the total minutes of outage.  We verify that there is a carrier ticket for every site that is down, and the status of the ticket.  We will often send a copy of the latency/packet loss graph as backup if we find a discrepancy.
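
                      As an illustration of the two numbers we look at (outage count and total minutes down), here is a small sketch that derives them from a list of up/down state changes. This is not the logic of the actual Orion resource; the event format and function name are assumptions.

```python
from datetime import datetime


def down_summary(events, end):
    """events: time-ordered (timestamp, is_up) state changes for one node.

    Returns (number of outages, total minutes down) up to `end`.
    """
    count = 0
    minutes = 0.0
    down_since = None
    for ts, is_up in events:
        if not is_up and down_since is None:
            # Node just went down: start a new outage.
            down_since = ts
            count += 1
        elif is_up and down_since is not None:
            # Node came back: close the current outage.
            minutes += (ts - down_since).total_seconds() / 60
            down_since = None
    if down_since is not None:
        # Still down at the end of the reporting window.
        minutes += (end - down_since).total_seconds() / 60
    return count, minutes
```

                      A high count with few total minutes is the signature of a bouncing site, while a count of one with many minutes is a hard-down circuit.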

                      We also use the daily CA Concord eHealth reports that our carrier makes available.  These do a better job of recording and highlighting ongoing error issues than Orion does.  This is an area where NPM could improve.  We use them for looking at chronic issues and opening a chronic ticket with the carrier.

                      Just so you understand how important this is, our last SLA check was for over $10,000, not to mention getting sites up quicker.

                      Oh, we won over the carrier on our data when Gob's resource showed a site bouncing multiple times.  The carrier pings every 15 minutes and doesn't collect SNMP interface down traps or syslog.  When they look at the router logs, our stats are confirmed.  So downtime and bouncing are believed.  Performance SLA is tougher.
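
                      A simplified sketch of why a 15-minute ping interval misses bounces: an outage is only seen if at least one poll lands inside it. Times are whole minutes from the start of the window, polls are assumed to start at minute 0, and the function name is made up for illustration.

```python
def detected_by_polling(outages, poll_interval, window):
    """outages: list of (start, end) in minutes from the window start.

    Returns the outages during which at least one poll occurred.
    """
    polls = range(0, window, poll_interval)
    return [(s, e) for s, e in outages if any(s <= t < e for t in polls)]


# Three outages in one hour: two short bounces and one longer drop.
outages = [(2, 6), (20, 24), (40, 50)]
seen_at_15 = detected_by_polling(outages, poll_interval=15, window=60)
seen_at_1 = detected_by_polling(outages, poll_interval=1, window=60)
```

                      Here the 15-minute poller sees only the 10-minute drop, while one-minute polling (or interface traps and syslog) catches all three.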

                        • Re: Do you gather evidence on a service provider to keep them accountable?
                          Nonapeptide

                          The common issue is a circuit down lasting for over 4 hours.  They will typically say that power needs to be verified and put the ticket on "customer time".

                          Tricksy, Very tricksy. =)

                          It is then our problem to determine when the site opens, and to make sure the site verifies power.

                          When you say "verifies power" do you simply mean that the CPE is powered up? Like your CSU/DSU?

                          The carrier pings every 15 minutes and doesn't collect SNMP interface down traps or syslog.

                          I stared at the screen for a few moments before reading that again. That's just... I don't even...

                          Glad they seem to be won over by your due diligence. And yes, I can see how performance might be a more troublesome issue. It's easier to determine whose equipment is down or up. Not so easy to pin the blame on poor performance.

                            • Re: Do you gather evidence on a service provider to keep them accountable?
                              smartd


                              When you say "verifies power" do you simply mean that the CPE is powered up? Like your CSU/DSU?

                              Yes.  The carrier requires an out-of-band modem that their system "S.M.A.R.T.S" will automatically contact when the router quits responding to pings.  If they cannot gather statistics automatically out-of-band, they blame power.  They also blame power if the SmartJack or CSU won't loop.

                              On the SNMP traps.  They say they DO use them, but history has shown that this isn't the case.  Their system relies on ICMP polling.

                                • Re: Do you gather evidence on a service provider to keep them accountable?
                                  mdriskell

                                  Yes.  Carrier requires an out of band modem that their system "S.M.A.R.T.S" will automatically contact when router quits responding to pings.  If they cannot gather statistics automatically out-of-band, they blame power.  Also blame power if smartJack or CSU won't loop.

                                  My favorite is when they tell you they can test clean to the CSU, and then I talk to the site and find out the SmartJack is dead (no lights). What exactly were you looping to?

                                  I fought with carriers for over a decade, and I have pulled metrics and statistics to back up my statement that over the course of two years 95% of our circuit outages were carrier related. That being said, over 40% of those were pushed back on us by the carrier, and we had to prove it wasn't our equipment.  I have had many conversations with various account reps from Verizon, AT&T, Sprint, etc. and they all say the same thing... it's not just you.

                                  God forbid it is not a layer 1 issue.  You can speak person to person and tell someone "my T1 is physically up but I'm not getting LMI"... and 20 minutes later your ticket goes through an autotester and is closed because they report the T1 is up.  I reported the T1 is up. We already knew that. We aren't passing traffic.

                                  I have had to create many reports over the years to try and recoup SLA money for outages and it's always a challenge.