Showing results for 
Search instead for 
Did you mean: 
Create Post

Monitoring and Troubleshooting Critical Apps on Saturated Circuits

Level 15

It's 2014 and simple bandwidth saturation continues to be one of the most common network problems.  Every time we make advances in data transit rates, we seem to find new ways to use more bandwidth.  Or is it the other way around?  QoS helps us use the bandwidth we have more effectively, but adds a layer of complexity.  Another interesting side effect of QoS in many organizations is that, by successfully protecting critical application traffic, QoS allows you to defer bandwidth upgrades.  By the numbers, keeping your circuits saturated absolutely makes business sense.  Who wants to pay for bandwidth you don't use?  The strategy is so effective that saturated circuits are becoming the new normal.

When bandwidth saturation is the normal state and QoS is responsible for protecting our critical traffic, how do we monitor and troubleshoot slowness?  We no longer have an available bandwidth buffer that we can monitor as an indication of the health of the circuit and the quality of the service we're providing to all of the transit traffic.  Saturated circuits require a different strategy.

Rummaging Through the Tool Box

Let's take a look in our tool box and see what makes sense to use with saturated circuits:

Ping - Ping falls in only one QoS category and it generally isn't the same category as your important traffic.  Accordingly, ping is not a very good indicator of health.

Bandwidth Utilization - High bandwidth utilization no longer means bad service.  In fact, arguably, high bandwidth utilization is exactly what you're trying to achieve.  This is also not a good indicator of health.

NTA - Knowing what different types of traffic make up your total traffic load continues to be important.  Knowing how much traffic is being sent or received within each QoS class and what types of traffic make up the contents of that QoS class are more important than ever.  Additionally, QoS queue monitoring provides a window into when QoS is making the decision to drop your traffic.

VNQM - VNQM uses IP SLA operations that can masquerade themselves as most other types of traffic.  You can monitoring how VOIP, SQL, HTTP, and other traffic are each uniquely serviced.

QoE - Where VNQM uses synthetic transactions (read: monitoring system generated) to constantly measure the service you're providing to a particular type of traffic, QoE watches your real transit traffic to see what kind of service you're actually providing your users.

Some of our older tools no longer have the fidelity and granularity we need to help.  Ping and simple bandwidth utilization are no longer good indicators of health.  We have to rely on more advanced testing tools like VNQM and QoE.  NTA remains our detailed view into what our traffic actually consists of, with a shift of importance to the QoS metrics.

It's interesting to me that QOE overlaps in some ways with NTA and in other ways with VNQM.  Like NTA, QOE can tell you about what traffic is traversing a specific link (using NPAS) or a specific device (SPAS).  NTA is much better at it than QOE though.  Like VNQM, QOE allows you to continually monitor the quality of service your network is providing but in a different way.  Where VNQM sends synthetic traffic and analyzes the results, QOE analyzes real end user traffic.  QOE's strong suit is providing visibility into the service level you're really providing to your end users and then breaking that down into network service time and application service time.

Now that we've reviewed our tools in this new light, let's look at how our monitoring and troubleshooting processes have changed.

Monitoring Saturated Circuits

Here's an example of how we traditionally monitored for slowness:

1) Ping across the circuit and watch for high latency.

2) Monitor interfaces and watch for high bandwidth utilization.

3) Setup flow exports so that when an issue occurs, you can understand the traffic.

When circuit saturation is expected or desirable, the steps change:

1a) Setup IP SLA operations to test network service at all times and/or

1b) Setup QoE to monitor real user traffic whenever it is occurring.

2) Setup flow exports so that when an issue occurs, you can understand the traffic.

One thing that is very important when all your circuits are full is to be able to ignore things that don't matter.  If bittorrent is running slow for your end users, do you care?  When all of your circuits are running hot, you need to be able to detect when critical applications are impacting your user's ability to do their job, and ignore the rest.

Troubleshooting Saturated Circuits

Here's an example of how we traditionally troubleshot network slowness:

1) Receive high latency alert, high bandwidth utilization alert, or user complaint.

2) Isolate slowness to a specific circuit by finding the high bandwidth interface in the transit path.

3) Analyse flow data to determine what new traffic is causing the congestion.

4) Find some way (nice or not nice) to stop the offending traffic.

When circuit saturation is the norm, troubleshooting looks more like this:

1) Receive alert for poor service from VNQM or QOE, or receive a user complaint.

2) Isolate slowness to a specific circuit by finding high bandwidth interfaces in the transit path and comparing which endpoints or IP SLA operations are experiencing the slowness.

3) Determine which QoS class to focus on by mapping the type of traffic affected to a specific QoS class.

4) Determine if the QoS class is being affected by competition from other traffic in the same class or in a higher priority class.

5) Analyze flow data within the offending QoS class to find the traffic causing the congestion.

6) Find some way (nice or not nice) to stop the offending traffic.

Your Thoughts?

Everyone has different tools in their toolbox and different environments necessitate different approaches.  What tools do you use to monitor and troubleshoot your saturated circuits?  What process works for your environment?

1 Comment
Level 15

great post cobrien.

I am very interested in hearing from the community about this in general (is this how you view the troubleshooting process) but also more specifically:

- How much time do you and your teams spend on this type of activity

- what tools do you commonly use to perform these activities

- if your network is used to connect your IT users to SaaS and IaaS-delivered applications, what does this change for you? What type of visibility do you need for the portion of the network that you don't own (the portion closer to the IaaS and SaaS providers)

- How could Orion products better help you with theses (think new features, UI and navigation optimized for troubleshooting, missing device support)

- how many of you are using QoS on your network. How often is QoS involved in trouble (e.g. slowness)? What apps are most commonly supported by QoS (VoIP? Others?)

About the Author
Lifelong technology enthusiast. Network Engineer turned Product Manager for network products. By geeks, for geeks! I started my career as a call center agent at a wireless ISP. I moved into the Network Operations Center to operationally support their network. I moved to another company to be a Network Engineer, and fulfilled that role at several different companies in different verticals including Healthcare, Software, and Finance. Eventually, I found my calling as a PM, where I work with all of the functions of a business, and particularly Development, to determine what to build next to deliver the most value to our customers.