Woes of Flow
A poem for Joe
It uncovers source and destination
Both port and address
to troubleshoot they will clearly assess.
Beware the bytes and packets
bundled in quintuplet jackets,
for they are accompanied by a wild hog
that will drown your network in a bog.
The hero boldly proclaims thrice,
sampling is not sacrifice!
He brings data to fight
but progress is slow in this plight.
As network operators, we find that one of the most common, and most important, troubleshooting tasks is tracking down the bandwidth hogs consuming capacity in our network infrastructure. We have a wealth of data at our fingertips to accomplish this, but it’s sometimes challenging to reconcile that data into a clear picture.
Troubleshooting high utilization usually begins with an alert that a threshold has been exceeded. In the Orion Platform’s alerting facility, there are several conditions we can set up to define these thresholds. The classic, and simple, approach is to set a utilization threshold as a percentage of the available capacity. The Orion Platform also supports baselining utilization over a trailing window and setting adaptive thresholds. Once an alert fires, we need to investigate what’s driving utilization and decide what action to take.
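To make the difference between the two threshold styles concrete, here is a minimal sketch. It is not the Orion Platform’s algorithm: the function names, the 80% default, and the "mean plus k standard deviations" rule are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def exceeds_static(util_pct, threshold_pct=80.0):
    """Classic check: utilization above a fixed percentage of capacity."""
    return util_pct > threshold_pct

def exceeds_adaptive(util_pct, trailing_window, k=3.0):
    """Baseline check (assumed rule): utilization more than k standard
    deviations above the mean of a trailing window of recent samples.
    The window needs at least two samples."""
    baseline = mean(trailing_window)
    spread = stdev(trailing_window)
    return util_pct > baseline + k * spread

# Hypothetical trailing window of utilization samples, in percent.
recent = [20, 22, 19, 21, 23, 20, 21]

print(exceeds_static(45.0))            # False: 45% is under a fixed 80% threshold
print(exceeds_adaptive(45.0, recent))  # True: 45% is far above this link's baseline
```

A link that normally idles around 20% never trips the fixed threshold at 45%, but the baseline check flags it as unusual for that interface.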
Usually, the culprit is a particular application generating an unusual level of traffic. We can get some insights into application traffic volumes from a NetFlow analyzer tool like NetFlow Traffic Analyzer.
So, why don’t the volume measurements match exactly from these two sources of data? Aren’t interface utilization values the same as traffic volume data from NetFlow?
Let’s review the metrics we’re working with, and how this data comes to us.
Interface capacity—the rate at which we can move data through an interface—is modeled as an object in SNMP, and we pick that up from each interface as part of the discovery and import process into Network Performance Monitor network monitoring software. It can be overridden manually; some agents don’t populate that object in SNMP correctly.
Interface utilization is calculated from the difference in total data sent and received between polls, divided by the time interval between polls. The chipset provides a count of octets transmitted or received through the interface, and this value is exposed through SNMP. The Orion Platform polls it, then normalizes it to the same units in which the interface speed is expressed, usually bits per second. Utilization is that rate taken as a percentage of the interface capacity.
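The arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Orion Platform’s implementation: the function name and parameters are hypothetical, and the wrap handling assumes the counter rolled over at most once between polls.

```python
def utilization_percent(octets_prev, octets_now, seconds, if_speed_bps, counter_bits=64):
    """Percent utilization from two octet-counter readings taken `seconds` apart.

    counter_bits is 64 for high-capacity counters (e.g. ifHCInOctets)
    or 32 for the older counters (e.g. ifInOctets).
    """
    if seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("seconds and if_speed_bps must be positive")
    delta = octets_now - octets_prev
    if delta < 0:
        # The counter wrapped between polls (assumes a single wrap).
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / seconds
    return 100.0 * bits_per_second / if_speed_bps

# Hypothetical five-minute poll on a 1 Gbps interface.
print(utilization_percent(0, 13_125_000_000, 300, 1_000_000_000))  # 35.0
```

Note that the result is an average over the whole polling interval, so a short burst inside that interval is smoothed out.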
The metrics reported by SNMP about data received or sent through the interface include all traffic: layer 2 traffic that isn’t propagated beyond a router, as well as application traffic that is routed. Some of the data that flows through the interface isn’t application traffic. Examples include address resolution protocol traffic, some link-layer discovery protocols, some link-layer authentication protocols, some encapsulation protocols, some routing protocols, and some control/signaling protocols.
For a breakdown of application traffic, we look to flow technologies like NetFlow. Flow export and flow sampling technologies are normalized into a common flow record, which is populated with network and transport layer data. Basic NetFlow records include ICMP traffic, as well as TCP and UDP traffic. While it’s possible on some platforms to enable an extended template that includes metrics on layer 2 protocols, this is not the default behavior for NetFlow, or any of the other flow export protocols.
The sFlow protocol takes samples from layer 2 frames, and forwards those. So, while it’s possible to parse out layer 2 protocols from sFlow sample packets, we generally normalize sFlow along with the flow export protocols to capture ICMP, TCP, and UDP traffic, and discard the layer 2 headers.
When we work with flow data, we’re focusing on the traffic that is generally most variable and that represents the applications most often driving the high utilization we’re investigating. In terms of volume, though, flow technologies examine only a subset of the total utilization we see through SNMP-polled values.
An additional consideration is timing. SNMP polling and NetFlow exports run on independent schedules and are not synchronized. For example, we may poll using SNMP every five minutes and average the rate of bandwidth utilization over that entire period. In the meantime, we may have NetFlow exports from our devices configured to send every minute, or we may be using sFlow and continuously receiving samples. Looking at the same one-minute period, we may see very different values for interface utilization and for the application traffic that is likely the main driver of our high utilization.
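A small numeric sketch shows how the two views can disagree. The per-minute rates below are made up for illustration, and the helper name is hypothetical.

```python
def average_rate(rates_mbps):
    """Average rate over the whole window, as a single five-minute poll would report it."""
    return sum(rates_mbps) / len(rates_mbps)

# Hypothetical one-minute rates (Mbps) derived from flow exports on one interface.
one_minute_mbps = [120, 150, 900, 140, 130]

five_minute_view = average_rate(one_minute_mbps)  # what one five-minute SNMP poll sees
peak_minute = max(one_minute_mbps)                # what one-minute flow data can show

print(f"five-minute average: {five_minute_view:.0f} Mbps")  # 288 Mbps
print(f"busiest minute:      {peak_minute} Mbps")           # 900 Mbps
```

The same five minutes look like a moderate 288 Mbps in the polled average and a 900 Mbps spike in the one-minute flow data, which is why the two sources rarely agree at any single instant.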
If we’re using sFlow exclusively, our accuracy can be mathematically quantified. The accuracy of randomly sampled data—sFlow, or sampled NetFlow—depends solely on the number of samples arriving over a specific interval. For example, a sample arrival rate of ~1/sec for a 10G interface running at 35% utilization and sampling at 1:10000 yields an accuracy of +/-3.91% for one minute at a 98% confidence interval. That accuracy increases as utilization grows or over time as we receive a larger volume of samples. You can explore this in more detail using the sFlow Traffic Characterization Worksheet, available here: https://thwack.solarwinds.com/docs/DOC-203350
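As a rough illustration of how sample count drives accuracy, the sketch below uses the rule of thumb published by sFlow.org for random packet sampling: at 95% confidence, the percentage error is at most 196 divided by the square root of the number of samples in the traffic class of interest. This is not the worksheet’s calculation. The worksheet may use a different confidence level and other inputs, so these numbers will not necessarily reproduce the figure quoted above. The function names are hypothetical.

```python
import math

def sampling_error_percent(samples):
    """Upper bound on percent error at 95% confidence for `samples` samples
    (sFlow.org rule of thumb: 196 * sqrt(1 / samples))."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return 196.0 / math.sqrt(samples)

def samples_needed(target_error_percent):
    """Smallest sample count that brings the bound down to the target error."""
    if target_error_percent <= 0:
        raise ValueError("target_error_percent must be positive")
    return math.ceil((196.0 / target_error_percent) ** 2)

print(round(sampling_error_percent(100), 2))   # 19.6
print(round(sampling_error_percent(2500), 2))  # 3.92
print(samples_needed(5))                       # 1537
```

Because the bound depends only on the sample count, a busier interface or a longer observation window both tighten the estimate simply by delivering more samples.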
So, what’s the best way to think about the relationship between utilization and flow-reported application traffic?
- Utilization is my leading indicator for interface capacity. This is the trigger for investigating bandwidth hogs.
- Generally, utilization will alert me when there’s sustained traffic over my polling interval.
- Application traffic volumes are almost always the driver for high utilization.
- I should expect that the utilization metric and the application flow metrics will never be identical. The longer the time period, the closer they will track.
- SNMP-based interface utilization provides the tools to answer the questions:
  - What is the capacity of the interface?
  - How much traffic is being sent or received over an interface?
  - How much of the capacity is being used?
- Flow data provides the tools to answer the questions:
  - What application or applications?
  - How much, over what interval?
  - Where’s it coming from?
  - Where is it going?
  - What’s the trend over time?
  - How does this traffic compare to other applications?
  - How broadly am I seeing this application traffic in my network?
Where can I learn more about flow and utilization?
An Overview of Flow Technologies
Visibility in the Data Center
Calculate interface bandwidth utilization
sFlow Traffic Characterization Worksheet