Product Blog

3 Posts authored by: jreves Employee

Woes of Flow

A poem for Joe

 

It uncovers source and destination

without hesitation.

Both port and address

to troubleshoot they will clearly assess.

Beware the bytes and packets

bundled in quintuplet jackets,

for they are accompanied by a wild hog

that will drown your network in a bog.

The hero boldly proclaims thrice,

sampling is not sacrifice!

He brings data to fight

but progress is slow in this plight.

 

Mav Turner

 

As network operators, one of the most common—and important—troubleshooting tasks revolves around tracking down bandwidth hogs consuming capacity in our network infrastructure. We have a wealth of data at our fingertips to accomplish this, but it’s sometimes challenging to reconcile into a clear picture.

 

Troubleshooting high utilization usually begins with an alert for exceeding a threshold. In the Orion Platform’s alerting facility, there are several conditions we can set up to identify these thresholds for action. The classic—and simple—approach is to set a threshold for utilization defined as a percentage of the available capacity. The Orion Platform also supports baselining utilization in a trailing window and setting adaptive thresholds. Next, you need to investigate to determine what’s driving utilization and decide what action to take.

 

Usually, the culprit is a particular application generating an unusual level of traffic. We can get some insights into application traffic volumes from a NetFlow analyzer tool like NetFlow Traffic Analyzer.

 

So, why don’t the volume measurements match exactly from these two sources of data? Aren’t interface utilization values the same as traffic volume data from NetFlow?

 

Let’s review the metrics we’re working with, and how this data comes to us.

 

Interface capacity, the rate at which we can move data through an interface, is modeled as an object in SNMP, and we pick that up from each interface as part of the discovery and import process into Network Performance Monitor (NPM). It can be overridden manually, because some agents don’t populate that object in SNMP correctly.

 

Interface utilization is calculated from the difference in total data sent and received between polls, divided by the time interval between polls. The chipset provides a count of octets transmitted or received through the interface, and this value is exposed through SNMP. The Orion Platform polls it, then normalizes it to a rate in the same units in which the interface speed is expressed, usually bits per second.
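The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Orion Platform's implementation: the function name, the sample counter values, and the choice of a 64-bit counter (as with the IF-MIB ifHCInOctets object) are assumptions made for the example.

```python
COUNTER64_MODULUS = 2 ** 64  # assumption: a 64-bit octet counter such as ifHCInOctets


def utilization_percent(prev_octets, curr_octets, interval_seconds, if_speed_bps,
                        modulus=COUNTER64_MODULUS):
    """Return average utilization (percent of capacity) between two polls."""
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    # The modulo keeps the delta correct if the counter wrapped once between polls.
    delta_octets = (curr_octets - prev_octets) % modulus
    # Octets become bits, then a rate over the polling interval.
    bits_per_second = delta_octets * 8 / interval_seconds
    return 100.0 * bits_per_second / if_speed_bps


# Hypothetical polls taken five minutes apart on a 1 Gbps interface.
print(utilization_percent(1_000_000_000, 13_000_000_000, 300, 1_000_000_000))
```

Note that the result is an average over the whole polling interval, so short bursts inside the interval are smoothed out.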

 

SNMP Polled Utilization

 

The metrics reported by SNMP about data received or sent through the interface include all traffic: layer 2 traffic that isn’t propagated beyond a router, as well as application traffic that is routed. Some of the data that flows through the interface isn’t application traffic. Examples include address resolution protocol traffic, some link-layer discovery protocols, some link-layer authentication protocols, some encapsulation protocols, some routing protocols, and some control/signaling protocols.

 

For a breakdown of application traffic, we look to flow technologies like NetFlow. Flow export and flow sampling technologies are normalized into a common flow record, which is populated with network and transport layer data. Basic NetFlow records include ICMP traffic, as well as TCP and UDP traffic. While it’s possible on some platforms to enable an extended template that includes metrics on layer 2 protocols, this is not the default behavior for NetFlow, or any of the other flow export protocols.

 

Top N Applications traffic volumes

 

The sFlow protocol takes samples from layer 2 frames, and forwards those. So, while it’s possible to parse out layer 2 protocols from sFlow sample packets, we generally normalize sFlow along with the flow export protocols to capture ICMP, TCP, and UDP traffic, and discard the layer 2 headers.

 

When we work with flow data, we’re focusing on the traffic that is generally most variable and represents the applications that most often drive the high utilization we’re investigating. But in terms of the volumes represented, flow technologies examine only a subset of the total utilization we see through SNMP polled values.

 

SNMP Polled versus application flow volumes

 

An additional consideration is timing. SNMP polling and NetFlow exports work on independent schedules and are not synchronized. For example, we may poll using SNMP every five minutes and average the rate of bandwidth utilization over that entire period. In the meantime, we may have NetFlow exports from our devices configured to send every minute, or we may be using sFlow and continuously receiving samples. Looking at the same one-minute period, we may see very different values for interface utilization and for the application traffic that is likely the main driver of our high utilization.
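To make the timing effect concrete, here is a small Python sketch with made-up numbers. The per-minute rates are hypothetical; the point is only that a five-minute average can look very different from any single one-minute flow interval.

```python
def average_rate(rates_mbps):
    """Average a list of per-interval rates (all intervals equal length)."""
    return sum(rates_mbps) / len(rates_mbps)


# Hypothetical per-minute application rates reported by flow export (Mbps).
per_minute_flow_mbps = [100, 100, 900, 100, 100]

# What a single five-minute SNMP poll would average the same traffic to.
five_minute_average = average_rate(per_minute_flow_mbps)
peak_minute = max(per_minute_flow_mbps)

print(f"five-minute average: {five_minute_average} Mbps")
print(f"busiest minute:      {peak_minute} Mbps")
```

In this example the burst in the third minute dominates the flow view for that minute, while the polled average over the full interval is much lower.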

 

SNMP Polling and flow export over time intervals

 

If we’re using sFlow exclusively, our accuracy can be mathematically quantified. The accuracy of randomly sampled data—sFlow, or sampled NetFlow—depends solely on the number of samples arriving over a specific interval. For example, a sample arrival rate of ~1/sec for a 10G interface running at 35% utilization and sampling at 1:10000 yields an accuracy of +/-3.91% for one minute at a 98% confidence interval. That accuracy increases as utilization grows or over time as we receive a larger volume of samples. You can explore this in more detail using the sFlow Traffic Characterization Worksheet, available here: https://thwack.solarwinds.com/docs/DOC-203350
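As a rough illustration of how sample count drives accuracy, the sketch below uses the relation commonly cited in sFlow packet sampling guidance, where the percentage error at 95% confidence is bounded by about 196 * sqrt(1/c) for c samples. Extending it to other confidence levels through the normal distribution is an assumption of this sketch, and it is not meant to reproduce the worksheet's exact figures.

```python
from math import sqrt
from statistics import NormalDist


def sampling_error_percent(sample_count, confidence=0.95):
    """Approximate +/- percentage error for an estimate built from random samples."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided z value for the requested confidence level (about 1.96 at 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 100.0 * z * sqrt(1.0 / sample_count)


# More samples, whether from higher utilization or a longer window, shrink the error.
for samples in (100, 1_000, 10_000):
    print(samples, round(sampling_error_percent(samples), 2))
```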

 

So, what’s the best way to think about the relationship between utilization and flow-reported application traffic?

 

  • Utilization is my leading indicator for interface capacity. This is the trigger for investigating bandwidth hogs.
  • Generally, utilization will alert me when there’s sustained traffic over my polling interval.
  • Application traffic volumes are almost always the driver for high utilization.
  • I should expect that the utilization metric and the application flow metrics will never be identical. The longer the time period, the closer they will track.
  • SNMP-based interface utilization provides the tools to answer the questions:
    • What is the capacity of the interface?
    • How much traffic is being sent or received over an interface?
    • How much of the capacity is being used?
  • Flow data provides the tools to answer the questions:
    • What application or applications?
    • How much, over what interval?
    • Where’s it coming from?
    • Where is it going?
    • What’s the trend over time?
    • How does this traffic compare to other applications?
    • How broadly am I seeing this application traffic in my network?

 

Where can I learn more about flow and utilization?

 

An Overview of Flow Technologies

https://www.youtube.com/watch?v=HJhQaMN1ddo

 

Visibility in the Data Center

https://thwack.solarwinds.com/community/thwackcamp-2018/visibility-in-the-data-center

 

Calculate interface bandwidth utilization

https://support.solarwinds.com/Success_Center/Network_Performance_Monitor_(NPM)/Knowledgebase_Articles/Calculate_interface_bandwidth_utilization

 

sFlow Traffic Characterization Worksheet

https://thwack.solarwinds.com/docs/DOC-203350

We’re delighted to announce the release of version 4.5 of NetFlow Traffic Analyzer (NTA)!

 

The latest release of SolarWinds® NetFlow Traffic Analyzer is designed to help create alerts based on application flows. In past releases, we could alert on the overall utilization of an interface and provide a view of the top talkers when the configured threshold was exceeded. In this release, you can set a threshold on the volume of a specific application in order to trigger an alert. We're making use of the Orion Platform alerting framework, so that flexibility is available to you.

 

You’ve outlined a small set of critical problems in multiple requests, and in this release, we’re delivering on the five most popular of these.

 

  • Application traffic exceeds a threshold – Alert triggers when the observed rate of a specific application exceeds a user-defined threshold
  • Application traffic falls below a threshold – Alert that can provide visibility when an application “goes off the air” and stops communicating
  • Application traffic appears in the “TopN” list of applications – This alert triggers when application traffic increases suddenly relative to other applications
  • Application traffic drops from the “TopN” list of applications – Likewise, alert triggers for a sudden reduction relative to other applications
  • Flow data stops from a configured flow source – Alerts on the loss of flow instrumentation, and prompts to take action to help restore visibility
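The two TopN conditions in the list above amount to comparing ranked application lists between evaluation periods. The Python sketch below shows the idea only; the function names and traffic volumes are hypothetical and do not reflect how NTA evaluates its alerts internally.

```python
def top_n(volumes, n):
    """Return the set of the n applications with the largest volumes."""
    ranked = sorted(volumes, key=volumes.get, reverse=True)
    return set(ranked[:n])


def top_n_changes(previous, current, n):
    """Return (appeared, dropped) application sets between two periods."""
    before, after = top_n(previous, n), top_n(current, n)
    return after - before, before - after


# Hypothetical byte volumes per application for two consecutive periods.
previous = {"https": 900, "smb": 300, "dns": 50, "backup": 20}
current = {"https": 850, "smb": 280, "dns": 40, "backup": 700}

appeared, dropped = top_n_changes(previous, current, n=2)
print("appeared in TopN:", appeared)
print("dropped from TopN:", dropped)
```

Because the comparison is relative, an application can drop out of the TopN without its own volume changing much, simply because another application grew.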

 

Contextual Alerting

The approach we're using to create alerts is built to guide users into a particular context—a source of flow where we see the application traffic—and then offers a simple user experience to create the alert.

To create an alert based upon any of these triggers, we must first select a source of flow data as a point of reference. We can do this in one of two ways.

 

We can visit the NTA Summary Page, and navigate to a particular source of flow data:

 

 

If the application of interest is in the TopN, we can expand it to see where this application is visible and select that source. That will take us to a detail page, which is already filtered by both application and source of the flow data.

 

We can also select our source of flow data directly in the Flow Navigator. We can build our alert based upon a node that reports flow, or upon a specific interface:

 

 

Once we have a context for an alert, we can select an application. If we use the "TopN Applications" resource, we have already identified both the application and the node or interface where it's visible.

Another way to arrive at this context can make use of the Flow Navigator, where we can explicitly select the application we’re interested in:

 

 

 

We can select either Applications, or NBAR2 Applications, to help describe the traffic. With the context now fully described, we are able to open the "Create a Flow Alert" panel and create our first alert:

 

 

At the top of the panel, we'll see the source of the flow data that we'll evaluate, and a default alert name prefix. We can customize the alert name to help make searching simpler. The severity of the alert is configurable:

 

For the Trigger Condition, we'll select one of the options described above. In this case, we'll select "Application Traffic exceeds Threshold," and we'll set a threshold of 50MBps on the ingress. We'll evaluate the last five minutes of traffic; this is configurable. This threshold will trigger when our traffic rate averages greater than 50MBps over the five-minute period.

 

Finally, we can specify one or several protocols; if we specify more than one, we'll sum the traffic volumes for all the protocols.
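The evaluation described here, summing the selected protocols and comparing the average rate over the window with the threshold, can be sketched as follows. This is an illustration under stated assumptions: the function names and byte counts are hypothetical, and "50MBps" is read as 50 megabytes per second.

```python
def application_rate(bytes_by_protocol, window_seconds):
    """Sum byte counts across the selected protocols; return average bytes/second."""
    return sum(bytes_by_protocol.values()) / window_seconds


def exceeds_threshold(bytes_by_protocol, window_seconds, threshold_bytes_per_second):
    """True when the average rate over the window is above the threshold."""
    return application_rate(bytes_by_protocol, window_seconds) > threshold_bytes_per_second


window_seconds = 5 * 60          # evaluate the last five minutes
threshold = 50_000_000           # assumption: 50 MBps means 50 megabytes per second

# Hypothetical ingress byte counts for one application over the window.
observed = {"tcp": 12_000_000_000, "udp": 6_000_000_000}

print(application_rate(observed, window_seconds))        # average bytes per second
print(exceeds_threshold(observed, window_seconds, threshold))
```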

 

To create the alert, there are two options. We can select "Create Alert" immediately, and this will simply log the alert when it triggers. Or, we can check the box to open the alert in the Advanced Alert Editor and then select "Create Alert." Selecting this option will redirect us to the last step in the "Add New Alert" wizard, where we can modify the trigger actions, reset actions, or time of day schedule.

 

 

The trigger condition is an advanced SWQL query, pre-populated with the contextual information on the source and application.

 

Before submitting this new alert, we'll see a message indicating whether the alert will trigger immediately.

 

Practical Alert Scenarios

Use the "exceeds threshold" alert for application traffic levels that average above or below the specified threshold.

Use the ">" (greater than) or "<=" (less than or equal to) operator to determine whether the alert triggers above or below the threshold. For example:

  • To determine when backup application traffic is running out of schedule
  • To identify large file transfers in the middle of the day
  • To identify DDoS attacks, or when Port 0 traffic is present at all

Use the “exceeds threshold” alert with the "<=" operator to help detect when an application server process goes offline and stops sending traffic. For example:

  • The application service may have crashed
  • An intermediate connectivity problem (firewall or outage) may have reduced traffic

Alerts related to applications appearing in, or dropping out of, the TopN can be useful for detecting sudden changes in traffic volume relative to other applications. Examples include:

  • Detecting streaming or peer-to-peer file sharing applications that are transient
  • Detecting changes in the mix of applications that usually traverse an interface

 

You can also set up an alert for each of your NetFlow sources to help take action if the configuration is modified, or firewall rules block flow traffic.

 

User Experience Improvements

This release of NTA also includes a number of small but significant improvements in the user interface to help enhance scalability and improve ease of use. Several long lists are now uniformly ordered, and we’ve changed how we label certain features to be clearer in the navigation.

 

Additional Resources

Check out the Release Notes, download the new release on the Customer Portal, and get additional help with the upgrade at the Success Center.

 

You can see these new features in action in the webcast, “Up, Down, and Gone: A Tale of Applications and Flow.”

 

This is an initial introduction of the traffic alerting feature. Be sure to submit feature requests for any expanded functionality that you'd like to see with this capability!

 

jreves

NetFlow Traffic Analyzer

Faster. Leaner. More Secure.

 

The new NetFlow Traffic Analyzer leverages the power of columnstore technology in MS SQL Server to deliver answers to your flow analysis questions faster than ever before. MS SQL 2016 and later runs in a more efficient footprint than previous flow storage technologies, making better use of your infrastructure. Support for TLS 1.2 communication channels and monitoring of TCP and UDP Port 0 traffic helps to secure your environment.

 

Version 4.4 also introduces a new installation process to confirm that you have the necessary prerequisites, and to guide you through the installation and configuration process.

 

NTA 4.4 is now available in the Customer Portal. Check out the Release Notes for an overview of the features.

 

Faster

The latest release of NTA makes use of Microsoft’s latest version of their SQL columnstore based flow storage database. Columnstore databases organize and query data by column, rather than row index. They are the optimal technology for large-scale data warehouse repositories, like massive volumes of individual flow records. Our testing and our beta customer experiences indicate that columnstore indexes support substantial performance improvements in both querying data and data compression efficiency.

 

NTA was an early adopter of columnstore technology to enhance the performance of our flow storage database. As Microsoft’s columnstore solutions have matured, we’ve chosen to adopt the MS SQL 2016 and later versions as the supported flow storage technology. That offers our customers the ability to standardize on MS SQL across the Orion Platform, and to manage their monitoring data using a common set of tools with common expertise. We’ve made deployment and support simpler, more robust, and more performant.

 

Leaner

This same columnstore technology also runs more efficiently with the existing resource footprint. This solution builds and maintains columnstore indexes in memory, and then manages bulk record insertions with much less intensive I/O to the disk storage. CPU required to build indexes is also substantially less intensive than our previous versions. As a result, this version will make better use of the same resources to run more efficiently.

 

More Secure

This version of NTA supports TLS 1.2 communication channels, required in many environments to secure communications with client users.

 

Beginning in this version, NTA will explicitly monitor network flows that are destined to TCP or UDP service port 0. Traffic that’s addressed to TCP or UDP port 0 is either malformed or malicious. This port is reserved for internal use, and network traffic on the wire should never appear addressed to this port. By highlighting and tracking flows addressed to port 0, NTA helps network administrators identify sources of malicious traffic that may be attacking hosts in their network, and provides the information they need to shut that traffic down.

 

NTA will surface port 0 traffic as a distinct application, so the information is available in all application resources.

NTA Port 0 Traffic
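As a simple illustration of the port 0 check described above, the Python sketch below filters a list of flow records for TCP or UDP traffic addressed to port 0. The record layout, field names, and addresses are hypothetical and are not NTA's schema.

```python
TCP, UDP = 6, 17  # IANA-assigned IP protocol numbers


def port_zero_flows(flows):
    """Return the TCP or UDP flow records whose destination port is 0."""
    return [
        flow for flow in flows
        if flow["protocol"] in (TCP, UDP) and flow["dst_port"] == 0
    ]


# Hypothetical flow records using documentation address ranges.
flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "protocol": TCP, "dst_port": 443},
    {"src": "192.0.2.66", "dst": "198.51.100.5", "protocol": UDP, "dst_port": 0},
    {"src": "192.0.2.20", "dst": "198.51.100.9", "protocol": 1, "dst_port": 0},  # ICMP
]

for flow in port_zero_flows(flows):
    print(flow["src"], "->", flow["dst"], "protocol", flow["protocol"])
```

ICMP records are excluded here because they carry no transport port, so a zero in that field does not indicate port 0 traffic.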

Supported Database Configurations

This version of NTA maintains a separate database for Flow Storage. NPM also maintains the Orion database for device and interface data. Both of these databases are built in MS SQL instances.

 

New installations of NTA and upgrades to version 4.4 and later will require an instance of MS SQL 2016 Service Pack 1 or a later version for flow storage. For evaluation, the Express edition is supported. For production deployments, we support the Standard and Enterprise editions.

 

When upgrading to this version from an older version on the FastBit database, data migration is not supported. This upgrade will build out a new, empty database in the new MS SQL instance. The existing flow data in the FastBit database will not be deleted or modified in any way. That data can be archived for regulatory requirements, and customers can run older product versions in evaluation mode to temporarily access the data.

 

In the current NTA product, we require a separate dedicated server for Flow Storage. The simplest upgrade would use that dedicated server with the new release to install an instance of MS SQL 2016 SP1 or later for flow storage. Many of our customers will be interested in running both the Orion database and the NTA Flow Storage database in the same MS SQL instance. We support that, but for most customers that will take some planning to consolidate and to appropriately size that instance to support both databases.

 

Here's a more detailed discussion of NTA's New MS SQL Based Flow Storage Database. Also, a knowledge base article on NTA 4.4 Adoption is available, with frequently asked questions.

 

We’re doing some testing now to provide performance guidance on key performance indicators to monitor. One of the benefits of using MS SQL technology for both of these databases is that there are many common tools and techniques available to monitor and tune MS SQL databases. We plan to provide guidance for both monitoring and deployment planning.

 

Conclusion

Please visit the NetFlow Traffic Analyzer Forum on THWACK to discuss your experiences and new feature requests for NTA.
