Orion Advanced Traffic Analysis (aka DPI, NBAR, Flexible Netflow, Wireshark...)

We have been keeping an eye on the discussions here asking for improvement in terms of network traffic analysis, beyond what MIB 2 (Interface traffic) and Flow technology (NetFlow, sFlow, jFlow...) have to offer.

We also recently saw that NBAR support was voted the #1 enhancement request for NTA - the Network Traffic Analyzer (see more about SolarWinds Network Traffic Analyzer here).
As we were thinking about this, we wondered whether the requirement was basically for NBAR - period - or whether this was the sign of a larger requirement for better tools to analyze your traffic.
For example, we ran a Traffic Analysis survey asking whether you used DPI-type solutions, and we discovered that a large proportion of you were using Wireshark (not really a surprise, but a good confirmation) or some other form of DPI product.
[Chart: DPI solutions used by SolarWinds customers]
Also, I recently talked to some of you and confirmed that a significant proportion (actually higher than what the above chart suggests) either had a Deep Packet Inspection (DPI) solution in place (Riverbed Cascade/Opnet, Lancope, ...) or had one in their budget for this year.
So all this fueled a lot of thoughts and raised some questions on our side, on which we'd really appreciate your comments and answers:
  • Does Wireshark meet most of your needs?
  • Do you need Advanced Traffic Analysis for environments other than Cisco, where solutions like NBAR may not be available? In other words, do you need a vendor-agnostic solution?
  • All of you who own a DPI-type solution mentioned its (prohibitive) cost as an issue. We are convinced that you don't need to be an enterprise able to spend $200K or more on these solutions to really need them. Small and medium-size businesses also encounter traffic challenges that require deep analysis.
    Is it time for SolarWinds to commoditize that market and offer 80% of those features for 20% of the cost?

Before I let you go express your thoughts and describe your experience, here is a bit more on how we think about this problem.

What are the needs for Advanced Traffic Analysis?

We see mainly 4:

Break down my traffic by application and user, like Netflow does ... but better

As convenient as Netflow technology is, it actually does a pretty average job at identifying your applications (unless you deploy Flexible Netflow):
    • Limited to static ports. Any app using dynamic ports will be invisible to (Net)flow technology
    • Ignores that many applications use port 80 to get through firewalls and are actually NOT HTTP / Web applications
    • Is unable to reliably identify true Web applications, because (Net)flow neither inspects the HTTP header nor does URL extraction. Also, content sites such as YouTube.com (owned by Google) are identified by the Content Distribution Network they use as opposed to the Web site they really are (e.g. 1e100.net for YouTube.com...).
      See this typical question we have, pointing to this great post explaining why (Net)flow is really not the ultimate weapon and why Flexible Netflow is not the panacea either.
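
To make the limitations above concrete, here is a minimal sketch contrasting a port-table lookup (all a flow record allows) with a lookup that also reads the HTTP request header. The port table, the function names and the "non-http" label are assumptions made for this illustration; they do not describe how NTA, NBAR or any DPI product actually classifies traffic.

```python
# Illustrative only: the port table and labels are assumptions for this sketch.
WELL_KNOWN_PORTS = {25: "smtp", 53: "dns", 80: "http", 443: "https"}


def classify_by_port(dst_port):
    """Flow-style classification: only the destination port is known."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")


def classify_by_payload(dst_port, payload):
    """DPI-style classification: read the HTTP request header when present."""
    text = payload.decode("latin-1")
    request_line, _, rest = text.partition("\r\n")
    # An HTTP request line ends with the protocol version, e.g. "HTTP/1.1".
    if request_line.split(" ")[-1].startswith("HTTP/"):
        for header in rest.split("\r\n"):
            name, _, value = header.partition(":")
            if name.strip().lower() == "host":
                return "http:" + value.strip().lower()
        return "http"
    guess = classify_by_port(dst_port)
    # Port 80 traffic that is not HTTP is exactly what a port table gets wrong.
    return "non-http" if guess == "http" else guess


request = b"GET /watch HTTP/1.1\r\nHost: www.youtube.com\r\n\r\n"
print(classify_by_port(80))              # port table alone
print(classify_by_payload(80, request))  # header-aware lookup
```

The port lookup can only ever answer "http" for both a real Web request and a tunneled application, and "unknown" for anything on a dynamic port; the payload lookup needs the packets themselves, which flow export does not carry.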

I need an aggregated view of the Quality of Experience rendered by my applications to my users

In a perfect world, users experiencing slowness in their application will open a ticket or send an email, and IT will know about it. But the world is not perfect, and how many IT engineers have discovered the hard way, thrown under the bus by an email from a user to the CIO, that the QoE offered by IT is actually not that great, despite what they thought?

Of course there are solutions to simulate users connecting across the network (e.g. SolarWinds VNQM, based on Cisco's IP SLA technology) as well as to simulate the details of their transactions on their mission critical applications (e.g. SolarWinds WPM), but those are based on simulations and do not reflect the true experience of users.

Wouldn't it be great to have a dashboard that looks at the traffic of all your users connecting to their applications, calculates the latency they REALLY experience, and reports it in near real-time, or on a daily, weekly, monthly or quarterly basis?
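
One way such a latency figure could be derived from real traffic is to time the TCP handshake: the gap between a client's SYN and the server's SYN/ACK, as seen at the capture point. The sketch below assumes packets have already been decoded into simple tuples; the tuple layout and the function names are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from statistics import median


def handshake_latencies(packets):
    """Return {(server_ip, server_port): [seconds, ...]} from SYN -> SYN/ACK gaps.

    Each packet is a hypothetical tuple:
    (timestamp, src_ip, src_port, dst_ip, dst_port, flags)
    where flags is a set such as {"SYN"} or {"SYN", "ACK"}.
    """
    pending = {}  # (client_ip, client_port, server_ip, server_port) -> SYN time
    samples = defaultdict(list)
    for ts, src, sport, dst, dport, flags in packets:
        if flags == {"SYN"}:
            # Keep the first SYN so a retransmitted SYN counts as added delay.
            pending.setdefault((src, sport, dst, dport), ts)
        elif flags == {"SYN", "ACK"}:
            # The SYN/ACK travels server -> client, so reverse the key.
            syn_ts = pending.pop((dst, dport, src, sport), None)
            if syn_ts is not None:
                samples[(src, sport)].append(ts - syn_ts)
    return samples


def summarize(samples):
    """Median handshake latency in milliseconds per server endpoint."""
    return {server: round(median(values) * 1000, 1)
            for server, values in samples.items()}


capture = [
    (0.000, "10.0.0.5", 50000, "10.1.1.1", 443, {"SYN"}),
    (0.020, "10.1.1.1", 443, "10.0.0.5", 50000, {"SYN", "ACK"}),
    (1.000, "10.0.0.6", 50001, "10.1.1.1", 443, {"SYN"}),
    (1.040, "10.1.1.1", 443, "10.0.0.6", 50001, {"SYN", "ACK"}),
]
print(summarize(handshake_latencies(capture)))
```

Note that this only measures the round trip between the capture point and the server; application response time would need the request and response payloads to be timed as well.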

I need help troubleshooting slow access to applications

Your users are complaining about slowness when accessing some application? (or a QoE dashboard, as presented above, is keeping you informed about that)

The first thing people do is use the basic tools they have to try to troubleshoot:

    • Look at saturated interfaces that could explain the slowness. Then go to (Net)flow information to understand the nature of this traffic and see what non mission-critical traffic could be removed to avoid saturation going forward. The problem with this is that a) it's not always easy to identify all interfaces on the path from the impacted user(s) to the application and b) excess traffic is not always the cause of the slowness
    • Look at the devices along the path - from the switch the user(s) is (are) connected to, up to the server this application resides on, via all WAN routers - and see if any of them is in poor health that could explain the slowness (CPU, memory, IO...)
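
As a reminder of what "saturated" means in the first step, interface utilization is usually derived from two polls of an octet counter (for example ifInOctets or ifHCInOctets from the interface MIB) and the interface speed. The helper below is a minimal sketch; its name and parameters are made up for this example.

```python
def utilization_percent(octets_then, octets_now, interval_s, if_speed_bps,
                        counter_bits=64):
    """Percent utilization of one direction of an interface between two polls."""
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    delta = octets_now - octets_then
    if delta < 0:
        # Assume the counter wrapped once between polls (typical of 32-bit counters).
        delta += 2 ** counter_bits
    # Octets -> bits, divided by what the link could carry in the interval.
    return delta * 8 / (interval_s * if_speed_bps) * 100


# 75,000,000 octets in 60 seconds on a 10 Mbit/s link
print(utilization_percent(0, 75_000_000, 60, 10_000_000))
```

A value near 100 on one hop only says that the link is full; it says nothing about which application or transaction is responsible.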

Decent start, but what if this does not answer the question?

What if it's a misconfiguration? What if it's one particular transaction of this application, out of dozens, that is slow? How do you figure out which one? What if the slow application is actually spread across multiple tiers (application server, database, storage...) and what you thought was a "simple" analysis of the path between the user and the application server happens to be more complicated and involves several back-end servers?

All these more complex but unfortunately real-life examples are almost impossible to troubleshoot with the basic tools.

Security is key for us and I need to know exactly what people do on my network

Who is accessing internal file shares? From the outside of my network, really?

How about browser-based file shares (e.g. Dropbox)? Is a vast amount of internal material being copied over to those?

Are your users downloading copyrighted content from P2P media and storing it on company-owned assets, e.g. their workstations?

Do you have unusual traffic from unexpected countries? What is this traffic about?

You already have an Advanced Traffic Analysis product (e.g. DPI-based solution)? Tell us what you think about it...

  • What use case(s) does it meet for you (from list above or other)?
  • Do you consider it cost effective?
  • How important is it to have these products integrated into a platform such as an NMS / IT infrastructure management platform like Orion?
  • Would you consider a solution integrated with Orion that would address your most important needs at a fraction of the cost? Or do you need the full power of those expensive solutions? Tell us what the minimum bar is!

How does encrypted traffic impact the effectiveness of your Advanced Traffic Analysis product (e.g. DPI-based solution)?

If you currently run some form of Advanced Traffic Analysis product (e.g. DPI-based solution), how does encrypted traffic impact it?

Did encrypted traffic dictate where you deployed your packet capture probes? Are there areas of your network carrying encrypted traffic that you are blind to because of that?

Sniffing traffic, ok, but which one?


Let us know about your most important traffic types, on which to perform Advanced Traffic Analysis:

  • A) LAN traffic on corporate / Data Center (internal)
  • B) LAN traffic on remote site
  • C) WAN traffic (general)
  • D) WAN optimized traffic (e.g. Cisco WAAS, Riverbed, Bluecoat)
  • E) VM to VM traffic
  • F) Load balanced traffic (e.g. Cisco ACE, F5)
  • G) Virtualized traffic (e.g. analyse traffic in/out of an application hosted by a Cloud SP)
  • H) DMZ traffic

Sniffing traffic, ok, but how?

Agents, SPAN, TAP or RITE?

  • The agent-based technique is about running agents on the OS - virtual or not - that hosts your mission critical applications. If your focus is the traffic that goes to your applications (as opposed to looking at all traffic, including the fully meshed one that goes across all your sites), then agents are a good solution because they make the 3 techniques below unnecessary. But they require an invasive action on your OS by adding a component to it: what CPU/memory do they consume on the OS? Are they really only looking at the traffic? How do we upgrade hundreds of them deployed across your server farm?
  • The Port Spanning / Mirroring technique is basically about hijacking one port of your switch and dedicating it to mirroring all or part of the switch's traffic. The management product then listens to packets on this port and performs analysis, storage...

Pretty simple, because most switches support this now: just a command to issue and a cable to connect, and you can start drinking from the fire hose. But it may have an impact on the switch, and it can't guarantee that 100% of packets are captured on very heavily loaded switches.
Note that spanning a port is possible on a HW switch but also on a vSwitch within a virtual environment.

  • Network Taps are basically pieces of hardware that you buy to replace the switch for this function (capturing packets). They are placed inline and have no influence on the switch and pretty much capture 100% of the traffic.
  • RITE (Router IP Traffic Export) is a Cisco technology similar to spanning a port on a switch, except that it's done on a router. Again, easy to deploy since it leverages an existing device, but it impacts that device and won't work great at high speed. See this nice and short blog post.

Tell us about your experience, your preferences, what's allowed and not allowed on your networks, as far as capturing packets for Advanced Traffic Analysis.

  • The idea of DPI through sniffing vs Netflow is simply the ability to send Netflow from remote sites as well vs sniffing sending ...

    Netflow is for network management and statistics vs sniffing, which is a different product aimed at deeper analysis of the packets themselves.

    Netflow FNF (v9) is flexible and better suited to remote sites sending statistics.

    Do you have FNF support for other fields like TTL, TCP flags, etc.?

  • DPI on physical interfaces takes some major horsepower... especially on 10G ports. I suppose if the whole thing happened in the virtual world it might be possible. I'm not familiar with any all-VM DPI, though I'm sure there are some.

  • Has anyone had any luck getting DPI to run on a VM?

  • I've tried Cisco's NBAR in the past in an attempt to throttle torrent traffic which was consuming too much of our Internet bandwidth.

    I found that NBAR was not efficient at it as it was not identifying torrent flows accurately.

    So the question here is that if it could not identify for policing, how accurate will the reporting be?

    What we ended up doing was buying a DPI box which did the job.

    Another question is, what about non-Cisco traffic?

    So I don't believe we should be comparing NBAR to DPI as that would be misleading.

    A software DPI engine would be a great add-on.

    Now how would you accomplish this? You'll need to capture the traffic. Putting NTA inline is out of the question, and so the 2 remaining options are: TAP and SPAN.

    Spanning traffic is known to cause packet drops on switch interfaces as it overflows the ASICs responsible for several ports at once. So if a user configured the destination port of the SPAN on the same ASIC responsible for a critical server port, he might get into performance issues, and more discarding ports on the TOP 10 NPM page :-)

    The only option left is TAPs. Perhaps a SolarWinds branded smart TAP? :-) At that point why stop there? Integrate them with SAM, and we could have a multi-TAP application performance solution that would allow us to easily identify what hop is impacting application performance.... And perhaps the ability to integrate with Wireshark, handing off packets for analysis based on the filter criteria from NTA and/or SAM, which would be converted to a capture filter.

    The possibilities are many ... brain overheating :-)
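
For readers wondering what "converted to a capture filter" could look like, here is a minimal sketch that turns a few filter criteria into an expression in the pcap filter syntax that Wireshark accepts for capture filters. The function and its parameters are hypothetical; nothing here reflects an actual NTA or SAM feature.

```python
def to_capture_filter(host=None, protocol=None, port=None):
    """Build a pcap-style capture filter string from optional criteria."""
    parts = []
    if protocol is not None:
        if protocol not in ("tcp", "udp"):
            raise ValueError("this sketch only handles tcp and udp")
        parts.append(protocol)
    if host is not None:
        parts.append(f"host {host}")
    if port is not None:
        if not 0 < port < 65536:
            raise ValueError("port out of range")
        parts.append(f"port {port}")
    # Every criterion must match, so the primitives are joined with "and".
    return " and ".join(parts)


print(to_capture_filter(host="10.1.1.1", protocol="tcp", port=443))
# prints: tcp and host 10.1.1.1 and port 443
```

An empty result means "capture everything", which is how an empty capture filter behaves.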

  • Great interaction, thanks all, keep it coming.

    If someone has an opinion/need about the level of protocol decoding, we'd be happy to hear it. Either describe it here, or let me know, I'll ping you off line to discuss.

    Thanks again all for your time.
