We've been using the QoE functionality within NPM to generate client reports on Network and Application responsiveness for several months now. Almost since QoE was deployed, we have been recording spikes in application response time (ART) throughout the day and evening, and I have been tasked with finding out where the issue lies.
Our topology consists of a dedicated packet analysis sensor deployed in our DMZ and connected to a SPAN port on the external VLAN. All inbound traffic for this VLAN is forwarded to an HA pair of f5 BIG-IP appliances, which perform SSL offloading as well as load balancing for the internal network beyond.
While QoE was able to provide some insights, it is difficult to see exactly where the issue lies because the entire conversation is encrypted from QoE’s POV. Thankfully, as the f5s terminate the SSL session, we decided to deploy f5’s AVR module to gain a better view of what’s happening.
Unfortunately, this further muddied the waters, as the reports generated by AVR are not comparable with those produced by QoE. For example, QoE regularly reports a much higher ART than the BIG-IP appliance does (see attached reports).
While it's clear the two reports differ in granularity (the Splunk report is more detailed because the f5 has an unencrypted view of the individual elements), the overall graphs are wildly different: the Solarwinds report peaks at 2.25 seconds, whereas the f5-reported latency peaks at less than 100 ms.
Given that the QoE sensor and the f5 appliance are privy to the same data, and both calculate response times based on the time to first byte (according to their docs), I would expect the latency timings at least to be in the same ballpark.
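To make clear what I mean by "time to first byte", here is a minimal Python sketch of how I understand the measurement for a single TCP conversation. It is only my assumption of the general approach, not either vendor's actual implementation: the `Packet` record and the `time_to_first_byte` function are hypothetical names, and I have chosen to start the clock at the last client data packet before the server replies. A tool that starts at the first request packet, or that sees decrypted HTTP rather than raw TCP payload, would report different numbers from the same traffic.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ts: float          # capture timestamp in seconds
    from_client: bool  # True if the client sent this packet
    payload_len: int   # TCP payload bytes (0 for pure ACK/SYN/FIN)


def time_to_first_byte(packets):
    """Return one response time per request/response exchange.

    Assumption: the clock starts at the last client data packet and
    stops at the first server data packet that follows it.
    """
    results = []
    last_request_ts = None
    for p in packets:
        if p.payload_len == 0:
            continue  # ignore packets that carry no application data
        if p.from_client:
            last_request_ts = p.ts  # most recent request data seen
        elif last_request_ts is not None:
            results.append(p.ts - last_request_ts)
            last_request_ts = None  # later response packets are ignored
    return results


# Made-up example: two exchanges on one connection.
example = [
    Packet(0.00, True, 100),   # request 1
    Packet(0.01, False, 0),    # server ACK, no data
    Packet(0.25, False, 500),  # first response byte, about 250 ms later
    Packet(0.26, False, 500),  # rest of the response
    Packet(1.00, True, 50),    # request 2
    Packet(1.10, False, 10),   # first response byte, about 100 ms later
]
response_times = time_to_first_byte(example)
```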
In an attempt to rule out the f5, I decided to capture a 30-minute sample of data from the QoE sensor using Wireshark. I then fed this capture into a packet analysis tool (Riverbed – SteelCentral Packet Analyzer PE) to see if we were able to replicate the above reports using this data. We were able to generate a comparable Network Response time report; however, we were not able to produce an ART report that was in any way comparable.
To further confuse the issue, I fed the capture file into a Response Time viewer tool produced by Solarwinds, and the average application response time listed was considerably lower than that of the Solarwinds QoE report covering the same period.
I am now at a loss as to which tool to trust and would like some advice. I expect the primary reason for the differences is that we are not comparing apples to apples.
I have reached out to Solarwinds support and been given documents which include links to a video explaining QoE and how RTT and ART are calculated for a single request. Unfortunately, the documentation does not detail how QoE calculates the average of this over a 5-minute period: is it sampled, or is it based on the raw output of the SPAN port? The same is true of f5: the details relating to AVR are limited.
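To illustrate why the averaging method matters, here is a rough Python sketch with made-up numbers. None of these functions reflect how QoE or AVR actually aggregate (that is exactly what I can't find documented); they are hypothetical examples of three plausible ways to average the same 5-minute window.

```python
from statistics import mean


def mean_of_all(samples_by_conn):
    """Average every individual response time in the window."""
    return mean(v for vals in samples_by_conn.values() for v in vals)


def mean_of_means(samples_by_conn):
    """Average each connection first, then average those averages."""
    return mean(mean(vals) for vals in samples_by_conn.values())


def mean_of_sampled(samples_by_conn, every_nth):
    """Average only every n-th response time (simple sampling)."""
    flat = [v for vals in samples_by_conn.values() for v in vals]
    return mean(flat[::every_nth])


# Made-up window: one busy, fast connection and one slow request.
window = {
    "conn_a": [0.05] * 9,  # nine responses at 50 ms
    "conn_b": [2.0],       # a single 2-second response
}

all_avg = mean_of_all(window)             # roughly 0.245 s
per_conn_avg = mean_of_means(window)      # roughly 1.025 s
sampled_avg = mean_of_sampled(window, 5)  # roughly 0.05 s, spike missed
```

With identical raw data, the three approaches give figures that differ by more than an order of magnitude, which is the sort of gap I am seeing between the reports.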
Has anyone else had similar issues generating consistent ART reporting between vendors?
Any advice or guidance would be appreciated.
Thanks,
Rob