
Is UX Monitoring the future of Network Monitoring?

Level 9

Is User Experience (UX) monitoring going to be the future of network monitoring? I think the changing nature of networking means our devices will be able to tell us much more about what's going on, and that will change the way we think about network monitoring.


Historically we've focused on device and interface stats. Those tell us how our systems are performing, but not much about the end-user experience. SNMP is great for collecting those counters, but it says very little about the applications.


NetFlow made our lives better by giving us visibility into the traffic mix on the wire, but it couldn't say much about whether the application or the network was the pain point. We need to go deeper into analysing traffic. We've done that with network sniffers, and tools like SolarWinds Quality of Experience help make it accessible, but we could only look at a limited number of points in the network. Typical routers and switches don't look deep into the traffic flows, and can't tell us much.


This is starting to change. The new SD-WAN (Software-Defined WAN) vendors do deep inspection of application traffic and use it to decide how to steer traffic. This means they've got all sorts of statistics on the user experience, and they make this data available via API. So in theory we could plug this data into our network monitoring systems to see how apps are performing across the network. The trick will be getting those integrations to work, and making sense of it all.
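
To make the integration idea concrete, here is a minimal sketch of polling a hypothetical SD-WAN controller API and handing the stats to a monitoring system. The URL, endpoint path, token, and field names are all invented for illustration; each vendor's real API and data format will differ, which is exactly the integration problem described above.

```python
# Sketch only: pull per-application performance stats from a hypothetical
# SD-WAN controller API and forward them to a monitoring system.
import requests

SDWAN_API = "https://sdwan.example.com/api/v1"   # hypothetical controller URL
API_TOKEN = "changeme"                            # hypothetical auth token

def fetch_app_performance(app_name):
    """Ask the (hypothetical) controller for stats on one application."""
    resp = requests.get(
        f"{SDWAN_API}/app-performance",
        params={"application": app_name},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: {"latency_ms": 42, "loss_pct": 0.1, "jitter_ms": 3}
    return resp.json()

def push_to_monitoring(app_name, stats):
    """Hand the stats to the NMS. Here we just print; a real integration
    would write to the monitoring system's API, database, or metrics pipeline."""
    print(f"{app_name}: latency={stats['latency_ms']}ms "
          f"loss={stats['loss_pct']}% jitter={stats['jitter_ms']}ms")

if __name__ == "__main__":
    for app in ("http", "sql", "voice"):
        push_to_monitoring(app, fetch_app_performance(app))
```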


There are many challenges in making this all work. Right now each SD-WAN vendor has its own APIs and data exchange formats. We don't yet have standardised measures of performance either. Voice has MOS, although there are arguments about how valid it is. We don't yet have an equivalent for apps like HTTP or SQL.
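
For context on what an agreed-upon score looks like, here is the standard E-model mapping (ITU-T G.107) from the transmission rating factor R to an estimated MOS; it's the kind of shared yardstick that voice has and that HTTP or SQL currently lack.

```python
def r_factor_to_mos(r):
    """Convert an E-model R factor (0-100) to an estimated MOS (1.0-4.5)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# Example: an R factor of 80 maps to a MOS of roughly 4.0.
print(round(r_factor_to_mos(80), 2))
```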


Standardising around SNMP took time, and it can still be painful today. But I'm hopeful that we'll figure it out. How would it change the way you look at network monitoring if we could measure the user experience from almost any network device? Will we even be able to make sense of all that data? I sure hope so.

11 Comments
Level 15

I'll bite.

I try to "eat my own dog food": if I'm suggesting to an application or network owner that we can better monitor their world, it's because I've already tested that feature on my own Orion apps, servers, or network. I've found it easy to apply the SAM monitors for Orion, modify one for the Polling Engines, add the MSMQ monitors, and add TCP 1433 and TCP 17777 port checks to various Orion servers. While this has been helpful, if we apply the "let's get beyond SNMP and WMI" mantra then we find ourselves more in the NTA and QoE tools. Still thinking of monitoring Orion's network UX, it is possible to set up NTA reports and dashboards filtered to only Orion application activity, but I've personally only used that when links were near saturation to confirm the top 10 culprits, rather than for deep-dive UX monitoring. Maybe that's down to our sampling rate?
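
As a rough illustration of the TCP port checks mentioned above (1433 for SQL Server, 17777 for Orion module traffic), here is a standalone sketch of the same idea outside SAM. The hostname is hypothetical, and this is not the built-in SAM component monitor itself, just the underlying check.

```python
# Sketch: confirm a TCP port answers and measure how long the connect takes.
import socket
import time

def check_tcp_port(host, port, timeout=5.0):
    """Return the connect time in milliseconds, or None if the port is closed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

for port in (1433, 17777):
    rtt = check_tcp_port("orion.example.local", port)   # hypothetical hostname
    status = f"{rtt:.1f} ms" if rtt is not None else "unreachable"
    print(f"TCP {port}: {status}")
```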

On the other hand, the moment we got our hands on SolarWinds' QoE feature (frankly, even during the beta period) we saw immediate value. I used the 10 free server-based capture agents on all our Orion servers and immediately had dashboards showing which servers were seeing more network (WAN) delay and which protocols were themselves adding tremendous delay, like MSMQ. From a true UX perspective I now know we should be seeing about 1.1 million SNMP packets a day travel between our Orion servers, or that our CIFS network latency is 6ms versus a protocol latency of 278ms.

[Screenshot: Screen Shot 2015-03-09 at 10.27.37 AM.png]


Cool stats, but not helpful unless we can baseline and report on changes in behaviour. For just this purpose SolarWinds has included the Dynamic Baseline calculation used in SAM for QoE applications as well. To set our baseline, open the QoE settings for that application, expand the Thresholds section, check the box to Override Orion General Thresholds, and then select the Use Dynamic Baseline Thresholds button for simple high-water thresholds based on the last 7 days of actual activity. For SNMP I might want to override the operator from greater than to less than, and use the Latest Baseline Details link to dig deeper and set the values at -2 and -3 sigma for low-water-mark thresholds.
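
For readers unfamiliar with dynamic baselines, here is a rough sketch of what the calculation boils down to, assuming hourly samples of a QoE metric over the last 7 days. This is not SolarWinds' actual algorithm, just the usual mean ± k·sigma idea the comment describes, including the inverted (low-water) case for metrics like SNMP packet counts where a sudden drop is the problem.

```python
# Sketch: derive warning/critical thresholds from a week of samples.
from statistics import mean, stdev

def baseline_thresholds(samples, invert=False):
    """Return (warning, critical) thresholds from recent samples.

    High-water marks (mean + 2/3 sigma) suit metrics where more is bad,
    such as latency. Set invert=True for low-water marks (mean - 2/3 sigma),
    such as SNMP packet counts where a drop signals trouble.
    """
    mu, sigma = mean(samples), stdev(samples)
    k = -1 if invert else 1
    return mu + k * 2 * sigma, mu + k * 3 * sigma

# Example values are made up for illustration.
latency_ms = [250, 260, 270, 255, 300, 280, 265]
warn, crit = baseline_thresholds(latency_ms)
print(f"latency warn>{warn:.0f}ms crit>{crit:.0f}ms")

snmp_pkts = [1_100_000, 1_050_000, 1_150_000, 1_120_000, 1_080_000, 1_090_000, 1_110_000]
warn, crit = baseline_thresholds(snmp_pkts, invert=True)
print(f"snmp warn<{warn:,.0f} crit<{crit:,.0f}")
```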



[Screenshot: Screen Shot 2015-03-09 at 10.14.01 AM.png]


QoE has been better than NTA for our UX monitoring because it has no sampling rate and because of the simple placement of capture agents/taps. So for now, if an application owner wants app and network performance baselining and alerting, the cost is an agent or a network tap. That said, if everyone in our global company came to us wanting taps or agents on all segments and servers, it would not be reasonable. If I could pull the same DPI data from SD-WAN and display it within Orion, integrated into the dynamic thresholds, alerting, and reporting, without the need for packet capture, then I would be all in.


Level 10

I think you hit the nail on the head. I can't count how many times monitoring statistics looked fine while we were receiving tickets for poor performance. With technology and architectures becoming more and more redundant, simple fault management doesn't go far enough in staying proactive about the IT services delivered to our clients. The era where users worried mostly about being able to connect at all has passed since the introduction of ubiquitous access through our cell phones. It's all about speed now.

UX monitoring is truly the future of monitoring. To do this effectively I believe it really needs to be seen from the application's perspective moving across the network; you simply can't use only the network's perspective. That's why I like the WPM/NPM combo. You need some method to measure the user experience regardless of the metrics on the network and server, and then a method for finding the anomalies when performance is not being delivered. Frankly, I think the coolest thing in the world would be to see the entire transaction in WPM across each device: performance hop to hop, then the server turnaround time, then the database call performance, and finally the return trip back across the network.

I wish WPM could measure client application performance effectively...

Level 13

I believe QoE/IP SLA/whatever is getting more attention for multiple reasons.

First, IT really is about delivering a service to the end user so we need some way to monitor and verify that service delivery.

Second, from a network support viewpoint, monitoring the user experience is necessary to stop those annoying "the network is slow" calls.

Level 20

I think the whole app-stack kind of display at least gets us in the direction of being able to answer why "the network is slow" (which is often not the network at all; it's some kind of storage problem, or a poorly written stored procedure in some database web app).

In addition to the app stack, we can always use SAM to actually run a test from specific users' workstations around the WAN and keep stats over time in NPM. (Granted, depending on your applications this might involve some scripting for the UX monitor, which runs and is then timed by SAM; see the sketch below.)
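
Here is a minimal sketch of that kind of scripted UX check: time one synthetic transaction from a user's workstation and emit the result so a scheduler or a SAM script monitor can record it over time. The URL is hypothetical, and the exact output format SAM expects should be confirmed against its script-monitor documentation.

```python
# Sketch: synthetic transaction timer intended to be run by a script monitor.
import sys
import time
import urllib.request

URL = "https://intranet.example.local/login"   # hypothetical application URL

def timed_fetch(url, timeout=15):
    """Return the elapsed time in milliseconds for one synthetic request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return (time.monotonic() - start) * 1000.0

try:
    elapsed_ms = timed_fetch(URL)
    print(f"Statistic: {elapsed_ms:.0f}")   # value the monitor records over time
    sys.exit(0)
except Exception as exc:
    print(f"Message: {exc}")
    sys.exit(1)
```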

The new DPI agents can also evolve to really help with this. For me it's going to really help with getting additional stats from inside DMZs behind Layer 7 firewalls and IPS that are small enough that they will never warrant their own NPM poller. As long as I can get security to buy into that one encrypted tunnel port, I'm good.

MVP

I think that regardless of what we monitor and stick on dashboards, we will always need the base-level data exposed to the monitoring system in order to diagnose an issue.

AppStack is a great start at enabling this kind of presentation. Seeing this as v1, and knowing how rapidly this stuff matures at SolarWinds, the future seems very exciting.

Level 9

That is very cool - I don't often do that level of monitoring on the NMS itself, and I wouldn't have thought of using QoE on the Orion server itself. I may have to 'steal' that idea for some future implementations.

Level 9

Yes, you're quite right that we will always need the base data to be able to diagnose the problems. I've found that collecting the user experience data (either through some form of synthetic transactions, or monitoring real traffic) helps with telling that there is a problem somewhere. Sometimes those investigations then lead you to learn that there's some other base-level metric you need to be collecting.

AppStack does look good. There's a lot of work to pull everything together, and to get useful information out of it. Hopefully they've got the right base pieces in place so they can quickly iterate on it.

Level 11

Ultimately user experience is really the only thing that matters. Yes, we can monitor all the individual nodes and interfaces and such, but the true indicator is always going to be the user experience. This becomes even more true if your apps have redundancy throughout the layers, including the network. At that point I care a lot less if one device out of a pool fails. Is the user experiencing issues? No? Then I can fix it at my leisure. The user doesn't care about a node or an interface or a device; they only care whether they can do what they need to do. The detailed information is for those of us in the background who need to keep things running.

Visit any big service provider like Microsoft or Google or Facebook and ask them if they care when a device goes down. If you spent time touring one of their container operations you'd see there are many failed nodes in the system. Sometimes they don't replace them for days, because with such a high level of redundancy it doesn't matter. Ultimately it's about the UX. Go into a small mom-and-pop shop and yes, the individual components or nodes become much more critical; at that level UX is a lower concern because any failure results in an outage. AppStack is nice and all, and it helps with the troubleshooting aspect, but a tool like WPM is much more indicative of performance. Our app teams are getting more and more involved with our WPM because it gives them quantifiable performance data for a web app globally.

MVP

The UX aspect of monitoring is just part of the puzzle.

Yes, you need it, as well as the other tools, to get the complete picture.

Over time you can correlate certain patterns with specific conditions.

Sometimes the first indicator of trouble is the user experience; sometimes it is hardware, or something not as obvious in the network.

You continually need different views into the forest, otherwise your view will always be blocked by the same trees...

Level 10

As IT is viewed by CIOs and CTOs as a service department within their businesses, UX becomes increasingly important to them as well. Satisfaction with the services we deliver to our customers (whether external or internal) is going to become part of the key success indicators we are evaluated against. We must be willing to adapt and to adopt those technologies that let us see more than just a green/red view of our systems. Slow is down, and user experience data is an important part of identifying the impact of a hardware or software failure or degradation.

I think the greater question and debate is about synthetic user data vs. real user data.

Level 9

I figure that you need both: synthetic transactions to provide a known baseline, and then real user data so that you can see what's going on across a wide range of end users. Sometimes you don't have enough real user transactions to get meaningful data, though - e.g. for low-traffic sites, real-user stats can be skewed by a few outliers.
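
As a small, made-up illustration of that outlier problem: with only a handful of real-user samples, a couple of slow requests drag the mean far away from what most users actually saw, and with so few data points even percentiles get noisy, which is where a synthetic baseline helps.

```python
# Sketch: how a few outliers distort the mean on a low-traffic site.
from statistics import mean, median

page_load_ms = [310, 290, 305, 300, 295, 315, 8200, 7900]  # two outliers

print(f"mean   : {mean(page_load_ms):.0f} ms")    # ~2239 ms, dominated by two slow requests
print(f"median : {median(page_load_ms):.0f} ms")  # ~308 ms, closer to what most users saw
```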

About the Author
Lindsay is a network & security consultant based in New Zealand.