Last week, we talked about monitoring the network from different perspectives. By looking at how applications perform from different points in the network, we get an approximation of the users' experience. Unfortunately, most of those tools are short on detail about why there's a problem, or are limited in what they can test.

On one end of our monitoring spectrum, we have traditional device-level monitoring. This tells us everything we need to know that is device-specific. On the other end, we have the application-level monitoring discussed over the last couple of weeks, which approximates how end users see their applications performing. The former gives us a hardware perspective and the latter a user perspective. A view of the network as a whole still lies somewhere in between.

Using testing agents and responders at varying levels of the network can provide that intermediate view. They allow us to test against all manner of traffic, measuring network latency and its variation (jitter).
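To make the idea concrete, here's a minimal sketch of an agent/responder pair in Python that measures round-trip latency and jitter over UDP. It's purely illustrative: the port number, probe count and jitter calculation are my own assumptions, and in practice we'd lean on the built-in device features described next rather than host scripts.

```python
# latency_probe.py - minimal agent/responder pair measuring round-trip latency
# and jitter over UDP. Illustrative stand-in for built-in device features
# (IP SLA, RPM, NQA); the port and probe count are arbitrary choices.
import socket
import statistics
import sys
import time

PORT = 50007        # arbitrary UDP port for test traffic (assumption)
PROBE_COUNT = 20
TIMEOUT_S = 2.0

def responder(bind_addr="0.0.0.0"):
    """Echo every probe straight back to whoever sent it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, PORT))
        while True:
            data, peer = sock.recvfrom(1024)
            sock.sendto(data, peer)

def agent(responder_addr):
    """Send probes, record round-trip times, report average latency and jitter."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT_S)
        for seq in range(PROBE_COUNT):
            sent = time.monotonic()
            sock.sendto(str(seq).encode(), (responder_addr, PORT))
            try:
                sock.recvfrom(1024)
                rtts.append((time.monotonic() - sent) * 1000.0)  # milliseconds
            except socket.timeout:
                pass  # a lost probe counts against delivery, not latency
            time.sleep(0.1)
    if not rtts:
        print("no replies received")
        return
    # Jitter here is the mean difference between consecutive RTTs (one common definition).
    jitter = statistics.mean(abs(a - b) for a, b in zip(rtts, rtts[1:])) if len(rtts) > 1 else 0.0
    print(f"sent={PROBE_COUNT} received={len(rtts)} "
          f"avg_rtt={statistics.mean(rtts):.2f}ms jitter={jitter:.2f}ms")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: latency_probe.py respond | latency_probe.py <responder_ip>")
    responder() if sys.argv[1] == "respond" else agent(sys.argv[1])
```

Run it with "respond" on one host and with the responder's address on the other to get a quick latency and jitter readout for the path between them.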

Agents and Responders

Most enterprise network devices have built-in functions for initiating and responding to test traffic. These allow us to test and report on the latency of each link from the device itself. Cisco and Huawei have IP Service Level Agreement (SLA) processes. Juniper has Real-Time Performance Monitoring (RPM) and HPE has its Network Quality Analyzer (NQA) functions, just to list a few examples. Once configured, we can read the data from them via Simple Network Management Protocol (SNMP) and track their health from our favourite network monitoring console.
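As a rough illustration of the polling side, here's a sketch that reads a single test result over SNMP using the pysnmp library. The target address, community string and OID are assumptions on my part: the OID shown is what I'd expect for CISCO-RTTMON-MIB's latest-operation completion time for one SLA entry, and other vendors' MIBs will use different objects entirely, so substitute the object from your own MIB files.

```python
# poll_sla.py - pull the latest test result from a device over SNMP (sketch).
# Assumes pysnmp and an SNMPv2c read community. The OID below is assumed to be
# the CISCO-RTTMON-MIB latest-completion-time object for SLA entry 10; check
# your vendor's MIB for the correct object on your platform.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity,
)

TARGET = "192.0.2.1"       # device running the agent tests (example address)
COMMUNITY = "public"       # read-only community string (assumption)
SLA_ENTRY = 10             # SLA entry number configured on the device
LATEST_RTT_OID = f"1.3.6.1.4.1.9.9.42.1.2.10.1.1.{SLA_ENTRY}"

error_indication, error_status, _, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData(COMMUNITY),
        UdpTransportTarget((TARGET, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity(LATEST_RTT_OID)),
    )
)

if error_indication or error_status:
    print(f"poll failed: {error_indication or error_status.prettyPrint()}")
else:
    for name, value in var_binds:
        print(f"{name.prettyPrint()} = {value.prettyPrint()} ms")
```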

Should we be in the position of having an all-Cisco shop, we can have a look at SolarWinds' IP SLA Monitor and VoIP and Network Quality Manager products to simplify setting things up. Otherwise, if our vendor doesn't have something similar, we're looking at a more manual process.

Levels

Observing test performance at different levels gives us reports of differing granularity. By running tests at the organization, site and link levels, we can start with the big-picture metrics and work our way down to specific problems.

Organization

Most organization-level tests will be installed at the edge devices or close to them. They will perform edge-to-edge tests against a device at the destination organization or cloud hosting provider. There shouldn't be too many of these tests configured.

Site

Site-to-site tests will be configured close to the WAN links and will monitor overall connectivity between sites. The point of these tests is to give a general perspective on intersite traffic, so they shouldn't be installed directly on the WAN links. Depending on our organization, there could be none of these or a large number.

Link

Each network device gets a test for each of its routed links to other network devices to measure latency. This is where the largest number of tests is configured, but it is also where we'll find the most detail.
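One way to keep that large number of link-level tests manageable is to generate them from an inventory rather than configuring them by hand. The sketch below is hypothetical: the inventory dictionary, entry numbering and probe type are placeholders for whatever your CMDB or monitoring system actually provides.

```python
# link_tests.py - enumerate one latency test per routed link (illustrative sketch).
# The inventory below is hypothetical; in practice it would come from a CMDB or
# the monitoring system's API.
inventory = {
    "core-rtr-01": [("Gi0/0", "10.0.0.2"), ("Gi0/1", "10.0.1.2")],
    "core-rtr-02": [("Gi0/0", "10.0.0.1"), ("Gi0/1", "10.0.2.2")],
}

def link_test_definitions(inventory):
    """Yield one test definition per routed link, keyed for later polling."""
    entry = 100  # starting SLA/RPM/NQA entry number (arbitrary choice)
    for device, links in sorted(inventory.items()):
        for interface, peer_ip in links:
            yield {
                "device": device,
                "entry": entry,
                "interface": interface,
                "target": peer_ip,
                "probe": "icmp-echo",   # or a jitter probe where both ends support it
                "frequency_s": 60,      # keep this modest; see Caveats below
            }
            entry += 1

for test in link_test_definitions(inventory):
    print(test)
```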

Caveats

Agent and responder testing isn't passive. There's always the potential for unwanted problems caused by implementing the tests themselves.

Traffic

Agent and responder tests introduce extra traffic to the network. While that traffic shouldn't be significant enough to cause an impact, there's always the possibility that it will. We need to keep an eye on the interfaces and queues to be sure there isn't any significant change.
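A simple way to check is to compare interface counter deltas before and after the tests are enabled. The sketch below assumes samples taken from the standard IF-MIB octet counters; the numbers are made up purely for illustration.

```python
# interface_delta.py - rough check that test traffic isn't moving the needle (sketch).
# Counter values are assumed to come from ifInOctets/ifOutOctets polls (IF-MIB);
# the sample numbers below are invented for illustration.
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=32):
    """Convert a byte-counter delta to bits per second, allowing for one counter wrap."""
    wrap = 2 ** counter_bits
    delta = (curr_octets - prev_octets) % wrap
    return delta * 8 / interval_s

baseline = rate_bps(1_200_000, 1_450_000, 60)    # sample pair before tests enabled
with_tests = rate_bps(9_300_000, 9_556_000, 60)  # sample pair after tests enabled
change_pct = (with_tests - baseline) / baseline * 100
print(f"baseline={baseline:.0f} bps with_tests={with_tests:.0f} bps change={change_pct:+.1f}%")
```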

Frequency and Impact

Running agents and responders on the network devices themselves is going to consume additional CPU cycles. Network devices as a whole are not known for having a lot of spare processing capacity, so the frequency of these tests may need to be adjusted to factor that in.

Processing Delay

Related to the previous point, most networking devices aren't going to perform these tests quickly. The results may need a bit of a "fudge factor" at the analysis stage to account for the device's own processing delay.
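If we can estimate that processing delay for a given platform (for example, by comparing the device-reported RTT against a packet capture or host-based probe over the same path), the correction at analysis time is trivial. The offset below is a hypothetical calibration value, not a vendor figure.

```python
# Adjusting reported RTTs for slow on-device test processing (illustrative sketch).
PROCESSING_OFFSET_MS = 1.5   # hypothetical per-platform calibration value

reported_rtts_ms = [12.4, 11.9, 13.1, 12.7]   # values read from the device
adjusted = [max(rtt - PROCESSING_OFFSET_MS, 0.0) for rtt in reported_rtts_ms]
print(adjusted)
```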

The Whisper in the Wires

Having a mesh of agents and responders at the different levels can provide point-in-time analysis of latencies and soft failures throughout the network. But it needs to be managed carefully to avoid negative impacts on the network itself.

Thanks to Thwack MVP byrona for spurring some of my thinking on this topic.

Is anyone else building something along these lines?