
Decentralized Monitoring: Agent/Responder Meshes

Level 11

Last week, we talked about monitoring the network from different perspectives. By looking at how applications perform from different points in the network, we get an approximation of the users' experience. Unfortunately, most of those tools offer little detail about why there's a problem, or are limited in what they can test.

On one end of our monitoring spectrum, we have traditional device-level monitoring. This tells us everything we need to know that is device-specific. On the other end, we have the application-level monitoring discussed in the last couple of weeks, which approximates how end users see their applications performing. The former gives us a hardware perspective and the latter a user perspective. The perspective of the network as a whole lies somewhere in between.

Using testing agents and responders at varying levels of the network can provide that intermediate view. They allow us to test with all manner of traffic, factoring in network latency and its variance (jitter).

Agents and Responders

Most enterprise network devices have built-in functions for initiating and responding to test traffic. These allow us to test and report on the latency of each link from the device itself. Cisco and Huawei have IP Service Level Agreement (SLA) processes. Juniper has Real-Time Performance Monitoring (RPM) and HPE has its Network Quality Analyzer (NQA) functions, just to list a few examples. Once configured, we can read the data from them via Simple Network Management Protocol (SNMP) and track their health from our favourite network monitoring console.
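As a minimal sketch, assuming a pair of Cisco IOS devices (the addresses, port and operation numbers here are purely illustrative), a udp-jitter test between an agent and a responder looks something like this:

! On the far-end device: enable the IP SLA responder
ip sla responder

! On the near-end device: define a UDP jitter probe toward the responder
ip sla 10
 udp-jitter 192.0.2.2 16384
 ! run the test every 60 seconds
 frequency 60

! Start the probe now and leave it running
ip sla schedule 10 life forever start-time now

The statistics the operation collects (round-trip time, jitter, packet loss) are then exposed through the CISCO-RTTMON-MIB for SNMP polling.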

Should we be in the position of having an all-Cisco shop, we can have a look at SolarWinds' IP SLA Monitor and VoIP and Network Quality Manager products to simplify setting things up. Otherwise, we're looking at a more manual process if our vendor doesn't have something similar.

Levels

Observing test performance at different levels gives us reports of different granularity. By running tests at the organization, site and link levels, we can start with big-picture metrics and work our way down to specific problems.

Organization

Most of these will be installed at the edge devices or close to them. They will perform edge-to-edge tests against a device at the destination organization or cloud hosting provider. There shouldn't be too many of these tests configured.
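Since we rarely control a responder inside another organization or a cloud provider, these edge probes tend to use operations that need no far-end cooperation. A Cisco IOS sketch, with a hypothetical URL:

! Probe a remote service from the edge; HTTP needs no responder
ip sla 5
 http get http://example.com/
 ! a few minutes between tests is usually enough at this level
 frequency 300
ip sla schedule 5 life forever start-time now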

Site

Site-to-site tests will be configured close to the WAN links and will monitor overall connectivity between sites. The point of these tests is to give a general perspective on intersite traffic, so they shouldn't be installed directly on the WAN links. Depending on our organization, there could be none of these or a large number.

Link

Each network device runs a test against each of its routed links to other network devices to measure per-link latency. This is where the largest number of tests are configured, but it's also where we'll find the most detail.
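A link-level probe can be as simple as an ICMP echo sourced from the link's own interface, one per routed link. Another Cisco IOS sketch with hypothetical addressing:

! One probe per routed link, sourced from that link's interface
ip sla 30
 icmp-echo 10.0.12.2 source-interface GigabitEthernet0/1
 frequency 60
ip sla schedule 30 life forever start-time now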

Caveats

Agent and responder testing isn't passive. There's always the potential for unwanted problems caused by implementing the tests themselves.

Traffic

Agent and responder tests introduce additional traffic to the network. While that traffic shouldn't be significant enough to cause impact, there's always the possibility that it will. We need to keep an eye on the interfaces and queues to be sure that there isn't any significant change.
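On Cisco IOS, for example, we can compare what the probes are sending against what the interfaces are absorbing (the interface name is illustrative):

! Confirm what the configured probes are actually measuring and sending
show ip sla statistics

! Watch for drops accumulating on the links carrying test traffic
show interfaces GigabitEthernet0/1 | include drops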

Frequency and Impact

Running agents and responders on the network devices themselves is going to consume additional CPU cycles. Network devices as a whole are not known for having a lot of spare processing capacity, so the frequency of these tests may need to be adjusted to factor that in.
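On Cisco IOS, for instance, we can check what the probes are costing and back off if needed. Note that a scheduled IP SLA operation typically has to be removed and re-created to change its parameters:

! Check whether IP SLA processing is consuming noticeable CPU
show processes cpu sorted | include SLA

! Re-create the probe with a gentler schedule if it is
no ip sla 10
ip sla 10
 udp-jitter 192.0.2.2 16384
 frequency 300
ip sla schedule 10 life forever start-time now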

Processing Delay

Related to the previous point, most network devices aren't going to execute these tests quickly. The results may need a bit of a "fudge factor" at the analysis stage to account for the devices' own processing delay.

The Whisper in the Wires

Having a mesh of agents and responders at the different levels can provide point-in-time analysis of latencies and soft failures throughout the network. But it needs to be managed carefully to avoid negative impacts on the network itself.

Thanks to Thwack MVP byrona for spurring some of my thinking on this topic.

Is anyone else building something along these lines?

14 Comments
MVP

I missed your other article, but love what you're talking about here - off to find the first one now

Building that sounds like it can provide more granular information.  But it's a very large task, and could be extended to an almost absurd solution of creating a monitor and synthetic testing client for every device on the network.  Hopefully that won't ever become necessary, but I can see where folks might wish for it.  Especially if it were affordable and had a very small footprint--a tiny monitoring client on every device, or a virtual monitor assigned to every switchport with active link.

Uffda--that thought makes me roll my eyes at the complexity of it all.

Level 11

There's no question that doing this at scale would require automation. The idea of managing this manually on anything more than a small network would be terrifying.

MVP

Yes the layers help in this...but you need to view some things from different points in your environment to get the full picture.

A simple example is a website...from the outside it looks like a single server.

Under the covers there is an F5 that load balances the website across 6 servers.

You start getting customers calling that they are having problems getting to the app.  But it is not the volume of calls you would get if say the F5 was down.  You have monitoring outside looking in to verify the website is up and running.  Every now and then it gets a failure.  Inside you have monitors for the same website but pointed at each of the 6 hosting servers. 

From this you can tell the app service as a whole is up and running.  You can tell the F5 is doing what it is supposed to do.  But the individual server monitors tell you which server is having a problem.  Now granted the F5 could possibly do the same if it recognizes an issue.

This is the standard forest and the trees event.  You have to be looking at things from different perspectives....but that always comes with a cost.

It takes resources to accomplish this, so that has to be kept in mind.

MVP

The good news is you can configure F5's to monitor the servers and take one or more out of the pool if it stops responding.  I love F5's... I wish my current company used more of them.

MVP

IP SLA is promising.  I honestly haven't done a lot with it myself, but our network team has done some and we are doing QoS now.

Level 21

One thing that is at the top of my mind: as cloud technologies grow, use of the hyper-scalers (Azure, AWS, etc.) becomes more common, and technologies like micro-services and containers also become more common, how does that change our monitoring to become more "outside in" such as this?

With these types of cloud technologies you don't have access to the infrastructure as we classically have, which creates a dilemma for monitoring that has classically been focused on and relied on that infrastructure level.  How do the monitoring tools, and our overall monitoring strategies, need to evolve to still ensure the same types of SLAs?

Level 11

Container-based agents are a promising angle here. They're easily deployed in the cloud and can provide end-to-end connectivity testing. If something like NetPath is added to the recipe, we can get a fair picture of the intermediate issues between our edge and the cloud service.

Level 21

ghostinthenet I had not heard of container-based agents, thanks for sharing that.  I certainly hope SolarWinds is working on that for Orion.

Just last week I was testing how to monitor PaaS-based SQL in Azure and unfortunately I was not able to get the level of data I was hoping for.

Level 11

byrona Now that Docker is a baked-in component of Windows Server 2016, it may be worthwhile trying to roll a SolarWinds agent into that. Once successful, it may be just a step to the right to deploy it on Azure. Have a look at "Windows Server containers on Azure Container Service private preview | Blog | Microsoft Azure" for a promising angle.

Level 21

ghostinthenet ah, that makes sense.  So then this brings me to my next question:

My understanding is that in many cases containers are designed to only exist for a very short period of time.  Assuming that to be true, we need a way for them to be monitored in a way that doesn't take any human interaction.  That or just monitor the overall service provided by those containers and don't even bother with the containers themselves.

Level 11

byrona​ It's not so much the length of time that they're designed for. They're ephemeral, so they're designed to be lightweight, easily built and torn down, and typically don't have local storage. (That last bit is starting to change, considering that the other advantages of containers are pushing their adoption in more permanent situations.) But... they can run as long as you need them to. In other words, they're perfect for an agent that pushes all of its data to a central console and keeps nothing locally.

Level 21

ghostinthenet again, thanks for this information.  The only things I really know about containers at this point are based on what I have read; I have no practical experience.  I have recently been working with some Azure-based PaaS services, which have been a challenge.

This all got me thinking about the need for a Zero Touch Deployment option for nodes in Orion, which prompted me to open the feature request.  As we move to bulk node deployment and deletion to support more cloud-based infrastructures, taking the time to individually configure each node in the monitoring system is going to be prohibitively time-consuming.

Level 11

Hm... unfortunately byrona, Thwack only allows me to hit "Like" once per comment. That needs at least five.

This definitely takes us back to the need to automate larger deployments.

About the Author
Network Greasemonkey, Packet Macrame Specialist, Virtual Pneumatic Tube Transport Designer and Connectivity Nerfherder. The possible titles are too many to count, but they don’t really mean much when I’m essentially a hired gun in the wild west that is modern networking. I’m based in the Niagara region of Ontario, Canada and operate tishco networks, a consulting firm specializing in the wholesale provisioning of networking services to IT firms for resale to their respective clientele. Over my career, I have developed a track record designing and deploying a wide variety of successful networking solutions in areas of routing, switching, data security, unified communications and wireless networking. These range from simple networks for small-to-medium business clients with limited budgets to large infrastructure VPN deployments with over 450 endpoints. My broad experience with converged networks throughout Canada and the world has helped answer many complex requirements with elegant, sustainable and scalable solutions. In addition, I maintain current Cisco CCDP and CCIE R&S (41436) certifications. I tweet at @ghostinthenet, am a Tech Field Day delegate, render occasional pro-bono assistance on sites like the Cisco Support Community and Experts' Exchange and occasionally rant publicly on my experiences by "limpet blogging" on various sites. Outside of the realm of IT, I am both a husband and father. In what meagre time remains, I contribute to my community by serving as an RCAF Reserve Officer, supporting my local squadron of the Royal Canadian Air Cadets as their Commanding Officer.