Network performance monitoring feels a bit like a moving target sometimes. Just as we normalize processes and procedures for our monitoring platforms, some new technology comes along and turns things upside down again. The most recent change that seems to be forcing us to re-evaluate our monitoring platforms is cloud computing and its dynamic workloads. Many years ago, a service lived on a single server, or a few servers if it was really big. It may or may not have had redundant systems, but ultimately you could count on any traffic to/from that box being related to that particular service.


That got turned on its head with the widespread adoption of virtualization. We started hosting many logical applications and services on one physical box. Network performance to and from that one server was no longer tied to a specific application, but generally speaking, these workloads remained in place unless something dramatic happened, so we had time to troubleshoot and remediate issues when they arose.


In come the cloud computing model, DevOps, and the idea of the ephemeral workload. Rather than having one logical server (physical or virtual) that is sized for peak workloads and sits highly underutilized otherwise, we are moving toward containerized applications that scale horizontally. This complicates things when we start looking at how to effectively monitor these environments.
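As a small illustration of what that shift means for monitoring, here is a minimal sketch (assuming a hypothetical internal service name, web.example.internal) showing that "the service" now maps to a set of endpoints that can change from one lookup to the next:

```python
import socket

# Hypothetical service name; on a container platform this typically resolves
# to whichever instances happen to be running at the moment of the lookup.
SERVICE = "web.example.internal"
PORT = 443

# Each lookup may return a different, short-lived set of addresses, so tying
# monitoring data to "the server" behind this service no longer means much.
records = socket.getaddrinfo(SERVICE, PORT, proto=socket.IPPROTO_TCP)
endpoints = sorted({sockaddr[0] for *_, sockaddr in records})
print(f"{SERVICE} currently maps to {len(endpoints)} endpoint(s): {endpoints}")
```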


So What Does This Mean For Network Performance Monitoring?


The old way of doing things simply will not work any longer. We can no longer assume that a logical service maps directly to a specific piece of infrastructure. We're going to have to create some new methods, as well as enhance some old ones, to get the visibility we need out of the infrastructure.


What Might That Look Like?


Application Performance Monitoring

This is something we do today, and SolarWinds has an excellent suite of tools to make it happen. What needs to change is our perspective on the data these tools are giving us. In our legacy environments, we could poll an application every few minutes because not a lot changed between polling intervals. In the new model, we have to assume the application is scaled horizontally behind load balancers, so any given poll only touched one of many deployed instances. Application polling and synthetic transactions will need to happen far more frequently to give us a broader picture of performance across all instances of that application.
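As a rough illustration, here is a minimal sketch of a frequent synthetic transaction loop against a load-balanced front end. The URL and the X-Instance-Id response header are assumptions for the example; a real application would expose instance identity (if at all) in its own way:

```python
import time
import urllib.request

# Hypothetical front-end URL for a horizontally scaled application sitting
# behind a load balancer; any instance might answer a given request.
URL = "https://app.example.com/healthcheck"

def synthetic_check(url, timeout=2.0):
    """Run one synthetic transaction and return (latency_seconds, instance_id)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        latency = time.monotonic() - start
        # Assumes the application echoes which instance served the request
        # via a custom header; adjust to whatever your stack actually exposes.
        instance = resp.headers.get("X-Instance-Id", "unknown")
    return latency, instance

# Poll far more often than a legacy five-minute cycle so the results cover
# many instances rather than whichever one answered a single probe.
for _ in range(12):
    latency, instance = synthetic_check(URL)
    print(f"instance={instance} latency={latency * 1000:.1f} ms")
    time.sleep(5)
```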


Telemetry

Rather than relying on polling to tell us about new configurations/instances/deployments on the network, we need the infrastructure to tell our monitoring systems about changes directly. Push works much better than pull when changes happen often and may be transient. We see a simple version of this in syslog today, but we need far better automated intelligence to help us correlate events across systems and analyze the data coming into the monitoring platform. That data will then need to be tied back to our traditional polling data so we can understand the impact when a piece of infrastructure goes down or misbehaves. This will likely also include heuristic analysis to establish baseline operations and flag variations from that baseline. Manually reading logs every morning isn't going to cut it as we move forward.
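One way to approach the baselining piece is sketched below, assuming metric samples are being pushed into a collector by the devices themselves (via streaming telemetry, webhooks, or a message bus). The window size and the three-standard-deviation threshold are illustrative choices, not recommendations:

```python
import statistics
from collections import defaultdict, deque

# Rolling window of recent values per (device, metric); 288 samples would be
# one day's worth at a five-minute push interval -- purely an example choice.
WINDOW = 288
history = defaultdict(lambda: deque(maxlen=WINDOW))

def ingest(device, metric, value, threshold=3.0):
    """Handle one pushed telemetry sample and flag deviation from baseline.

    A sample is flagged when it sits more than `threshold` standard
    deviations away from the device's own recent history.
    """
    samples = history[(device, metric)]
    if len(samples) >= 30:  # need some history before a baseline means anything
        mean = statistics.fmean(samples)
        stdev = statistics.pstdev(samples)
        if stdev and abs(value - mean) > threshold * stdev:
            print(f"ALERT {device} {metric}={value} deviates from baseline {mean:.1f}")
    samples.append(value)

# In practice these calls would be driven by the telemetry stream itself
# rather than hard-coded values; this is just to show the shape of the flow.
ingest("leaf01", "if_errors_per_min", 2)
```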


Traditional Monitoring

This doesn’t go away just because we’ve complicated things with a new form of application deployment. We will still need to keep monitoring our infrastructure for up/down status, throughput, errors/discards, CPU, etc.
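Even something as basic as up/down can be checked in a few lines; this is a minimal sketch using a TCP reachability test against a couple of hypothetical device names (a real deployment would pull its target list from inventory and feed results into the monitoring platform rather than printing them):

```python
import socket

# Hypothetical infrastructure endpoints and service ports to watch.
TARGETS = [("core-sw1.example.net", 22), ("edge-rtr1.example.net", 22)]

def is_reachable(host, port, timeout=2.0):
    """Basic up/down test: can we complete a TCP handshake to the device?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in TARGETS:
    state = "up" if is_reachable(host, port) else "DOWN"
    print(f"{host}:{port} is {state}")
```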


Final Thoughts

Information Technology is an ever-changing field, so it makes sense that we’re going to have to adjust our methods over time. Some of these changes will be in how we implement the tools we have today, and some are going to require our vendors to give us better visibility into the infrastructure we’re deploying. Either way, these kinds of challenges are what make this work so much fun.