Network Performance Monitoring And Dynamic Workloads

Level 11

Network performance monitoring feels a bit like a moving target sometimes. Just as we normalize processes and procedures for our monitoring platforms, some new technology comes around that turns things upside down again. The most recent change that seems to be forcing us to re-evaluate our monitoring platforms is cloud computing and dynamic workloads. Many years ago, a service lived on a single server, or on several if it was really big. It may or may not have had redundant systems, but ultimately you could count on any traffic to or from that box being related to that particular service.

That got turned on its head with the widespread adoption of virtualization. We started hosting many logical applications and services on one physical box. Network performance to and from that one server was no longer tied to a specific application, but generally speaking, these workloads remained in place unless something dramatic happened, so we had time to troubleshoot and remediate issues when they arose.

In comes the cloud computing model, DevOps, and the idea of an ephemeral workload. Rather than have one logical server (physical or virtual), large enough to handle peak workloads when they come up and highly underutilized otherwise, we are moving toward containerized applications that are horizontally scaled. This complicates things when we start looking at how to effectively monitor these environments.

So What Does This Mean For Network Performance Monitoring?

The old way of doing things simply will not work any longer. Assuming that a logical service can be directly associated with a piece of infrastructure is no longer possible. We’re going to have to create some new methods, as well as enhance some old ones, to extract the visibility we need out of the infrastructure.

What Might That Look Like?

Application Performance Monitoring

This is something that we do today, and SolarWinds has an excellent suite of tools to make it happen. What needs to change is our perspective on the data that these tools are giving us. In our legacy environments, we could poll an application every few minutes because not a lot changed between polling intervals. In the new model of system infrastructure, we have to assume that the application is scaled horizontally behind load balancers and that any single poll touches only one of many deployed instances. Application polling and synthetic transactions will need to happen far more frequently to give us a broader picture of performance across all instances of that application.
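To put a rough number on why a single poll through a load balancer says so little, here is a minimal sketch. It assumes the load balancer routes each poll to one of the instances uniformly at random, which is a simplification (round-robin or least-connections balancing behaves differently). The function names are made up for illustration and do not come from any monitoring product.

```python
import random


def expected_coverage(instances: int, polls: int) -> float:
    """Expected fraction of instances touched by at least one poll.

    Assumption: each poll lands on an instance chosen uniformly at random,
    so one instance is missed by a single poll with probability 1 - 1/n.
    """
    if instances <= 0 or polls < 0:
        raise ValueError("instances must be > 0 and polls must be >= 0")
    return 1.0 - (1.0 - 1.0 / instances) ** polls


def simulate_coverage(instances: int, polls: int,
                      trials: int = 2000, seed: int = 0) -> float:
    """Estimate the same fraction by simulating random routing."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Set of distinct instances hit during one polling window.
        touched = {rng.randrange(instances) for _ in range(polls)}
        total += len(touched) / instances
    return total / trials


if __name__ == "__main__":
    for polls in (1, 5, 10, 30):
        print(polls, round(expected_coverage(10, polls), 3))
```

Under that assumption, one poll against ten instances covers about a tenth of the pool on average, and even ten polls are expected to leave roughly a third of the instances unseen, which is the argument for polling more often or polling each instance directly.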


Rather than relying on polling to tell us about new configurations, instances, and deployments on the network, we need the infrastructure to tell our monitoring systems about changes directly. Push rather than pull works much better when changes happen often and may be transient. We see a simple version of this in syslog today, but we need far better automated intelligence to help us correlate events across systems and analyze the data coming into the monitoring platform. This data will then need to be associated with our traditional polling infrastructure to understand the impact of a piece of infrastructure going down or misbehaving. This will likely also include heuristic analysis to determine baseline operations and variations from that baseline. Manually reading logs every morning isn’t going to cut it as we move forward.
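As an illustration of what "baseline and variation from that baseline" could mean in practice, here is a minimal sketch of a rolling baseline fed by pushed measurements. The class name, the window size, and the three-standard-deviation threshold are all assumptions chosen for the example, not how any particular monitoring platform does it. Real heuristic analysis would also have to account for seasonality and correlated events.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags values that deviate sharply from a recent rolling window.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self.samples = deque(maxlen=window)  # most recent measurements
        self.threshold = threshold           # allowed deviation, in std devs
        self.min_samples = min_samples       # wait for this much history

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous, then add it to the window."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma == 0:
                # Perfectly flat history: any change counts as a deviation.
                anomalous = value != mu
            else:
                anomalous = abs(value - mu) > self.threshold * sigma
        # Simplification: anomalous values also become part of the baseline.
        self.samples.append(value)
        return anomalous


if __name__ == "__main__":
    baseline = RollingBaseline()
    # Pretend these are response times (ms) pushed by instances as events.
    for latency_ms in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 50]:
        if baseline.observe(latency_ms):
            print("deviation from baseline:", latency_ms)
```

The point of the sketch is that the check runs as each event arrives, so a transient instance that misbehaves for two minutes can still be flagged even if no poll ever reaches it.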

Traditional Monitoring

This doesn’t go away just because we’ve complicated things with a new form of application deployment. We still will need to keep monitoring our infrastructure for up/down, throughput, errors/discards, CPU, etc.
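For the throughput part of that list, the usual approach is to sample an interface octet counter twice and convert the difference into utilization. Here is a minimal sketch, assuming the samples come from a counter such as IF-MIB's 64-bit ifHCInOctets (or the 32-bit ifInOctets) and that a lower second reading means the counter wrapped exactly once. A device reset would look the same, so real collectors need extra checks. The function names are made up for illustration.

```python
def counter_delta(prev: int, curr: int, bits: int = 64) -> int:
    """Difference between two counter samples, allowing for one wrap.

    Assumption: if curr < prev, the counter wrapped exactly once.
    """
    if curr >= prev:
        return curr - prev
    return curr + (1 << bits) - prev


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, link_bps: float,
                        bits: int = 64) -> float:
    """Average link utilization over the interval, as a percentage."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    octets = counter_delta(prev_octets, curr_octets, bits)
    return (octets * 8) / (interval_s * link_bps) * 100.0


if __name__ == "__main__":
    # 37,500,000 octets in 5 minutes on a 100 Mbps link.
    print(utilization_percent(0, 37_500_000, 300, 100_000_000))
```

Shorter sampling intervals make this average more useful for dynamic workloads, since a five-minute mean can hide a burst that lasted only as long as the instance that caused it.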

Final Thoughts

Information Technology is an ever-changing field, so it makes sense that we’re going to have to adjust our methods over time. Some of these changes will be in how we implement the tools we have today, and some of them are going to require our vendors to give us better visibility into the infrastructure we’re deploying. Either way, these types of challenges are what makes this work so much fun.

Level 21

Great post, and I think you are spot on about the model needing to change. I recently had the opportunity to attend a Nike Tech talk with Charity Majors as the speaker, and she was talking about this as well. I thought she did a brilliant job of noting how traditional monitoring is all configured to capture known failure modes, and that with the complexity and rate of change in modern environments this no longer works because we don't know what the failure modes are going to be. Black swan events are becoming less the exception and more the norm, so we need monitoring technology that can identify them, and one major key to that is, as you suggested, telemetry or event-driven analysis. I would love to see more machine learning and/or heuristic analysis of this type of data to help us better monitor using this model.

Level 15

Aren't we always in a state of flux with our monitoring? New systems, applications, processes, and then the need to monitor, manage, and track.

Level 16

Good article

Level 9

Great post jordan.martin

Level 9

Very knowledgeable post

Level 19

AppStack and PerfStack are getting pretty close to some of this now.

Level 15

Good article, thanks for posting.

Level 16

It is interesting that in 2017 the core monitoring components still gravitate to Traps, Messages, and SNMP. WMI is great, but it isn't without its flaws. I foresee that the next generation of true monitoring will be closer to UX. Breakthroughs like AppStack and NetPath are steps towards that.

Level 13

I wonder what the next high-end monitoring solution will look like (from SolarWinds, of course).

Level 21

Perhaps many are in industries and positions similar to mine, where the absolute minimum acceptable level of performance is your absolute best work.  No, I'm not an eye surgeon or a nuclear physicist controlling dangerous isotopes.  But I design, install, and maintain the network that lets them do their jobs.

Automation & monitoring, when done correctly, are a boon to everyone.

Regarding real-time push notifications from Telemetry (or other) systems, I sometimes see folks who don't set up Traps due to the complexity and their lack of understanding how it must be done, and what to avoid to keep your information flow useful and not simply overwhelming.

Between syslog and trap solutions, a lot of good can be achieved.  I'll be interested to learn what improves on them in the future, and I'll hope those new solutions will be secure, intuitive, painless to implement, and won't hog the resources of the devices or the servers or network bandwidth as they let us know what's happening.

Perhaps best will be automated solutions to replace the human interactions or reactions to the incoming information.  Where a problem occurs and is reported through syslog, trap, or some new solution, the systems to which that information is reported should be powerful enough to use automation to react correctly and remediate the issue in real time, rather than waiting for a human to be notified and to start digging into the issue and deciding how or whether to take action.

Level 15

Performance is not a single number, but a sum. For example, if you look at the latency across the WAN connection you may get a good low number, but still have slow performance. The total latency includes the time a transaction takes on the machine, the time it takes to access data from a drive, the time it takes to display the data, fill forms, etc., and a whole host of things between the raw data where it lives and its final destination with the user. That's where all the various tools come together.

Level 10

So, go acquire Dynatrace already.....  Before HP or someone who does not know what is going on grabs it and screws it all up.

It sure would work great in parallel with the SolarWinds current products.

I've looked and run others, but only those ruxit guys truly "get" the "low touch" requirement for monitoring instrumentation.  Especially from traditional operations staff.  All the others want a DevOps geek involved in getting the instrumentation setup.

Level 18

The joys of multiple instances behind a load balancer: you need to poll each instance as well as poll through the load balancer, not to mention monitor what the load balancer is saying about that pool. This gets to be fun as things are spun up and down depending on load.