Path Monitoring In A World Of Overlays

jordan.martin over 6 years ago 5 minute read time

I remember the simpler days. Back when our infrastructure all lived in one place, usually just in one room, and monitoring it could be as simple as walking in to see if everything looked OK. Today’s environments are very different, with our infrastructure distributed all over the planet and much of it not even something you can touch with your hands. So, with the ever-increasing levels of abstraction introduced by virtualization, cloud infrastructure, and overlays, how do you really know that everything you’re running is performing the way you need it to? In networking this can be a big challenge, as we often solve technical problems by abstracting the physical path from the routing and forwarding logic. Sometimes we do this multiple times, with overlays existing within overlays that all run over the same underlay. How do you maintain visibility when your network infrastructure is a bunch of abstractions? It’s definitely difficult, but I have a few tips that should help if you find yourself in this situation.

Know Your Underlay - While all the fancy and interesting stuff is happening in the overlay, your underlay acts much like the foundation of a house. If it isn’t solid, there is no hope for everything built on top of it to run the way you want it to. Traditionally, underlay monitoring has been done with polling and traps, but the networking world is evolving, and newer systems are enabling real-time information gathering (streaming telemetry). Collecting both old and new styles of telemetry and looking for anomalies will give you a picture of the performance of the individual components that make up your physical infrastructure. Problems in the underlay affect everything, so this should be the first step you take, and the one you’re most likely familiar with, to ensure your operations run smoothly.
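As a rough illustration of "looking for anomalies" in polled data, here is a minimal Python sketch. It assumes you already have cumulative interface error counter readings from a poller; the function names, the window size, and the threshold are all hypothetical choices for illustration, not something a particular monitoring product prescribes.

```python
from statistics import mean, pstdev


def counter_deltas(readings):
    """Turn cumulative counter readings into per-interval deltas.

    Assumption: the counter never wraps or resets between readings.
    """
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


def find_anomalies(deltas, window=5, threshold=3.0):
    """Return the indexes of deltas that sit well above recent history.

    A delta is flagged when it exceeds the mean of the preceding `window`
    values by more than `threshold` standard deviations.
    """
    anomalies = []
    for i in range(window, len(deltas)):
        history = deltas[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        # Use a floor of 1.0 so a perfectly flat history (sigma == 0)
        # does not flag every tiny change.
        limit = mu + threshold * max(sigma, 1.0)
        if deltas[i] > limit:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical cumulative input-error readings from one interface.
    readings = [0, 2, 4, 6, 8, 10, 12, 500, 502]
    print(find_anomalies(counter_deltas(readings)))
```

A rolling window like this is only one simple way to spot a spike; real counters also wrap and reset, which this sketch deliberately ignores.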

Monitor Reality - Polling and traps are good tools, but they don’t tell us everything we really need to know. Discarded frames and interface errors may give us concrete proof of an issue, but they give no context for how that issue is impacting the services running on your network. Additionally, with more and more services moving to IaaS and SaaS, you don’t necessarily have access to operational data on third-party devices. Synthetic transactions are the key here. While it may sound obvious to server administrators, it might be a bit foreign for network practitioners. Monitor the very things your users are trying to do. Are you supporting a web application? Regularly send an HTTP request to the site and measure response time to completion. Measure the amount of data that is returned. Look for web server status codes and anomalies in that transaction. Do the same for database systems, and collaboration systems, and file servers… You get the idea. This is the proverbial canary in a coal mine, and it is what lets you know something is up before the users end up standing next to your desk. The reality is that network problems ultimately manifest themselves as system issues to the end users, so you can’t ignore this component of your network.
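The HTTP check described above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a full monitoring agent: `ProbeResult`, `probe_http`, `is_healthy`, the thresholds, and the example URL are all hypothetical names and values chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    url: str
    status: Optional[int]   # HTTP status code, or None if no response
    elapsed_s: float        # time until the full body was read
    body_bytes: int         # amount of data returned
    error: Optional[str] = None


def probe_http(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch `url` once and record status, response time, and body size.

    `opener` is injectable so the probe can be exercised without a network.
    """
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
            status = response.status
        return ProbeResult(url, status, time.perf_counter() - start, len(body))
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return ProbeResult(url, exc.code, time.perf_counter() - start, 0, str(exc))
    except OSError as exc:
        # URLError, timeouts, and connection failures all derive from OSError.
        return ProbeResult(url, None, time.perf_counter() - start, 0, str(exc))


def is_healthy(result, max_seconds=2.0, min_bytes=1):
    """Assumed health rule: HTTP 200, fast enough, and a non-empty body."""
    return (
        result.status == 200
        and result.elapsed_s <= max_seconds
        and result.body_bytes >= min_bytes
    )


if __name__ == "__main__":
    # Placeholder URL; substitute an application your users depend on.
    result = probe_http("https://example.com/")
    print(result, "healthy" if is_healthy(result) else "DEGRADED")
```

Run on a schedule, a probe like this gives you response time, payload size, and status code for each transaction, which are the three measurements the paragraph above calls out.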

Numbers Can Lie - One of the unwritten rules of visibility tools is to use the IP address, not the DNS name, in setting up pollers and monitoring. I mean, we’re networkers, right? IP addresses are the real source of truth when it comes to path selection and performance. While there is some level of wisdom to this, it omits part of the bigger picture and can lead you astray. Administrators may regularly use IP addresses to connect to and utilize the systems we run, but that is rarely true for our users, and DNS is often a contributing cause of outages and performance issues. Speaking again to services that reside far outside of our physical premises, the DNS picture can get even more complicated depending on the perspective and path you are using to access those services. Keep that in mind and use synthetic transactions to query your significant name entries, but also set up some pollers that use the DNS system to resolve the address of target hosts to ensure both name resolution and direct IP traffic are seeing similar performance characteristics.
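One way to picture the comparison is a small Python sketch that times name resolution on its own and then compares a name-based measurement with a direct-IP measurement. The function names, the tolerance, and the hostname are hypothetical; `socket.getaddrinfo` is the standard library resolver call, used here through whatever resolver the operating system is configured with.

```python
import socket
import time


def timed_resolve(hostname, port=443, resolver=socket.getaddrinfo):
    """Resolve `hostname` and return (sorted addresses, seconds taken).

    `resolver` is injectable so the function can be tested offline.
    A failed lookup returns an empty address list rather than raising.
    """
    start = time.perf_counter()
    try:
        infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        return [], time.perf_counter() - start
    addresses = sorted({info[4][0] for info in infos})
    return addresses, time.perf_counter() - start


def compare_paths(by_name_s, by_ip_s, tolerance_s=0.1):
    """Compare a name-based response time with a direct-IP response time."""
    gap = by_name_s - by_ip_s
    if gap > tolerance_s:
        return "name-based access is slower: check DNS"
    if gap < -tolerance_s:
        return "direct IP access is slower: the name may resolve elsewhere"
    return "similar"


if __name__ == "__main__":
    # Placeholder hostname; use a name your users actually type.
    addresses, seconds = timed_resolve("example.com")
    print(addresses, f"{seconds:.3f}s")
```

Logging the resolved addresses alongside the timing is useful in its own right, since a name that resolves differently from different locations is exactly the kind of thing an IP-only poller never sees.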

Perspective Matters - It’s always been true, but where you test from is often just as important as what you test. Traditionally our polling systems are centrally located and close to the things they monitor. Proverbially they act as the administrator walking into the room to check on things, except they just live there all the time. This placement makes a lot of sense in a hub-style design, where many offices may come back to a handful of regional hubs for computing resources. But, yet again, cloud infrastructure is changing this in a big way. Many organizations offload Internet traffic at the branch, meaning access to some of your resources may be happening over the Internet and some may be happening over your WAN. If this is the case, it makes way more sense to be monitoring from the user’s perspective, rather than from your data centers. There are some neat tools out there to place small and inexpensive sensors all over your network, giving you the opportunity to see the network through many different perspectives and giving you a broader view of network performance.
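To make the point concrete, suppose each sensor reports response times for the same service. A minimal Python sketch, with a hypothetical function name, made-up site names, and an arbitrary factor of 2, can highlight the locations whose view differs from the rest:

```python
from statistics import median


def slow_vantage_points(latency_ms_by_site, factor=2.0):
    """Return {site: median latency} for sites that look unusually slow.

    A site is reported when its median latency exceeds `factor` times the
    median of all per-site medians. Sites with no samples are ignored.
    """
    site_medians = {
        site: median(samples)
        for site, samples in latency_ms_by_site.items()
        if samples
    }
    if not site_medians:
        return {}
    overall = median(site_medians.values())
    return {
        site: value
        for site, value in site_medians.items()
        if value > factor * overall
    }


if __name__ == "__main__":
    # Hypothetical measurements (milliseconds) for one service.
    measurements = {
        "datacenter": [10, 12, 11],
        "branch-a": [14, 15, 13],
        "branch-b": [90, 110, 100],
    }
    print(slow_vantage_points(measurements))
```

In this made-up data the data center sensor looks perfectly healthy, which is the trap: only the branch-level view shows that one group of users is having a very different experience.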

Final Thoughts

While the tactics and targets may change over time, the same rules that have always applied still apply. Our visibility systems are only as good as the foresight we have into what could possibly go wrong. With virtualization, cloud, and abstraction playing larger roles in our networks, it’s more important than ever to have a clear picture of what it is in our infrastructure that we should be looking for. Abstraction reduces the complexity presented to the end user but, in the end, all it is doing is hiding the complexity that’s always existed. Typically, abstraction actually increases overall system complexity. And as our networks and systems become ever more complex, it takes more thought and insight into “how things really work” to make sure we are looking for the right things, in the right places, to confidently know the state of the systems we are responsible for.

Top Comments

sja over 6 years ago +1

If VNQM will get better attention.... There are many good use cases for user experience metrics... open standard: IEEE 802.1ag, ITU-T Y.1564, TWAMP, RFC 2544, RFC 6349 - TCP Throughput…

byrona over 6 years ago

While nothing new here, I think this does a great job building a foundation to start with. As I just mentioned in another post, we need to build our networks with a great integrated toolset to provide us with the necessary visibility to manage and audit on a continuous basis. I think it's a good exercise to talk through and document all the different data points you want to have on your network/services. Talk through different failure modes and what data you will want in order to identify, troubleshoot, and perform root cause analysis. Use this documentation as your requirements list for your monitoring/tools infrastructure.
zennifer over 6 years ago

I have come back to your article 3 times now jordan.martin
There are a couple of objectives that I am attempting to accomplish in my current job. I continue to reference monitor reality. I am adding your document to my repository.
I am doing a demo with Dark Trace right now and I really need this tool; seems like the last complement to the network right now. I need governance, security; you can properly form documentation when you know what is going on.
Thanks for the article jordan.martin you have made an impression! I appreciate you taking the time and effort to make this post!
tallyrich over 6 years ago

Agreed.
Jfrazier over 6 years ago

Even though you start with an initial baseline, things change and you need to reset the baseline from time to time.
What is the new norm? Then you can look at how the norms have changed over time, which helps you plan for scalability.
tallyrich over 6 years ago

Two things that so often get "missed":
Baselines
Documentation
Without baselines, how do you know if a system/network/application or anything else is performing better or worse? Without baselines we are left with a reactive rather than a proactive response.
Documentation - get it in writing, or a spreadsheet, or chiseled in stone, or something, anything; just get it documented. So often our baselines are "Well, when I started here that was a clear 10 milliseconds."
With all that, make sure that it all makes sense. Just as the article stated, there are various aspects, and data is just that: data. It's meaningless until used and/or evaluated.