Path Monitoring In A World Of Overlays

Level 11

I remember the simpler days. Back when our infrastructure all lived in one place, usually just in one room, and monitoring it could be as simple as walking in to see if everything looked OK. Today’s environments are very different, with our infrastructure distributed all over the planet and much of it not even something you can touch with your hands. So, with ever-increasing levels of abstraction introduced by virtualization, cloud infrastructure, and overlays, how do you really know that everything you’re running is performing the way you need it to? In networking this can be a big challenge, as we often solve technical problems by abstracting the physical path from the routing and forwarding logic. Sometimes we do this multiple times, with overlays existing within overlays that all run over the same underlay. How do you maintain visibility when your network infrastructure is a bunch of abstractions? It’s definitely a difficult challenge, but I have a few tips that should help if you find yourself in this situation.

Know Your Underlay - While all the fancy and interesting stuff is happening in the overlay, your underlay acts much like the foundation of a house. If it isn’t solid, there is no hope for everything built on top of it to run the way you want it to. Traditionally, underlay monitoring has been done with polling and traps, but the networking world is evolving, and newer systems enable real-time information gathering (streaming telemetry). Collecting both old and new styles of telemetry and looking for anomalies will give you a picture of the performance of the individual components that comprise your physical infrastructure. Problems in the underlay affect everything, so this should be the first step you take, and the one you’re most likely familiar with, to ensure your operations run smoothly.
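
As a rough illustration of "looking for anomalies" in polled data, here is a minimal Python sketch. It assumes you already have (timestamp, counter) samples for something like interface errors; the function names, the five-sample window, and the threshold are all made up for this example and are not taken from any particular monitoring product.

```python
from statistics import mean, pstdev


def counter_rates(samples):
    """Turn (timestamp_s, counter_value) polls into per-second rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        # Skip bad timestamps and counter resets/wraps instead of guessing.
        if t1 <= t0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


def find_anomalies(rates, window=5, threshold=3.0, min_sigma=1.0):
    """Flag rates that sit far outside the previous `window` samples."""
    anomalies = []
    for i in range(window, len(rates)):
        history = [rate for _, rate in rates[i - window:i]]
        mu = mean(history)
        # Floor the spread so a perfectly flat history does not divide by zero.
        sigma = max(pstdev(history), min_sigma)
        timestamp, value = rates[i]
        if abs(value - mu) / sigma > threshold:
            anomalies.append((timestamp, value))
    return anomalies
```

Feeding it a counter that grows by 60 every minute and then jumps by 6000 would flag only the final poll. A real deployment would need to tune the window and threshold per metric.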

Monitor Reality - Polling and traps are good tools, but they don’t tell us everything we really need to know. Discarded frames and interface errors may give us concrete proof of an issue, but they give no context to how that issue is impacting the services running on your network. Additionally, with more and more services moving to IaaS and SaaS, you don’t necessarily have access to operational data on third party devices. Synthetic transactions are the key here. While it may sound obvious to server administrators, it might be a bit foreign for network practitioners. Monitor the very things your users are trying to do. Are you supporting a web application?  Regularly send an HTTP request to the site and measure response time to completion. Measure the amount of data that is returned. Look for web server status codes and anomalies in that transaction. Do the same for database systems, and collaboration systems, and file servers… You get the idea. This is the proverbial canary in a coal mine and what lets you know something is up before the users end up standing next to your desk. The reality is that network problems ultimately manifest themselves as system issues to the end users, so you can’t ignore this component of your network.
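
To make the web application example concrete, here is a small Python sketch of a synthetic HTTP check using only the standard library. It records the status code, the time to read the full response, and the number of bytes returned. The `CheckResult` structure, the `is_healthy` thresholds, and the injectable `opener` parameter are assumptions made for this example; treat it as a starting point rather than a finished poller.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    status: Optional[int]   # None when no HTTP response was received
    elapsed_s: float        # time until the whole body was read
    body_bytes: int
    error: Optional[str] = None


def run_http_check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch `url` once and record status, duration, and payload size."""
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
            status = response.status
        return CheckResult(url, status, time.perf_counter() - start, len(body))
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx; keep the status code for the record.
        return CheckResult(url, exc.code, time.perf_counter() - start, 0, str(exc))
    except OSError as exc:
        # Covers URLError, timeouts, refused connections, and DNS failures.
        return CheckResult(url, None, time.perf_counter() - start, 0, str(exc))


def is_healthy(result, max_elapsed_s=2.0, min_bytes=1):
    """Example policy: HTTP 200, fast enough, and a non-empty body."""
    return (
        result.status == 200
        and result.elapsed_s <= max_elapsed_s
        and result.body_bytes >= min_bytes
    )
```

Run on a schedule against the pages your users actually load, something like `is_healthy(run_http_check("https://intranet.example.com/login"))` (a placeholder URL) gives you the canary described above. The `opener` argument exists only so the check can be exercised without touching the network.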

Numbers Can Lie - One of the unwritten rules of visibility tools is to use the IP address, not the DNS name, in setting up pollers and monitoring. I mean, we’re networkers, right? IP addresses are the real source of truth when it comes to path selection and performance. While there is some level of wisdom to this, it omits part of the bigger picture and can lead you astray. Administrators may regularly use IP addresses to connect to and utilize the systems we run, but that is rarely true for our users, and DNS is often a contributing cause of outages and performance issues. For services that reside far outside of our physical premises, the DNS picture can get even more complicated depending on the perspective and path you are using to access those services. Keep that in mind and use synthetic transactions to query your significant name entries. Also set up some pollers that resolve the target host’s address through DNS, so you can confirm that name-based and direct IP traffic see similar performance characteristics.
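
One way to sketch the DNS side of this in Python is to time name resolution and compare the answers with the IP addresses your existing pollers already target. The function names and the returned dictionary layout below are invented for illustration, and `resolver` is injectable only so the logic can be tried without a live DNS server.

```python
import socket
import time


def resolve_timed(hostname, port=443, resolver=socket.getaddrinfo):
    """Resolve `hostname` and report how long it took and what came back."""
    start = time.perf_counter()
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return {
            "hostname": hostname,
            "ok": False,
            "elapsed_s": time.perf_counter() - start,
            "addresses": [],
            "error": str(exc),
        }
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return {
        "hostname": hostname,
        "ok": True,
        "elapsed_s": time.perf_counter() - start,
        "addresses": addresses,
        "error": None,
    }


def compare_with_expected(dns_result, expected_ips):
    """Show where DNS answers differ from the IPs the pollers use."""
    resolved = set(dns_result["addresses"])
    expected = set(expected_ips)
    return {
        "resolved": dns_result["ok"],
        "unexpected": sorted(resolved - expected),
        "missing": sorted(expected - resolved),
    }
```

A slow or failing `resolve_timed` result next to a healthy IP-based poll is exactly the kind of gap that IP-only monitoring hides.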

Perspective Matters - It’s always been true, but where you test from is often just as important as what you test. Traditionally our polling systems are centrally located and close to the things they monitor. Proverbially they act as the administrator walking into the room to check on things, except they just live there all the time. This design makes a lot of sense in a hub style design, where many offices may come back to a handful of regional hubs for computing resources. But, yet again, cloud infrastructure is changing this in a big way. Many organizations offload Internet traffic at the branch, meaning access to some of your resources may be happening over the Internet and some may be happening over your WAN. If this is the case it makes way more sense to be monitoring from the user’s perspective, rather than from your data centers. There are some neat tools out there to place small and inexpensive sensors all over your network, giving you the opportunity to see the network through many different perspectives and giving you a broader view on network performance.
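
If you do end up with sensors in several places, the results are most useful when grouped by where they were taken. The Python sketch below assumes each measurement is a hypothetical (vantage point, target, latency in ms) tuple; the names and the "twice the best median" rule are arbitrary choices for the example.

```python
from collections import defaultdict
from statistics import median


def summarize_by_vantage(measurements):
    """Median latency per (target, vantage) from (vantage, target, latency_ms) tuples."""
    grouped = defaultdict(list)
    for vantage, target, latency_ms in measurements:
        grouped[(target, vantage)].append(latency_ms)
    return {key: median(values) for key, values in grouped.items()}


def slow_vantages(summary, target, ratio=2.0):
    """Vantage points whose median for `target` exceeds `ratio` times the best one."""
    per_vantage = {v: m for (t, v), m in summary.items() if t == target}
    if not per_vantage:
        return []
    best = min(per_vantage.values())
    return sorted(v for v, m in per_vantage.items() if m > best * ratio)
```

A target that looks fine from the data center but shows up in `slow_vantages` for one branch points at that branch's path, not at the application.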

Final Thoughts

While the tactics and targets may change over time, the same rules that have always applied still apply. Our visibility systems are only as good as the foresight we have into what could possibly go wrong. With virtualization, cloud, and abstraction playing larger roles in our networks, it’s more important than ever to have a clear picture of what it is in our infrastructure that we should be looking for. Abstraction reduces the complexity presented to the end user but, in the end, all it is doing is hiding the complexity that’s always existed. Typically, abstraction actually increases overall system complexity. And as our networks and systems become ever more complex, it takes more thought and insight into “how things really work” to make sure we are looking for the right things, in the right places, to confidently know the state of the systems we are responsible for.

15 Comments

I think a key point to this article, and this really isn't new, is that there are multiple parties involved. Usually in the form of vendors. So when things go bump in the night and you as the customer aren't sure where the problem lies then it is on you to pin the right one down and hold them responsible for fixing.

You had me in the first paragraph where you wrote " . . . as simple as walking in to see if everything looked OK."

I recall a time when a site's WAN communications failed, and I happened to know another department's IT member was on-site.  I called his cell and asked him to look in on the network room--did everything look and sound and smell normal?

He went into the room, stood in front of the switches, and reported back calmly:  "Yes, everything's up and running.  The green lights on the switches are flashing, no red lights, nothing out of the ordinary."

I thanked him and then went on to troubleshooting other aspects, since the switch (which was a Layer 3 collapsed Distribution/Access switch) was "obviously OK." 

When I was done with Occam's Razor I got in the car and drove to the site to see for myself.  I stood in the same spot as my coworker and interpreted things quite differently.  Yes, the green lights were blinking on the 392-port chassis switch.  But they ALL BLINKED ON AND OFF SIMULTANEOUSLY. 

Folks with switch/router experience will realize that this switch could no longer see the WAN, that its devices were broadcasting like crazy trying to find a path to the rest of the network.

A quick console session into the chassis revealed the problem, and bouncing a WAN port that had become disabled due to a broadcast storm on the VPLS environment brought the site back up.

Seeing isn't understanding until you have the training and experience to correctly interpret what is seen.

MVP

Ah...knowing what is normal is key to knowing what is abnormal, not to mention perspective (point of view) based on knowledge and experience.

Level 13

good to be reminded of the basics sometimes.

MVP

Nice article

Level 20

This is part of the reason why in addition to Orion we're buying into Riverbed to dive deeper into packets.

Level 16

If VNQM will get better attention....

There are many good use cases for user experience metrics...

Open standards

IEEE 802.1ag - Wikipedia

ITU-T Y.1564 - Wikipedia

TWAMP

RFC2544

RFC 6349 - TCP Throughput Testing Methodology

Vendor proprietary

Juniper RPM

Level 13

Good info

MVP

I could not have said that any better!!!

I am totally about the lights, feel, smell!!! They should add that to the OSI model!!

You have to be in tune with your equipment, very easy to say that all the lights are on... simultaneously blinking .. WOW ... that was not good!

Thanks for sharing rschroeder Expert and...  experience is knowing what is normal !!!  Jfrazier

MVP

Thanks ... just voted!

MVP

Two things that so often get "missed"

Baselines

Documentation

Without baselines, how do you know if a system/network/application or anything else is performing better or worse? Without baselines we are left with a reactive rather than proactive response.

Documentation - get it in writing, or a spreadsheet or chiseled in stone or something, anything just get it documented. So often our baselines are "Well, when I started here that was a clear 10 milliseconds"

With all that, make sure that it all makes sense. Just as the article stated, there are various aspects, and data is just that: data. It's meaningless until used and/or evaluated.

MVP

Even though you start with an initial baseline, things change and you need to reset the baseline from time to time.
What is the new norm ?  Then you can look at how the norms have changed over time which helps you with scalability over time.

MVP

Agreed.

MVP

I have come back to your article 3 times now jordan.martin

There are a couple of objectives that I am attempting to accomplish in my current job.  I continue to reference monitor reality.  I am adding your document to my repository. 

I am doing a demo with Dark Trace right now and I really need this tool; seems like the last complement to the network right now.  I need governance, security; you can properly form documentation when you know what is going on.

Thanks for the article  jordan.martin  you have made an impression!  I appreciate you taking the time and effort to make this post!

Level 21

While nothing new here, I think this does a great job building a foundation to start with.  As I just mentioned in another post, we need to build our networks with a great integrated toolset to provide us with the necessary visibility to manage and audit on a continuous basis.  I think it's a good exercise to talk through and document all the different data points you want to have on your network/services.  Talk through different failure modes and what data you will want in order to identify, troubleshoot, and perform root cause analysis.  Use this documentation as your requirements list for your monitoring/tools infrastructure.