
Geek Speak

16 Posts authored by: ghostinthenet

I really want Software Defined Networking (SDN), or something like it, to be the go-to approach for networking, but are we too tied to our idea of what SDN is for us to get there?


The Definition


Nearly ten years ago, in 2009, Kate Greene coined the term Software-Defined Networking in an article describing the newly created OpenFlow specification that would be released later that year. The idea was revolutionary: decouple the forwarding plane from the control plane and move the latter to a centralized controller. The controller would then manage the forwarding plane of the individual devices in the network from a global perspective. This would allow the entire network to be managed via a single interface to the controller. For some time following this, SDN became synonymous with OpenFlow, but the philosophy has since outgrown the implementation.


A Cloud Technology?


In an admittedly questionable Wikipedia page, SDN is defined as "an approach to cloud computing that facilitates network management and enables programmatically efficient network configuration in order to improve network performance and monitoring." This is an interesting perspective, considering that OpenFlow appears to have been developed with large service provider networks in mind. So where does it go from being a service provider technology to a cloud technology? Large service providers and cloud (particularly public cloud) providers have one thing in common: scale.


In previous articles, I've discussed network automation in the cloud as a requirement rather than a desired state. Arguably, large networks of any sort share this property. When working at scale, there really isn't any other way to do things effectively.


This, of course, doesn't mean that the approach isn't desirable outside of large-scale environments. Still, need drives progress and the market focuses on the need.


Silo Busting


Since I began my career in networking (too) many years ago, technologies have been placed in seemingly arbitrary categories, and vendors have tended to develop equipment with feature sets that follow these silos. Invariably, there's bleed from one category to another when new requirements surface. So why are we maintaining these categories in the first place? Networking is networking. If the solution for an enterprise business requirement is traditionally a data centre networking or service provider networking technology, use it.


For many years, the IS-IS routing protocol was considered a service provider technology. Now, with its ability to handle IPv4 and IPv6 under a single routing architecture, it's seeing a resurgence in the enterprise.


MPLS VPNs have mostly stayed in the service provider category, but are increasingly seen in enterprise networks at organizations that need to support franchise network connectivity over the parent organization's network.


Shortest Path Bridging (SPB) was developed as a data centre networking technology, but is arguably an ideal replacement for Spanning Tree Protocol (STP) in general.


We need to think beyond the silos and look at networking as networking if we're going to escape the current state of micromanaging equipment. This means bringing SDN out of the cloud and service provider categories.


Delegation of Control


One of the key concerns about SDN that I've heard over the years is the problem of relying on a controller (or cluster of controllers) to make forwarding decisions. This approach is really good for standard routing and network functions that can be addressed globally. It falls down a bit when it comes to things like security policies at the edge, policy-based routing, and other exception-based items that are device-centric rather than network-centric.


Can we have an SDN architecture where the control plane is still distributed, but managed at the controller? Is it still SDN? The purists may argue, but in the same vein as the silos above, it doesn't really matter. We may need another term for it, but SDN can work for now, and here's why.


An Imperfect Dream


When I first considered writing this article, I was running under the working title of "When SDN Isn't" because I was frustrated with the number of solutions that purported to be SDN, but really weren't for various reasons. Some of them did not centralize the control plane under a controller. Others didn't provide open northbound APIs into their controller. Now I'm starting to think it's time to expand the practical definition a bit.


At its core, SDN works by allowing software to define requirements to the controller via a northbound API. The controller then programs the component devices or virtual devices via a southbound API. Taking the actual term Software Defined Networking literally, these are the key requirements.
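The northbound/southbound split described above can be sketched in a few lines. Everything here is invented for illustration — the intent shape, the topology structure, and the rule format are assumptions, not any real controller's API:

```python
# Hypothetical sketch of the SDN control split: software states *what* it
# wants via a "northbound" intent, and a controller translates that into
# per-device "southbound" rules. All names here are illustrative; real
# controllers (OpenDaylight, ONOS, etc.) have very different interfaces.

def compile_intent(intent, topology):
    """Turn one high-level intent into per-device forwarding rules."""
    rules = {}
    for device, ports in topology.items():
        rules[device] = [{
            "match": {"dst": intent["destination"]},
            "action": {"out_port": ports["uplink"]},
            "priority": intent.get("priority", 100),
        }]
    return rules

# Northbound: the application only declares the destination and priority...
intent = {"destination": "10.1.2.0/24", "priority": 200}
# ...and the controller, which knows the topology, programs the *how*.
topology = {"sw1": {"uplink": 1}, "sw2": {"uplink": 3}}

rules = compile_intent(intent, topology)
```

The point of the sketch is the division of labour: nothing above the controller ever mentions a port number.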


If the component devices are programmed at the flow level by a controller that has the entire control plane centralized, and it meets the needs of the organization, that's awesome. If those devices have their own control planes and their decision making is defined at a higher level by the controller, that's just great too, again, so long as it meets the needs of the organization.


The Whisper in the Wires


SDN, or a relaxed definition of it, has the potential to be the holy grail of networking in general, but we're still stuck thinking in networking silos: cloud, data centre, service provider, enterprise, small/medium business, etc. What we want is a central and programmable interface to the entire network and to stop micromanaging devices. How that is accomplished below the controller level should be immaterial.

The Dream of the Data Center


For me, it started with OpenStack. I was at a conference a number of years ago listening to Shannon McFarland talking about using OpenStack to bring programmatic Network Functions Virtualization (NFV) into the data center. My own applications are much more small-scale, but the idea was captivating from the beginning. Since then, many other approaches to this have come into play, but all of them share that single idea of programmatic control over large-scale installations.


The thing about data center architectures is that they need automation. It's not an optional thing or a nice-to-have item. The human resources required to maintain those systems the way most of us maintain our networks just don't make financial sense. They're not all that efficient in smaller networks and are particularly ineffective at scale. Necessity is the mother of invention, and that is what built the network automation infrastructure we see in large-scale DC deployments.


For smaller deployments, automation makes things easier and takes the drudge work out of the job. It's something we want, but not something we can always justify. Still, a guy can dream.


Automation at the Device Level (NETCONF/YANG)


Meanwhile, back in the real world of smaller networks and device-centric configurations, we're trying to make things easier as best we can. We've got NETCONF interfaces for programmatic control, and YANG models to use as templates for how things should be. Some of us are using tools like Ansible and SaltStack to go beyond device-by-device configurations, but we're still focused on the devices.
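As a concrete sketch of that device-level programmability, here's one way to build a NETCONF `<edit-config>` payload against the standard ietf-interfaces YANG model using only the Python standard library. Actually sending it over the wire (for example with ncclient) is omitted, and the interface details are placeholders:

```python
# Sketch: build the <config> subtree for a NETCONF <edit-config> that sets
# an interface description, modeled on the standard ietf-interfaces YANG
# module. Transport (SSH/ncclient) is deliberately left out; the interface
# name and description below are made-up examples.
import xml.etree.ElementTree as ET

NS_NC = "urn:ietf:params:xml:ns:netconf:base:1.0"
NS_IF = "urn:ietf:params:xml:ns:yang:ietf-interfaces"

def interface_description_config(name, description):
    """Return the serialized <config> subtree as a string."""
    config = ET.Element(f"{{{NS_NC}}}config")
    interfaces = ET.SubElement(config, f"{{{NS_IF}}}interfaces")
    interface = ET.SubElement(interfaces, f"{{{NS_IF}}}interface")
    ET.SubElement(interface, f"{{{NS_IF}}}name").text = name
    ET.SubElement(interface, f"{{{NS_IF}}}description").text = description
    return ET.tostring(config, encoding="unicode")

payload = interface_description_config("GigabitEthernet0/1", "uplink to core")
```

The YANG model acts as the template the article mentions: the structure of the XML is dictated by the model, not by us.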


I'm not sure if this is due to the unwillingness of network engineers to change our paradigm of thinking from the devices to the network as a whole, or if it's the vendors creating equipment that interacts with the network only from its own perspective. It may well be that each feeds the other, creating a vicious cycle that's difficult to break.


If the necessity isn't there, where's the need for invention?


Commoditization and Virtualization (NFV)


As virtual machine technology began to become more common in smaller enterprises, the option of virtualizing all of the things became more appealing. If we're saving money and making more efficient use of resources by virtualizing server loads, why wouldn't we consider virtualizing some of our network infrastructures, too?


With Network Functions Virtualization, we came full circle to the dream that began with that OpenStack presentation. If the network, or at least portions of it, could be addressed programmatically like the other virtual machines, we were getting closer.


Were we dreaming too small?


Systemic Networking (SDN)


Even with NFV and the ability to use cloud and DC automation tools to provision and configure our virtual routers and switches, we're still being traditional network engineering greybeards and thinking in terms of devices rather than in terms of the entire network.


Enter Software Defined Networking, where we theoretically see the network as a programmable whole. The virtual components and the physical components share a single southbound API from a set of central controllers and the whole thing can be programmed through there.


Of course, depending on whose definition of SDN our products are working with, this may or may not be a complete solution, but that's a topic for another article.


Once this becomes commoditized, we theoretically have all of the tools to automate the network from a holistic perspective, but do we have an automation framework that will work equally well for all of the components in the platform?


The Whisper in the Wires


We have what it takes to virtualize and automate most of the network, making automation via central controllers a workable option. We can use one framework to deploy, provision, and automate the lot, right? Here's where I'm not quite sure. Even if we have a good strategy for our NFV devices and/or SDN controllers and their satellite devices, do we have a single framework that we can use to handle the deployment and management of the lot?

When I first began this post, my thinking revolved around the translation problems of a declarative approach to network operation, but having only procedural interfaces to work with. Further thought dismissed that assumption and led to thinking about how declarative models evolve to meet organizational needs.


I Declare!


Many articles have been written about how declarative models define what we want, while procedural models focus on how we want to accomplish this. Overall network policy should be declarative, while low-level device management should be procedural, especially when dealing with older platforms, right? Well, it's almost right.


Network device configuration and management have, for the most part, been declarative for some time. We don't tell switches how spanning tree works, nor how to put VLAN tags on Ethernet frames. We plug in the necessary parameters to tune these functions to our specific needs, but the switch already knows how to do these things. Even old-fashioned device-level configurations are declarative operations. Admittedly, they're not high-level functions and need some tweaking and babysitting, but they're still declarative.
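A toy example of that declarative flavour: we declare the VLANs we want, and a reconciliation function derives the procedural steps. The command strings are generic IOS-style placeholders, not tied to any particular platform:

```python
# Illustrative reconciliation: declarative desired state in, procedural
# steps out. The command syntax is a generic IOS-style placeholder.

def reconcile_vlans(desired, running):
    """Compare desired state to running state and emit the commands needed."""
    to_add = sorted(desired - running)
    to_remove = sorted(running - desired)
    commands = [f"vlan {v}" for v in to_add]
    commands += [f"no vlan {v}" for v in to_remove]
    return commands

desired = {10, 20, 30}   # what we want
running = {10, 40}       # what the switch currently has
cmds = reconcile_vlans(desired, running)  # adds 20 and 30, removes 40
```

We never tell the switch how to trunk a VLAN; we only close the gap between what is and what should be.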


Let's explore the timeline of where we've been, where we are, and where we're potentially going.


Procedural Development


Back in the primordial days of computing (okay, perhaps not that far back), when I was first learning to code, everything was procedural. We joked about the level of detail required to accomplish the simplest of things. What we really needed was the DWIT (Do What I'm Thinking) instruction. Alas, this particular directive still hasn't made it into any current coding environments. Developers have accepted this for the most part, and the people who are making the decisions are usually happy to let them handle it rather than trying to figure out how it all works at a low level. The development team became the buffer between the desired state and the means by which it is achieved.


Device-Level Configuration


Network device management is a step up in the abstraction ladder from procedural coding, but is still seen as a detailed and arcane process to those who aren't experienced with it. Like the developers, the networking team became the buffer, but this time it was between a high-level desired state and the detailed configuration of same. The details became a little less daunting to look at and more input came from the people making the business decisions.


More and more, the networking team had to be aware of the greater business cases for what we were doing so that we could better align with the overall goals of the business. We implemented business policy in the language of the network, even if we weren't fully aware of it at the time.


Large-Scale Models

Now, we're slowly beginning to move to full-scale network automation and orchestration. We can potentially address the network as a single process, allowing our organization's policies to map directly to the ongoing operation of the network itself. It's still pretty messy underneath the abstraction layers, but there's a lot of potential for the various automation/orchestration tools that are being developed to clean that up.


The Whisper in the Wires


Our network architecture should line up with our business processes. We've always been aware of this, but delivering on it has been difficult, mostly because we had the right idea but on too small a scale. New tools and models are being developed and refined, making it possible to deliver on it at last, but it's up to us to embrace these and push them to their limits. The value of the network to our organizations depends on it. By extension, our value to our organizations depends on it.


I don't think we're ever going to reach a point where that old DWIT instruction is a reality, but is it too much to hope that we're getting closer?

Are configuration templates still needed with an automation framework? Or does the automation policy make them a relic of the past?


Traditional Configuration Management: Templates


We've all been there. We always keep a backup copy of a device's configuration, not just for recovery, but to use as a baseline for configuring similar devices in the future. We usually strip out all of the device-specific information and create a generic template so that we're not wasting time removing the previous device's details. Sometimes we'll go so far as to embed variables so that we can use a script to create new configurations from the template and speed things up, but the principle remains the same: we have a configuration file for each device, based on the template of the day. The principle has worked for years, complicated only by that last bit. When we change the template, it's a fair bit of work to regenerate the initial configurations to comply, especially if changes have been made that aren't covered by the template, which almost always happens because we're manually making changes to devices as new requirements surface.
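That template-plus-variables workflow can be as simple as the Python standard library's string.Template. The snippet below is a minimal, made-up example of the idea; real templates would carry far more of the device's configuration:

```python
# Minimal sketch of template-driven configuration generation using only
# the standard library. The template content is an invented IOS-style
# fragment for illustration.
from string import Template

TEMPLATE = Template("""\
hostname $hostname
interface Vlan1
 ip address $mgmt_ip $mgmt_mask
""")

def render_config(hostname, mgmt_ip, mgmt_mask):
    """Fill the generic template with one device's specifics."""
    return TEMPLATE.substitute(
        hostname=hostname, mgmt_ip=mgmt_ip, mgmt_mask=mgmt_mask
    )

cfg = render_config("branch-sw1", "192.0.2.10", "255.255.255.0")
```

Tools like Jinja2 (used by Ansible) take the same idea much further, but the principle is identical.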


Modern Configuration Management: Automation Frameworks


Automation (ideally) moves us away from manual changes to devices. There's no more making ad hoc changes to individual devices and hoping that someone documents it sufficiently to incorporate it into the templates. We incorporate new requirements into the automation policy and let the framework handle it. One of those requirements is usually going to be a periodic backup of device configurations so that a current copy is available to provision new devices, which amounts to using automation to create static configurations.


One Foot Forward and One Foot Behind


Templates and the custom configuration files built from them are almost always meant to serve as initial configurations. Once they're deployed and the device is running, their role is finished until the device is replaced or otherwise needs to be configured from scratch.


The automation framework, on the other hand, plays an active role in the operation of the network while each device is running. Until the devices are online and participating in the network, automation can't really touch them directly.


This has led to a common practice in which both approaches are in play: templates are built for initial configuration, and automation is used for ongoing application of policy.


Basic Templates via Automation Framework


Most organizations I've worked with keep their templates and initial device configurations managed via some sort of version control system. Some will even go so far as to have their device configurations populated to a TFTP server for new systems to be able to boot directly from the network. No matter how far we take it, this is an ideal application for automation, too.


We can use automation to apply policies to templates or configuration files in a version control system or TFTP repository just as easily as (possibly even more easily than) we can to a live device. This doesn't need to be complex. Creating initial configuration files with the automation framework so that new devices are reachable and identifiable is sufficient. Once the device is up and running, the policy is going to be applied immediately, so there's no need to have more than the basics in the initial configuration.
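A minimal sketch of that approach, assuming a simple hostname-to-management-IP inventory: the framework renders just-enough bootstrap configs into a directory that version control or a TFTP server could pick up. All names and addresses here are illustrative:

```python
# Sketch: generate minimal bootstrap configs (reachable and identifiable,
# nothing more) into a repository directory. Inventory shape and config
# content are illustrative assumptions, not any particular framework's.
from pathlib import Path
import tempfile

INVENTORY = {
    "core-sw1": "192.0.2.1",
    "edge-sw1": "192.0.2.2",
}

def write_bootstrap_configs(inventory, repo_dir):
    """Write one <hostname>.cfg file per device; return the paths written."""
    paths = []
    for hostname, mgmt_ip in inventory.items():
        cfg = (
            f"hostname {hostname}\n"
            f"interface Vlan1\n"
            f" ip address {mgmt_ip} 255.255.255.0\n"
        )
        path = Path(repo_dir) / f"{hostname}.cfg"
        path.write_text(cfg)
        paths.append(path)
    return paths

# Using a temporary directory as a stand-in for the VCS/TFTP repository:
with tempfile.TemporaryDirectory() as repo:
    files = write_bootstrap_configs(INVENTORY, repo)
```

Everything beyond reachability is left to the policy engine once the device comes online.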


The Whisper in the Wires


The more I work with automation frameworks, the more I'm convinced that all configuration management should be incorporated into the policy. Yes, some basic configurations may need to be pre-loaded to make the device accessible, but why maintain two separate mechanisms? The basic configurations can be managed just as easily as the functional devices can, so it's just an extra step. Is this something that those of you who are automating have already done, or are most of us still using one process for initial configuration and another for ongoing operation?

Is our contribution measured by what we're doing or how we're doing it? Are we providing value or are we just getting caught up in what's exciting? How do we ensure that we're seen as contributing despite appearing to be less busy? Do our efforts to automate sometimes take away from our value rather than adding?


Automate All of the Things!


Automation is based on the principle that our expertise should not be wasted on the manual execution of simple and repetitive tasks. Spending some extra time to offload these tasks and free ourselves up for more exciting undertakings just makes sense, right? Well, that depends on a few things.


The Human Factor (Working Hard, or Hardly Working?)


Perception is an easily-overlooked consideration. It's no secret that the work of IT professionals is sometimes seen as a bit arcane by our management and peers. There's a general understanding of what we do, but the details of how we get it done are often another story.


Are we being evaluated based on the value of the work that we do, or based on how busy we appear to be when we do it? If we minimize the daily grind, are we creating the impression that we're less valuable because we don't appear to be as busy? The answer to these questions is going to vary depending on our work environment, but it's important that we manage perception effectively from the beginning.


The Technical Factor (What's the Value Proposition?)


As IT professionals, we tend to be passionate about the work that we do, so it's really easy to get excited about coding our way out of the daily grind. When this happens, we sometimes let our excitement get the better of us and don't give due consideration to the real value of what it is that we're doing.


If we're spending a week to automate a task that previously wasted days each month, we have a reasonable return on our time investment. Yes, it might take a few months before we really see the benefits, but we will see them. On the other hand, if we're spending the same amount of time to address a task that only took a few minutes out of each month, we really have to start thinking about the efficient use of our time. We can be absolutely certain that our management and peers will be thinking about it if we're not.
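The arithmetic here is worth making explicit. A quick back-of-the-envelope payback calculation, using the hour figures above as rough assumptions:

```python
# Back-of-the-envelope payback period for an automation effort:
# hours invested up front versus hours saved per month.

def payback_months(hours_invested, hours_saved_per_month):
    """Months until time saved equals time invested (None if no savings)."""
    if hours_saved_per_month <= 0:
        return None
    return hours_invested / hours_saved_per_month

# A week (40 h) to automate a task that wasted two days (16 h) a month:
good_case = payback_months(40, 16)    # pays back in a few months
# The same week to automate a 15-minute-a-month task:
bad_case = payback_months(40, 0.25)   # pays back in over a decade
```

The numbers are invented, but the shape of the comparison is the point: the second case never survives scrutiny.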


The Whisper in the Wires


Are we approaching automation in the traditional way where we just develop a collection of scripts as we need them, or are we embracing a larger framework? At this point, I'm guessing that we have more of the former than the latter, but with a lot of push to take a more strategic approach. This is a good direction, but not if we're out of a job before we can get there because we didn't manage the expectations properly.


If we want to address concerns of perception and value proposition, we have to do more than script our pain away when we can. It's very difficult to manage either of these if we're addressing everything piecemeal. We need a consistent policy framework incorporated into a well-documented strategy, and we need to communicate that effectively to our peers.


A documented strategy addresses perception problems by providing a view into the process and setting expectations. A consistent policy framework keeps us from getting so caught up in our own custom script development that we fail to show value.

At what point does the frequency and volume of “it will only take a second to change” become too much to bear and force us to adopt a network automation strategy? Where is the greatest resistance to change? Is it in the technical investment required, or is it the habit of falling back to the old way of doing things because it's "easier" than the new way?


The Little Things


We all have those little tasks that we can accomplish in a heartbeat. They're the things we've done so many times that the commands required have almost become muscle memory, taking little to no thought to enter. They're the easy part of our jobs, right? Perhaps, but they can also be the most time consuming. There's a reason those commands have become so ingrained. We perform them far more than we should, but haven't necessarily figured that out yet... well, not until now anyway.


The solution? Network automation! Let's get all of those mind-numbingly simple day-to-day tasks taken care of by an automation framework so that we can free ourselves up for work that's actually challenging and rewarding. It's that easy! Or is it?


The Huge Amount of Work Required to Avoid a Huge Amount of Work


Automation, even with the best of tools, is a lot of work. That process itself is something that we will wish could be automated before we're done. There's the needs analysis; the evaluation and selection of an automation framework; training of staff to use it; and the building, documentation, and maintenance of the policies themselves.


When a significant portion of the drive for automation comes from overload, the additional technical workload of building an automation framework in parallel to the current way of doing things can be daunting. Yes, it's definitely a case of working smarter rather than harder, but that's still hard to swallow when we're buried in the middle of both.


The Cultural Shift


People are creatures of habit, especially when those habits are deeply ingrained. When all of those little manual network changes have reached the point that they can be done without real thought, we can be absolutely sure that they're deeply seated and aren't going to be easy to give up. The actual technical work to transition to network automation was only half of the challenge. Now we have to deal with changing people's thinking on the matter.


Here's a place where we really shouldn't serve two masters. If there isn't full commitment to the new process, the investment in it yields diminishing returns. The automation framework can make network operations far more efficient, but not if everyone is resistant to it and is continuing to do things the old way. There needs to either be incentive to adopt the new framework 100% or discouragement from falling into habitual behaviour. This could even represent a longer process than the technical side of things.


What Must Be Done


Neither the technical hurdles nor the human ones remove the ultimate need to automate. The long-term consequences of repeatedly wasting time on simple tasks, both to individuals' technical skills and job satisfaction and to the efficiency of the organization, make a traditional approach to networking unsustainable. This is especially true at any kind of scale. Growth and additional workload only serve to make the problem more apparent and the solution more difficult to implement. Still, there's no question that it needs to be done. The real questions revolve around how best to handle the transition.


The Whisper in the Wires


It's difficult to say what's harder: the technical transition to network automation itself, or ensuring that it becomes the new normal. By the time we reach a point where it becomes necessary, we may have painted ourselves into a corner with a piled-up workload that should have been automated in the first place. It also represents a radical change in how things are done, which is going to produce mixed reactions that have to be factored in.


For those of you who have automated your networks, whether in large installations or small, at what point did you realize that doing things the old way was no longer a viable option? What did you do to ensure a successful transition, both technically and culturally?



In previous weeks, we talked about application-aware monitoring, perspective monitoring and agent/responder meshes to get a decentralized view of how our network is functioning.


With our traditional network monitoring system (NMS), we have a device-level and interface-level view. That's becoming less and less true as modern software breaks the mould of tradition, but it's still the core of its functionality. Now we have added perspective monitoring and possibly some agent/responder monitoring to the mix. How do we correlate these so that we have a single source of meaningful information?


Maybe We Don't?


Describing the use of the phrase "Single Pane of Glass" (SPoG) in product presentations as "overused" is an understatement. The idea of bringing everything back to a single view has been the holy grail of product interface design for some time. This makes sense, as long as all of that information is relevant to what we need at the time. With our traditional NMS, that SPoG is usually the dashboard that tells us whether the network is operating at baseline levels or not.


Perspective monitoring and agent/responder meshes can gather a lot more data on what's going on in the network as a whole. We have the option of feeding all that directly into the NMS, but is that where we're going to get the best perspective?


Data Visualization


We're living in a world of big data. The more we get, the less likely it becomes that we will be able to consume it in a meaningful way. Historically, we have searched for the relevant information in our network and filtered out what isn't immediately relevant. Big data is teaching us that it's all relevant, at least when taken as a whole.


Enter log aggregators and data visualization systems. Most of the information that we're getting from our decentralized tools can be captured in such a way as to feed these systems. Instead of just feeding it into the NMS, we have the option of collecting all of this data into custom visuals. These can give us a single view not only of where the network is experiencing chronic problems, but of where we need to adjust our baselines.
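As a sketch of the kind of baseline check such a pipeline might run, here's a toy example that flags links whose recent average latency drifts well outside the historical baseline. The threshold and sample data are arbitrary stand-ins:

```python
# Toy baseline-drift check of the sort a visualization/alerting pipeline
# might apply to per-link latency samples. Threshold and data are
# illustrative, not recommendations.
from statistics import mean, stdev

def drifting_links(history, recent, threshold=3.0):
    """Return links whose recent mean latency exceeds baseline + threshold*stdev."""
    flagged = []
    for link, samples in history.items():
        baseline, spread = mean(samples), stdev(samples)
        if mean(recent[link]) > baseline + threshold * spread:
            flagged.append(link)
    return flagged

history = {"sw1-sw2": [10, 11, 9, 10, 10], "sw1-sw3": [20, 21, 19, 20, 20]}
recent = {"sw1-sw2": [10, 11, 10], "sw1-sw3": [35, 36, 34]}
alerts = drifting_links(history, recent)  # only sw1-sw3 has drifted
```

A real deployment would let the aggregator compute this over rolling windows; the logic stays the same.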


Whether we're looking at Elastic Stack, Splunk, Tableau, or other tools, the potential to capture the gestalt of our network's data and present it usefully is worthwhile.




What if there's something in all this that indicates unacceptable performance or a failure? Yes, that should raise alerts in our NMS.


This isn't an either/or thing. It's a complementary approach. There's no reason why the data from our various agents and probes can't feed both. Depending on the tool that's used, the information can even be forwarded directly from the visualizer, simplifying the collection process.


The Whisper in the Wires


Depending on what we’re looking for, there’s more than one tool for the job. Traditionally, we’re observing the network for metrics that fall outside of our baselines, particularly when those have catastrophic impact to operations. This is essential for timely response to immediate problems. Moving forward, we also want a bird’s eye view of how our applications and links are behaving, which may require a more flexible tool.


Has anyone else looked at implementing data visualization tools to complement their NMS dashboards?

Last week, we talked about monitoring the network from different perspectives. By looking at how applications perform from different points in the network, we get an approximation of the users' experience. Unfortunately, most of those tools are short on the details surrounding why there's a problem or are limited in what they can test.

On one end of our monitoring spectrum, we have traditional device-level monitoring. This is going to tell us everything we need to know that is device-specific. On the other end, we have the application-level monitoring discussed in the last couple of weeks. Here, we're going to approximate a view of how the end users see their applications performing. The former gives us a hardware perspective and the latter gives us a user perspective. Finding the perspective of the network as a whole still lies somewhere in between.

Using testing agents and responders on the network at varying levels can provide that intermediate view. They allow us to test against all manner of traffic, factoring in network latency and variances (jitter) in the same.

Agents and Responders

Most enterprise network devices have built-in functions for initiating and responding to test traffic. These allow us to test and report on the latency of each link from the device itself. Cisco and Huawei have IP Service Level Agreement (SLA) processes. Juniper has Real-Time Performance Monitoring (RPM) and HPE has its Network Quality Analyzer (NQA) functions, just to list a few examples. Once configured, we can read the data from them via Simple Network Management Protocol (SNMP) and track their health from our favourite network monitoring console.
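For illustration, a minimal Cisco IOS IP SLA udp-jitter probe might look something like the following. Treat this as a sketch: exact syntax varies by platform and software version, and the addresses and numbers are examples:

```
! Illustrative IOS IP SLA udp-jitter setup (addresses/ports are examples).
! On the responder side:
ip sla responder

! On the sender side:
ip sla 10
 udp-jitter 192.0.2.2 16384
 frequency 60
ip sla schedule 10 life forever start-time now
```

Once scheduled, the operation's statistics become readable via SNMP, which is where the NMS picks them up.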

Should we be in the position of having an all-Cisco shop, we can have a look at SolarWinds' IP SLA Monitor and VoIP and Network Quality Manager products to simplify setting things up. Otherwise, we're looking at a more manual process if our vendor doesn't have something similar.


Observing test performance at different levels gives us reports of different granularity. By running tests at the organization, site and link levels, we can start with the bigger picture's metrics and work our way down to specific problems.


Organization-Level Tests


Most of these will be installed at the edge devices or close to them. They will perform edge-to-edge tests against a device at the destination organization or cloud hosting provider. There shouldn't be too many of these tests configured.


Site-Level Tests


Site-to-site tests will be configured close to the WAN links and will monitor overall connectivity between sites. The point of these tests is to give a general perspective on intersite traffic, so they shouldn't be installed directly on the WAN links. Depending on our organization, there could be none of these or a large number.


Link-Level Tests


Each network device has a test for each of its routed links to other network devices to measure latency. This is where the largest number of tests are configured, but it's also where we're going to find the most detail.


Agent and responder testing isn't passive. These tests introduce traffic to the network for the purposes of testing, and while that traffic shouldn't be significant enough to cause impact, there's always the possibility that it will. We need to keep an eye on the interfaces and queues to be sure that there isn't any significant change.

Frequency and Impact

Running agents and responders on the network devices themselves is going to generate additional CPU cycles. Network devices as a whole are not known for having a lot of processing capacity, so the frequency of these tests may need to be adjusted to factor that in.

Processing Delay

For the same reason, most networking devices aren't going to perform these tests quickly. The results may require a bit of a "fudge factor" at the analysis stage to account for this.

The Whisper in the Wires

Having a mesh of agents and responders at the different levels can provide point-in-time analysis of latencies and soft failures throughout the network, but it needs to be managed carefully to avoid negatively impacting the network itself.

Thanks to Thwack MVP byrona for spurring some of my thinking on this topic.

Is anyone else building something along these lines?

Last week we talked about application-aware monitoring. Rather than placing our focus on the devices and interfaces, we discussed getting data that approximates our users' experiences. These users are going to be distributed around the organization at the very least. They may even be scattered around the Internet, depending on the scope of our application. We need to examine application performance from different perspectives to get a complete picture.

Any way we look at it, we're going to need active remote probes/agents to accomplish what we're looking for. These should be programmable to emulate application behaviour, so that we can get the most relevant data. At the very least, we need something that can measure basic network performance from any point on the network. There are a few options.


Last week, I was invited to Tech Field Day 12 as a delegate and had the opportunity to sit in on the first session of Networking Field Day 13 as a guest. Coincidentally, SolarWinds was the first presenter. Even more coincidentally, they were showing off the NetPath feature of Network Performance Monitor (NPM) 12. This product, while not yet fully programmable to emulate specific applications, provides detailed hop-by-hop analysis from any point at which an agent/probe can be placed. In addition, it maintains a performance history for those times when we get notification of a problem well after the fact. For those of you working with NPM 12, I'm going to recommend you have a very close look at NetPath as a beginning for this sort of monitoring.

One downside of the NetPath probes is the requirement to have a Windows Professional computer running at each agent location. This makes it a heavier and more costly option, but well worth it for the information that it provides. Hopefully, the SolarWinds folks will look into lightweight options for the probe side of NetPath in the future. We're only at 1.0, so there's a lot of room for growth and development.

Looking at lighter, though less full-featured, options, we have a few. They're mostly roll-your-own solutions, but this adds flexibility at the cost of ease.

Lightweight VMs and ARM Appliances

If there's a little bit of room on a VM somewhere, that's enough space for a lightweight VM to be installed. Regular application performance probes can be run from these and report directly to a monitoring station via syslog or SNMP traps. These custom probes can even be controlled remotely by executing them via SSH.
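As a rough sketch of what such a custom probe might look like, here's a minimal Python example that times an HTTP fetch and sends the result to a syslog collector over UDP. The target URL and collector address below are placeholders, and a production probe would want retries and richer message formatting:

```python
import socket
import time
import urllib.request

def probe_http(url, timeout=5.0):
    """Measure wall-clock response time for a simple HTTP GET.
    Returns latency in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass
    except OSError:
        return None
    return (time.monotonic() - start) * 1000.0

def syslog_packet(message, facility=16, severity=6):
    """Build an RFC 3164-style syslog payload (default: local0.info).
    Priority is facility * 8 + severity."""
    priority = facility * 8 + severity
    return ("<%d>%s" % (priority, message)).encode("ascii")

def report(collector, message, port=514):
    """Fire-and-forget UDP syslog to the monitoring station."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(syslog_packet(message), (collector, port))

if __name__ == "__main__":
    # Hypothetical probe target and collector address
    latency = probe_http("http://intranet.example.com/")
    msg = "probe=http latency_ms=%s" % (
        "timeout" if latency is None else "%.1f" % latency)
    report("192.0.2.10", msg)
```

A cron job or a loop with a sleep is enough to schedule it, and the same skeleton works unchanged on a Raspberry Pi.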

In the absence of VM space, the same sort of thing can be run from a small ARM computer, like a Raspberry Pi. The probe device itself can even be powered by the on-board USB port of another networking device nearby.

Going back to NetPath for a moment, one option for SolarWinds is to leverage Windows Embedded and/or Windows IoT as a lightweight option for NetPath probes. This is something I think would be worth having a look at.

On-device Containers

A few networking vendors have opened up the ability to run small custom VMs and containers on the device itself (Cisco's ISR 4K line, for example). This extends the availability of agents/probes to locations where there are no local compute resources available.

Built-in Router/Switch Functions

Thwack MVP byrona had a brilliant idea with his implementation of IP SLA in Cisco routers and having Orion collect the statistics, presumably via SNMP. This requires no additional hardware and minimal administrative overhead. Just set up the IP SLA process and read the statistics as they're generated.
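For reference, a minimal IP SLA setup on a Cisco router might look something like the following. The addresses, interface, and operation number are examples, and the exact syntax varies by IOS version; once running, the statistics can be polled via SNMP from the CISCO-RTTMON-MIB:

```
ip sla 10
 icmp-echo 192.0.2.1 source-interface GigabitEthernet0/1
 frequency 60
ip sla schedule 10 life forever start-time now
!
! Allow the monitoring station to read the resulting statistics
snmp-server community monitor RO
```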

The Whisper in the Wires

NetPath is looking like a very promising approach to monitoring from different points of view. For other solutions, we're unfortunately still mostly at the roll-your-own stage. Still, there are encouraging developments on the horizon.

What are you doing to get a look at your application performance from around the network?

Over the next few posts, I'm going to explore some newer thinking in network monitoring. We start by designing centralized management stations and remote agents, but is this sufficient? We look at things from a device and interface perspective and establish baselines of how the network should operate. This works, but is something of a one-size-fits-all solution. Where do we look when we want more than this?


A Little Bit of SDN


Over the last few years, Software Defined Networking (SDN) has generated a lot of buzz. Ask 10 different people what it's all about and you'll likely get 10 different answers, none of which is really incorrect. It's a term whose definition tends to exceed its usability. Still, SDN maintains mind share because the components people tend to associate with it are desirable: centralized management and/or control, programmable configuration management, and application-aware networking, amongst others. This last one is of the most immediate interest for our current topic.


The network's performance as it relates to the key applications running on it is immediately relevant to the business. This isn't just the performance of the application across a single device; it's a look at the gestalt of the application's performance across the entire network. This allows detection of problems and performance degradation where they matter most. Drilling down to the specific devices and interfaces that have impact can come later.


Network Tomography


Recently, I ran across a new term, or at least it was a new term to me: Network Tomography. This describes the gathering of a network's characteristics from endpoint data. The "tomography" part of the term comes from medical technologies like Magnetic Resonance Imaging (MRI), where the internal characteristics of an object are derived from the outside. Network tomography isn't really tomography, but the term conveys the meaning fairly well. The basic idea is to detect loss or delay over a path by using active probes from the endpoints and recording the results.


Monitoring loss and delay over a path is a beginning. Most application performance issues are going to be covered by this approach. We can track performance from many locations in the network and still report the results centrally, giving us a more complete picture of how the business is using the network.
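As an illustration of the arithmetic involved, here's a small Python function that reduces one round of endpoint probes into loss, delay, and jitter figures. The input format is an assumption for the sketch; any probe tool that records per-packet round-trip times could feed it:

```python
def path_stats(rtts_ms):
    """Summarize a round of endpoint probes for one path.

    rtts_ms: list of round-trip times in milliseconds, with None
    marking probes that never came back (loss)."""
    sent = len(rtts_ms)
    returned = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(returned)) / sent if sent else 0.0
    avg = sum(returned) / len(returned) if returned else None
    # Jitter as mean absolute difference between consecutive probes,
    # in the spirit of RFC 3550's interarrival jitter
    diffs = [abs(b - a) for a, b in zip(returned, returned[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    return {"loss_pct": loss_pct, "avg_ms": avg, "jitter_ms": jitter}
```

Reporting these three numbers per path, per interval, from each endpoint is already enough to centrally spot a soft failure on a specific leg of the network.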


Application Metrics


If we're going to look at network performance from the application's view, we'll need a fingerprint of what each application does. Many pieces of software use a single network access method, such as Hypertext Transfer Protocol (HTTP), and can be tracked easily. Others use many different data access methods and will need a more complex definition. Either way, we need to monitor all of the components in order to recognize where the problems lie. Getting these details may be as simple as speaking to the vendor, or may require some packet analysis. Regardless, if we have problems with one aspect of an application's communications but not another, the experience is still sub-par and we may not know why.
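Once per-component baselines exist, the comparison itself is simple. The sketch below assumes we've already measured latency per component (HTTP front end, database, directory, &c.) and flags anything that's missing or well above its baseline; the component names and threshold are purely illustrative:

```python
def degraded_components(measured_ms, baseline_ms, threshold=2.0):
    """Flag application components whose measured latency exceeds
    `threshold` times their baseline. Components with no measurement
    at all are flagged as failed."""
    flagged = {}
    for component, baseline in baseline_ms.items():
        sample = measured_ms.get(component)
        if sample is None:
            flagged[component] = "failed"
        elif sample > threshold * baseline:
            flagged[component] = "slow"
    return flagged

# Hypothetical fingerprint for a three-tier application
baseline = {"http": 40.0, "db": 5.0, "ldap": 10.0}
```

With a fingerprint like this, a slow database behind a healthy web tier stops looking like "the network is slow" and starts looking like a specific answer.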


The Whisper in the Wires


We've all run into frustrated users claiming that "the network is slow" and have torn our hair out trying to find out the specifics of what that means. Ultimately, technical specifics aside, that's exactly what it means. We're just not used to looking at it that way.


What do we need to put this into practice? Smarter agents? Containers or VMs that allow us to either locally or remotely test application performance? How do we automate this? I'll be giving my own perspective over the next few weeks, but would like to hear your thoughts in the meantime.

A network monitor uses many mechanisms to get data about the status of its devices and interconnected links. In some cases, the monitoring station and its agents collect data from devices. In others, the devices themselves report information to the station and agents. Our monitoring strategy should use the right means to get the information we need with the least impact to the network's performance. Let's have a look at the tools available to us.

Pull Methods

We'll start with pull methods, where the network monitoring station and agents query devices for relevant information.


SNMP

SNMP, the Simple Network Management Protocol, has been at the core of query-based network monitoring since the early 1990s. It began as a simple method for accessing a standard body of information called a Management Information Base, or MIB. Most equipment and software vendors have embraced and extended this body to varying degrees of consistency.

Further revisions (SNMPv2c, SNMPv2u and SNMPv3) came along in the late 1990s and early 2000s. These respectively added some bulk information retrieval functionality and improved privacy and security.

SNMP information is usually polled from the monitoring station at five-minute intervals. This allows the station to compile trend data for key device metrics. Some of these metrics will be central to device operation: CPU use, free memory, uptime, &c. Others will deal with the connected links: bandwidth usage, link errors and status, queue overflows, &c.
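Bandwidth usage is a good example of how those polled counters become trend data: ifInOctets and ifOutOctets are cumulative, so each five-minute sample has to be differenced against the previous one, with an eye on counter rollover. A sketch of that calculation in Python:

```python
def utilization_pct(prev_octets, curr_octets, interval_s, if_speed_bps,
                    counter_bits=32):
    """Percent link utilization between two ifInOctets/ifOutOctets
    samples. The modulo handles a single counter wrap: 32-bit
    ifOctets counters roll over quickly on fast links, which is why
    64-bit ifHCOctets counters are preferred where available."""
    wrap = 2 ** counter_bits
    delta_octets = (curr_octets - prev_octets) % wrap
    bits = delta_octets * 8
    return 100.0 * bits / (if_speed_bps * interval_s)
```

On a 10 Mb/s link, 37,500,000 octets moved over a 300-second poll works out to 10% utilization; a counter that wrapped between polls still produces a sane delta.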

We need to be careful when setting up SNMP queries. Many networking devices don't have a lot of spare processor cycles for handling SNMP queries, so we should minimize the frequency and volume of retrieved information. Otherwise, we risk impact to network performance just by our active monitoring.

Query Scripts

SNMP is an older technology, and the information we can retrieve through it can be limited. When we need information that isn't available through a query, we need to resort to other options. Often, scripted access to the device's command-line interface (CLI) is the simplest method. Utilities like expect, or scripting languages like Python or Go, allow the necessary data to be extracted by filtering the CLI output.
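As an example of the filtering step, here's a Python sketch that pulls input error counts out of "show interfaces"-style output. The sample text is hypothetical; real output differs between vendors and software versions, which is exactly why this approach needs per-platform care:

```python
import re

# Hypothetical fragment of "show interfaces"-style CLI output
SAMPLE = """\
GigabitEthernet0/1 is up, line protocol is up
  5 minute input rate 2184000 bits/sec, 429 packets/sec
  142 input errors, 7 CRC, 0 frame, 0 overrun
GigabitEthernet0/2 is down, line protocol is down
  5 minute input rate 0 bits/sec, 0 packets/sec
  0 input errors, 0 CRC, 0 frame, 0 overrun
"""

def input_errors(cli_output):
    """Map interface name -> input error count from CLI output."""
    errors = {}
    current = None
    for line in cli_output.splitlines():
        # Interface header lines start in column one
        header = re.match(r"^(\S+) is \S+", line)
        if header:
            current = header.group(1)
        counter = re.search(r"(\d+) input errors", line)
        if counter and current:
            errors[current] = int(counter.group(1))
    return errors
```

In practice, the raw text would come from an SSH session (via expect, Paramiko, or similar) rather than a string constant.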

As with SNMP, we need to be careful about taxing the devices we're querying. CLI output is usually much more detailed than an SNMP query response and requires more work on the part of the device to produce.

Push Methods

Push methods are the second group of information-gathering techniques. With these, the device sends information to the monitoring station or its agents without first being asked.


SNMP Traps

SNMP has a basic push model where the device sends urgent information to the monitoring station and/or agents as events occur. These SNMP traps cover most of the changes we want to know about right away. For the most part, they trigger on fixed occurrences: interface up/down, routing protocol peer connection status, device reboot, &c.


RMON

RMON, Remote Network MONitoring, was developed as an extension to SNMP. It puts more focus on the information flowing across the device than on the device itself, and is most often used to define conditions under which an SNMP trap should be sent. Where SNMP will send a trap when a given event occurs, RMON can have more specific triggers based on more detailed information. If we're concerned about CPU spikes, for example, we can have RMON send a trap when CPU usage rises too quickly.
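Taking the CPU-spike example, an RMON alarm/event pair on a Cisco device might look roughly like this. The OID, thresholds, and exact syntax are illustrative and vary by platform and software version:

```
! Trap when the 5-minute CPU average climbs to 80%, log when it
! falls back to 40%. OID shown is cpmCPUTotal5minRev for CPU 1.
rmon event 1 trap public description "High CPU" owner noc
rmon event 2 log description "CPU recovered" owner noc
rmon alarm 10 1.3.6.1.4.1.9.9.109.1.1.1.1.8.1 60 absolute rising-threshold 80 1 falling-threshold 40 2 owner noc
```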


Syslog

Most devices will send a log stream to a remote server for archiving and analysis. By tuning what is sent at the device level, operational details can be relayed to the monitoring station in near real time. The trick is to keep this information filtered at the transmitting device so that only the relevant information is sent.
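On a Cisco device, that device-side filtering can be as simple as capping the severity that leaves the box while keeping fuller detail in the local buffer. The collector address is a placeholder and syntax varies by platform:

```
! Send only warning-and-above to the collector...
logging host 192.0.2.10
logging trap warnings
! ...but keep full detail in the local buffer for troubleshooting
logging buffered 16384 debugging
```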

Device Scripting

Some devices, particularly Linux-based ones, can run scripts locally and send the output to the monitoring server via syslog or SNMP traps. Cisco devices, for example, can use Tool Command Language (TCL) scripts or Embedded Event Manager (EEM) applets to provide this function.
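As a hedged example, an EEM applet that watches a CPU OID and raises a syslog message might look like the following. The OID, threshold, and polling interval are illustrative, and applet syntax varies across IOS releases:

```
event manager applet HIGH_CPU
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.8.1 get-type exact entry-op ge entry-val 80 poll-interval 60
 action 1.0 syslog priority critical msg "5-minute CPU average has crossed 80%"
```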

Your Angle

Which technologies are you considering for your network monitoring strategies? Are you sticking with the old tried and true methods and nothing more, or are you giving some thought to something new and exciting?

Getting Started

The network monitoring piece of network management can be a frightening proposition. Ensuring that we have the information we need is an important step, but it's only the first of many. There's a lot of information out there and the picking and choosing of it can be an exercise in frustration.

A Story

I remember the first time I installed an intrusion detection system (IDS) on a network. I had the usual expectations of a first-time user. I would begin with shining a spotlight on denizens of the seedier side of the network as they came to my front door. I would observe with an all-seeing eye and revel in my newfound awareness. They would attempt to access my precious network and I would smite their feeble attempts with.... Well, you get the idea.


It turns out there was a lot more to it than I expected and I had to reevaluate my position. Awareness without education doesn't help much. My education began when I realized that I had failed to trim down the signatures that the IDS was using. The floodgates had opened, and my logs were filling up with everything that had even a remote possibility of being a security problem. Data was flowing faster than I could make sense of it. I had the information I needed and a whole lot more, but no more understanding of my situation than I had before. I won't even get into how I felt once I considered that this data was only a single device's worth.


After a time, I learned to tune things so that I was only watching for the things I was most concerned about. This isn't an unusual scenario when we're just getting started with monitoring. It's our first jump into the pool and we often go straight to the deep end, not realizing how easy it is to get in over our heads. We only realize later on that we need to start with the important bits and work our way up.

The Reality

Most of us are facing the task of monitoring larger interconnected systems. We get data from many sources and attempt to divine meaning from the deluge. Sometimes the importance of what we're receiving is obvious and relevant, e.g., a message with critical priority telling us that a device is almost out of memory. In other cases, the applicability of the information isn't as obvious; it just becomes useful material when we find out about a problem through other channels.


That obvious and relevant information is the best place to start. When the network is on the verge of a complete meltdown, those messages are almost always going to show up first. The trick is in getting them in time to do something about them.


Most network monitoring installations begin with polling devices for data. This may start with pinging each device to make sure it's accessible. Next comes testing the connections to the services on the device to make sure that none of them have failed. Querying the device's well-being with the Simple Network Management Protocol (SNMP) usually accompanies this too. What do these have in common? The management station is asking the network devices, usually at five-minute intervals, how things are going. This is essential for collecting data for analysis and building a picture of normal operation. For critical problems, though, something more is needed.


This is where syslog and SNMP traps come into play. This data is actively sent from the monitored devices as events occur. There is no waiting for a five-minute polling interval to find out that the processor has suddenly spiked to 100% or that a critical interface has gone down. The downside is that there is usually a lot more information presented than is immediately necessary. This is the same kind of floodgate scenario I ran into in my earlier example. Configuring syslog to send messages at the "error" level and above is an important sanity-saving measure. SNMP traps are somewhat better in this respect, as they report on actual events instead of every little thing that happens on the device.

The Whisper in the Wires

Ultimately, network monitoring is about two things:


  1. Knowing where the problems are before anyone else knows they're there and being able to fix them.

  2. Having all of the trend data to understand where problems are likely to be in the future. This provides the necessary justification to restructure or add capacity before they become a problem.


The first of these is the most urgent one. When we're building our monitoring systems, we need to focus on the critical things that will take our networks down first. We don't need to get sidetracked by the pretty pictures and graphs... at least not until that first bit is done. Once that's covered, we can worry about the long view.


The first truth of RFC 1925 "The 12 Networking Truths" is that it has to work. If we start with that as our beginning point, we're off to a good start.

In previous discussions about increasing the effectiveness of monitoring, it has been pointed out that having more eyes on the data will yield more insight into the problem. While this is true, part of the point of SIEM is to have more automated processes correlating the data so that the expense of additional human observation can be avoided. Still, no automated process quite measures up to the more flexible and intuitive observations of human beings. What if we looked at a hybrid approach that didn't require hiring additional security analysts?


In addition to the SIEM’s analytics, what if we took the access particulars of each user in a group and published a daily summary of what (generally) was accessed, where from, when and for how long to their immediate team and management? Such a summary could have a mechanism to anonymously flag access anomalies. In addition, the flagging system could have an optional comment on why this is seen as an abnormal event. e.g. John was with me at lunch at the time and couldn’t have accessed anything from his workstation.
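Mechanically, such a digest is straightforward to produce from whatever access records the SIEM can export. Here's a Python sketch under that assumption; the field names are invented for illustration:

```python
from collections import defaultdict

def daily_summary(access_events):
    """Condense raw access events into a per-user daily digest.

    access_events: iterable of dicts with 'user', 'resource',
    'source', and 'duration_min' keys (a simplified stand-in for
    whatever the SIEM actually exports)."""
    digest = defaultdict(lambda: {"resources": set(), "sources": set(),
                                  "total_min": 0})
    for event in access_events:
        entry = digest[event["user"]]
        entry["resources"].add(event["resource"])
        entry["sources"].add(event["source"])
        entry["total_min"] += event["duration_min"]
    return dict(digest)
```

The hard part isn't the aggregation; it's deciding how general the "resource" field should be so that the summary is useful without revealing too much.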


Would something like this make the security analysis easier by having eyes with a vested interest in the security of their work examining the summaries? Would we be revealing too much about the target systems and data? Are we assuming that there is sufficient interest on the part of the team to even bother reading such a summary?


Thinking more darkly, is this a step onto a slippery slope of creating an Orwellian work environment? Or… is this just one more metre down a slope we’ve been sliding down for a long time?

When implementing a SIEM infrastructure, we’re very careful to inventory all of the possible vectors of attack for our critical systems, but how carefully do we consider the SIEM itself and its logging mechanisms in that list?


For routine intrusions, this isn’t really a consideration. The average individual doesn’t consider the possibility of being watched unless there is physical evidence (security cameras, &c) to remind them, so few steps are taken to hide their activities… if any.


For more serious efforts, someone wearing a black hat is going to do their homework and attempt to mitigate any mechanisms that will provide evidence of their activities. This can range from simple things like…


  • adding a static route on the monitored system to direct log aggregator traffic to a null destination
  • adding an outbound filter on the monitored system or access switch that blocks syslog and SNMP traffic


… to more advanced mechanisms like …


  • installing a filtering tap to block or filter syslog, SNMP and related traffic
  • filtering syslog messages to hide specific activity


Admittedly, these things require administrator-level or physical access to the systems in question, which is likely to trigger an event in the first place, but we also can’t dismiss the idea that some of the most significant security threats originate internally. I also look back to my first post about logging sources and wonder if devices like L2 access switches are being considered as potential vectors. They're not in the routing path, but they can certainly have ACLs applied to them.


I don’t wear a black hat, and I’m certain that the things I can think of are only scratching the surface of possible internal attacks on the SIEM infrastructure.


So, before I keep following this train of thought and start wearing a tin foil hat, let me ask these questions:


Are we adequately securing and monitoring the security system and supporting infrastructure?

If so, what steps are we taking to do so?

How far do we take this?

Security management and response systems are often high-profile investments that occur only when the impact of IT threats to the business is fully appreciated by management. At least in the small and midmarket space, this understanding rarely happens before the pain of a security breach, and even then enlightenment comes only after repeated exposure. When it does, it's amazing how seriously the matter is taken and how quickly a budget is established. Until this occurs, however, the system is often seen as a commodity purchase rather than an investment in an ongoing business-critical process.


Unfortunately, before the need is realized, there is often little will on the part of the business to take action. In many cases, organizations are highly resistant to even a commodity approach because they haven't yet suffered a breach. One might think such cases are in the minority, but as many as 60% of businesses have either an outdated "We have a firewall, so we're safe!" security strategy or no security strategy at all.
[Source: Cisco Press Release: New Cisco Security Study Shows Canadian Businesses Not Prepared For Security Threats - December 2014]


Obviously, different clients will be at varying stages of security self-awareness, some a bit further along than others. Those that have nothing need to be convinced that a security strategy is necessary. Others need to be persuaded that a firewall or other security appliance is only part of the necessary plan, not the entirety of it. No matter where they stand, the challenge is convincing them of the need for a comprehensive policy and management process before they're burned by an intrusion, and doing so without appearing to use scare tactics.


What approaches have you taken to ensure that the influencers and decision makers appreciate the requirements before they feel the pain?

