Geek Speak

14 Posts authored by: jordan.martin

Logic and objective thinking are hallmarks of any engineering field. IT design and troubleshooting are no exceptions. Computers and networks are systems of logic, so we, as humans, have to think in such terms to effectively design and manage these systems. The problem is that the human brain isn’t exactly the most efficient logical processing engine out there. Our logic is often skewed by what are called cognitive biases. These biases take many potential forms, but ultimately they skew our interpretation of information in one way or another. This leaves us believing we are approaching a problem logically, when in reality we are operating on a distorted sense of reality.


What am I talking about? Below are some common examples of cognitive biases that I see all the time as a consultant in enterprise environments. This is by no means a comprehensive list. If you want to dig in further, Wikipedia has a great landing page with brief descriptions and links to more comprehensive entries on each.


Anchoring: Anchoring is when we value the information we learn first as the most important, with subsequent learned information having less weight or value. This is common in troubleshooting, where we often see a subset of symptoms before understanding the whole problem. Unless you can evaluate the value of your initial information against subsequent evidence, you’re likely to spin your wheels when trying to figure out why something is not working as intended.


Backfire effect: The backfire effect is what happens when someone invests further in an original idea or hypothesis, even when new evidence is learned that disproves the initial belief. Some might call this pride, but ultimately no one wants to be wrong, even when it’s justifiable because all the evidence wasn’t available when forming the original opinion. I’ve seen this clearly demonstrated in organizations that have a blame-first culture. Nobody wants to be left holding the bag, so there is more incentive to be right than to solve the problem.


Outcome bias: This bias is our predisposition to judge a decision based on its outcome, rather than on how sound the decision was at the time it was made. I see this regularly from insecure managers who are looking for reasons why things went wrong. It plays a big part in blame culture. This can lead to decision paralysis when we are judged by outcomes we can’t control, rather than by how methodically we worked through an unknown root cause.


Confirmation bias: With confirmation bias, we search for, and ultimately give more weight to, evidence that supports our original hypothesis or belief about the way things should be. This is incredibly common in all areas of life, including IT decision making. It reflects more on our emotional need to be right than any intentional negative trait.


Reactive devaluation: This bias is when someone devalues or dismisses an opinion not on merit, but on the fact that it came from an adversary or someone they don’t like. I’m sure you’ve seen this one, too. It’s hard to admit when someone you don’t respect is right, but by not doing so, you may be dismissing relevant information in your decision-making process.


Triviality/Bike shedding: This occurs when extraordinary attention is applied to an insignificant detail to avoid having to deal with the larger, more complex, or more challenging issue. By deeply engaging in a triviality, we feel like we provide real value to the conversation. The reality is that we expend cycles of energy on things that ultimately don’t need that level of detail applied.


Normalcy bias: This is a refusal to plan for or acknowledge the possibility of outcomes that haven’t happened before. This is common when thinking about DR/BC because we often can’t imagine or process things that have never occurred before. Our brains immediately work to fill in gaps based on our past experiences, leaving us blind to potential outcomes.


I point out the above examples just to demonstrate some of the many cognitive biases that exist in our collective way of processing information. I’m confident that you’ve seen many of them demonstrated yourself, but ultimately, they continue to persist because of the most challenging bias of them all:


Bias blind spot: This is our tendency to see others as more biased than ourselves, while failing to identify as many cognitive biases in our own actions and decision making. It’s the main reason many of these persist even after we learn about them. Biases are often easy to identify when others demonstrate them, but we often can’t see our own biases when our thinking is being impacted by a bias like those above. The only way to identify our own biases is through an honest and self-reflective post-mortem of our decision making, looking specifically for areas where bias impacted our view of reality.


Final Thoughts


Even in a world dominated by objectivity and logical thinking, cognitive biases can be found everywhere. It’s just one of the oddities of the human condition. And bias affects everyone, regardless of intent. If you’ve read the list above and have identified a bias that you’ve fallen for, there’s nothing to be ashamed of. The best minds in the world have the same flaws. The only way to overcome these biases is to inform yourself of them, identify which ones you typically fall prey to, and actively work against those biases when trying to approach a subject objectively. It’s not the mere presence of bias that is a problem. Rather, it’s the lack of awareness of bias that leads people to incorrect decision making.

In system design, every technical decision can be seen as a series of trade-offs. If I choose to implement Technology A, it will provide a positive outcome in one way, but introduce new challenges that I wouldn’t have had if I had chosen Technology B. There are very few decisions in systems design that don’t come down to trade-offs like this. This is the fundamental reason why we have multiple technology solutions that solve similar problem sets. One of the most common trade-offs we see is in how tightly, or loosely, technologies and systems are coupled together. While coupling is often a determining factor in many design decisions, many businesses aren’t directly considering the impact of coupling in their decision-making process. In this article, I want to step through this concept, defining what coupling is and why it matters when thinking about system design.


We should start with a definition. Generically, coupling is a term we use to indicate how interdependent the individual components of a system are. A tightly coupled system will be highly interdependent, while a loosely coupled system will have components that run independently of each other. Let’s look at some of the characteristics of each.


Tightly coupled systems can be identified by the following characteristics:


  • Connections between components in the system are strong
  • Parts of the system are directly dependent on one another
  • A change in one area directly impacts other areas of the system
  • Efficiency is high across the entire system
  • Brittleness increases as complexity or components are added to the system


Loosely coupled systems can be identified by the following characteristics:


  • Connections between components in the system are weak
  • Parts within the system run independently of other parts within the system
  • A change in one area has little or no impact on other areas of the system
  • Sub-optimal levels of efficiency are common
  • Resiliency increases as components are added
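
The contrast above can be sketched in code. This is a minimal, hypothetical illustration (the `Collector` and reporter names, and the queue hand-off, are invented for the example): the tightly coupled reporter calls its dependency directly, so a collector failure is its failure, while the loosely coupled reporter reads from a queue and merely degrades when the collector goes quiet.

```python
import queue

# Tightly coupled: the reporter calls the collector directly.
# If the collector raises, the reporter fails with it.
class Collector:
    def sample(self):
        return {"cpu": 42}

class TightReporter:
    def __init__(self, collector):
        self.collector = collector  # direct, strong dependency

    def report(self):
        return self.collector.sample()  # collector failure = reporter failure

# Loosely coupled: the reporter reads from a queue. If the collector
# is down, the reporter simply sees no new data and keeps running.
class LooseReporter:
    def __init__(self, q):
        self.q = q  # weak dependency via an intermediary

    def report(self):
        try:
            return self.q.get_nowait()
        except queue.Empty:
            return None  # degraded, but not broken

q = queue.Queue()
q.put({"cpu": 42})
print(TightReporter(Collector()).report())  # {'cpu': 42}
print(LooseReporter(q).report())            # {'cpu': 42}
print(LooseReporter(q).report())            # None: collector gone, reporter alive
```

Note the trade-off the lists describe: the tight version is simpler and more efficient, the loose version tolerates a missing dependency at the cost of sometimes having nothing to report.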


So which is better?


Like all proper technology questions, the answer is “It depends!”  The reality is that technologies and architectures sit somewhere on the spectrum between completely loose and completely tight, with both having advantages and disadvantages.


When speaking of systems, efficiency is almost always something we’re concerned about so tight coupling seems like a logical direction to look. We want systems that act in a completely coordinated fashion, delivering value to the business with as little wasted effort or resources as possible. It’s a noble goal. However, we often have to solve for resiliency as well, which logically points to loosely coupled systems. Tightly coupled systems become brittle because every part is dependent on the other parts to function. If one part breaks, the rest are incapable of doing what they were intended to do. This is bad for resiliency.


This is better understood with an example, so let’s use DNS as a simple one.


Generally speaking, using DNS instead of directly referencing IP addresses gives efficiency and flexibility to your systems. It allows you to redirect traffic to different hosts at will by modifying a central DNS record, rather than having to change an IP address reference in multiple locations. It is also a great central information repository on how to reach many devices on your network. We often recommend that applications use DNS lookups, rather than direct IP address references, because of the additional value it provides. The downside is that this name reference now introduces a false dependency. Many of your applications can work perfectly fine without referring to DNS, but by introducing it you have tightened the coupling between the DNS system and your application. An application that could previously run independently now depends on name resolution, and your application fails if DNS fails.


In this scenario you have a decision to make: does the value and efficiency of adding DNS lookups to your application outweigh the cost of now needing both systems up and running for your application to work? This is a very simple example, but as we begin layering technology on top of technology, the coupling and dependencies can become both very strong and very hard to actually identify. I’m sure many of you have been in the situation where the failure of one seemingly unrelated system has impacted another system on your network. This is due to hidden coupling, interaction surfaces, and the law of unintended consequences.
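
One way to soften that new coupling is a last-known-good fallback. A hypothetical sketch, using only the standard library (the hostname and cached address below are placeholders, not real endpoints): prefer the central DNS answer, but keep the application limping along on a cached address if resolution fails.

```python
import socket

# Placeholder cache seeded with a last-known-good address.
LAST_KNOWN_GOOD = {"app.example.internal": "10.0.0.25"}

def resolve(host):
    """Resolve via DNS when possible; fall back to a cached address."""
    try:
        addr = socket.gethostbyname(host)  # preferred path: central DNS
        LAST_KNOWN_GOOD[host] = addr       # refresh the cache on success
        return addr
    except socket.gaierror:
        # DNS is unreachable or the name is gone; use the cached address
        # so the application degrades instead of failing outright.
        return LAST_KNOWN_GOOD.get(host)

print(resolve("app.example.internal"))
```

This doesn’t remove the dependency, but it turns a hard failure into a stale-data problem, which is often the easier one to live with.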


To answer the question “Which is better?” again, there is no right answer. We need both. There are times when highly coordinated action is required. There are times when high levels of resilience are required. Most commonly, we need both. When designing and deploying systems, coupling needs to be considered so you can mitigate the downsides of each approach while taking advantage of the positives it provides.

If you have done any work in enterprise networks, you are likely familiar with the idea of a chassis switch. They have been the de facto standard for campus and data center cores and the standard top tier in a three-tier architecture for quite some time, with the venerable and perennial Cisco 6500 having a role in just about every network that I’ve ever worked on. They’re big and expensive, but they’re also resilient and bulletproof. (I mean this in the figurative and literal sense. I doubt you can get a bullet through most chassis switches cleanly.) That being said, there are some downsides to buying chassis switches that don’t often get discussed. In this post, I’m going to make a case against chassis switching. Not because chassis switching is inherently bad, but because I find that a lot of enterprises just default to the chassis as a core because that’s what they’re used to. To do this, I’m going to look at some of the key benefits touted by chassis switch vendors and discuss how alternative architectures can provide these features, potentially in a more effective way.


High Availability


One of the key selling features of chassis switching is high availability. Within a chassis, every component should be deployed in N+1 redundancy. This means you don’t just buy one fancy and expensive supervisor, you buy two. If you’re really serious, you buy two chassis, because the chassis itself is an unlikely, but potential, single point of failure. The reality is that most chassis switches live up to the hype here. I’ve seen many chassis boxes that have been online for entirely too long without a reboot (patching apparently is overrated). The problem here isn’t a reliability question, but rather a blast area question. What do I mean by blast area? It’s the number of devices that are impacted if the switch has an issue. Chassis boxes tend to be densely populated with many devices either directly connected or dependent upon the operation of that physical device.


What happens when something goes wrong? All hardware eventually fails, so what’s the impact of a big centralized switch completely failing? Or more importantly, what’s the impact if it’s misbehaving, but hasn’t failed completely? (Gray-outs are the worst.) Your blast radius is significant and usually comprises most or all of the environment behind that switch. Redundancy is great, but it usually assumes total failure. Things don’t always fail that cleanly.


So, what’s the alternative? We can learn vicariously from our friends in Server Infrastructure groups and deploy distributed systems instead of highly centralized ones. Leaf-spine, a derivative of Clos networks, provides a mechanism for creating a distributed switching fabric that allows for up to half of the switching devices in the network to be offline with the only impact to the network being reduced redundancy and throughput. I don’t have the ability to dive into the details on leaf-spine architectures in this post, but you can check out this Packet Pushers Podcast if you would like a deeper understanding of how they work. A distributed architecture gives you the same level of high availability found in chassis switches but with a much more manageable scalability curve. See that section below for more details on scalability.




Complexity


Complexity can be measured in many ways. There’s management complexity, technical complexity, operational complexity, etc. Fundamentally though, complexity is increased with the introduction and addition of interaction surfaces. Most networking technologies are relatively simple when operated in a bubble (some exceptions do apply), but real complexity starts showing up when those technologies are intermixed and running on top of each other. There are unintended consequences to your routing architecture when your spanning-tree architecture doesn’t act in a coordinated way, for example. This is one of the reasons why systems design has favored virtualization, and now micro-services, over large boxes that run many services. Operation and troubleshooting become far more complex when many things are being done on one system.


Networking is no different. Chassis switches are complicated. There are lots of moving pieces and things that need to go right, all residing under a single control plane. The ability to manage many devices under one management plane may feel like reducing complexity, but the reality is that it’s just an exchange of one type of complexity for another. Generally speaking, it’s easier to troubleshoot a single-purpose device than a multi-purpose device, but operationally it’s easier to manage one or two devices rather than tens or hundreds of devices.




Scalability


You may not know this, but most chassis switches rely on Clos networking techniques for scalability within the chassis. Therefore, it isn’t a stretch to consider moving that same methodology out of the box and into a distributed switching fabric. With the combination of high-speed backplanes/fabrics and multiple line card slots, chassis switches do have a fair amount of flexibility. The challenge is that you have to buy a large enough switch to handle anticipated and unanticipated growth over the life of the switch. For some companies, the life of a chassis switch can be expected to be upwards of 7-10 years. That’s quite a long time. You either need to be clairvoyant and understand your business needs the better part of a decade into the future, or do what most people do: significantly oversize the initial purchase to help ensure that you don’t run out of capacity too quickly.


On the other hand, distributed switching fabrics grow with you. If you need more access ports, you add more leafs. If you need more fabric capacity, you add more spines. There’s also much greater flexibility to adjust to changing capacity trends in the industry. Over the past five years, we’ve been seeing the commoditization of 10Gb, 25Gb, 40Gb, and 100Gb links in the data center. Speeds of 400Gbps are on the not-too-distant horizon, as well. In a chassis switch, you would have had to anticipate this dramatic upswing in individual link speed and purchase a switch that could handle it before the technologies became commonplace.




Upgrades


When talking about upgrading, there really are two types of upgrades that need to be addressed: hardware and software. We’re going to focus on software here, because we briefly addressed the hardware component above. Going back to our complexity discussion, the operation “under the hood” on chassis switches can often be quite complicated. With so many services so tightly packed into one control plane, upgrading can be a very complicated task. To handle this, switch vendors have created an abstraction for the processes and typically offer some form of “In Service Software Upgrade” (ISSU) automation. When it works, it feels miraculous. When it doesn’t, those are bad, bad days. I know few engineers who haven’t had ISSU burn them in one way or another. When everything in your environment is dependent upon one or two control planes always being operational, upgrading becomes a much riskier proposition.


Distributed architectures don’t have this challenge. Since services are distributed across many devices, losing any one device has little impact on the network. Also, since there is only loose coupling between devices in the fabric, not all devices have to be at the same software levels, like chassis switches do. This means you can upgrade a small section of your fabric and test the waters for a bit. If it doesn’t work well, roll it back. If it does, distribute the upgrade across the fabric.


Final Thoughts


I want to reiterate that I’m not making the case that chassis switches shouldn’t ever be used. In fact, I could easily write another post pointing out all the challenges inherent in distributed switching fabrics. The point of the post is to hopefully get people thinking about the choices they have when planning, designing, and deploying the networks they run. No single architecture should be the “go-to” architecture. Rather, you should weigh the trade-offs and make the decision that makes the most sense. Some people need chassis switching. Some networks work better in distributed fabrics. You’ll never know which group you belong to unless you consider factors like those above and the things that matter most to you and your organization.

It’s a common story. Your team has many times more work than you have man-hours to accomplish. Complexity is increasing, demands are rising, acceptable delivery times are dropping, and your team isn’t getting money for more people. What are you supposed to do? Traditionally, the management answer to this question is outsourcing, but that word comes with many connotations and many definitions. It’s a tricky word that often instills unfounded fear in the hearts of operations staff, unfounded hope in IT management, and sometimes (often?) works out far better for the company providing the outsourcing than the company receiving the services. If you’ve been in technology for any amount of time, you’re likely nodding your head right now. Like I said, it’s a common story.


I want to take a practical look at outsourcing and, more specifically, at what outsourcing will never solve for you. We’ll get to that in a second, though. All the old forms of outsourcing are still with us, and we should do our best to define and understand them.


Professional outsourcing is when your company pays someone else to execute services for you, usually because you have too many tasks to complete and too few people to accomplish them. This type of outsourcing solves the problem of staffing size/scaling. We often see this for help desks, admin, and operational tasks. Sometimes it’s augmentative and sometimes it’s a means to replace a whole team. Either way, I’ve rarely seen it work all that well. My theory is that monetary motivation will never instill the same sense of ownership found in a native employee. Note that teams don’t usually use this to augment technical capability. Rather, they use it to increase or replace the technical staff they currently have.


Outside of the staff augmentation style of outsourcing, and a form that usually finds more success, is process-specific outsourcing. This is where you hire experts to provide an application that doesn’t make sense for you to build, or to perform a specific service that is beyond reasonable expectation of handling yourself. This has had many forms and names over the years; some examples might be credit card processing, application service providers, electronic health record software, etc. Common modern names for this type of outsourcing are SaaS (Software-as-a-Service) and PaaS (Platform-as-a-Service). I say this works better because its purpose is augmenting your staff’s technical capacity, leaving your internal staff available to manage the product/service.


The final and newest iteration of outsourcing I want to quickly define is IaaS (Infrastructure-as-a-Service), or public cloud. The running joke is that the cloud is simply someone else’s server, and there is a lot of truth in that. Where the joke falls short is that the cloud providers have mastered automation, orchestration, and scaling in the deployment of their servers. This makes IaaS a form of outsourcing that is less about staffing or niche expertise, and more about solving the complexity and flexibility requirements facing modern business. You are essentially outsourcing complexity rather than tackling it yourself.


In identifying the forms of outsourcing above, I’ve also identified what they truly provide from a value perspective. There is one key piece missing, though, and that brings me to the point of this post. It doesn’t matter how much you outsource, what type of outsourcing you use, or how you outsource it; the one thing you can’t outsource is responsibility.


There is no easy button when it comes to designing infrastructure, and none of these services provide you with a get-out-of-jail-free card if their service fails. None of these services know your network, requirements, outage tolerance, or user requirements as well as you do. They are simply tools in your toolbox, and whether you’re augmenting staff to meet project demands, or building cloud infrastructure to outsource your complexity, you still need people inside your organization making sure your requirements are being met and your business is covered if anything goes wrong. Design, resiliency, disaster recovery, and business continuity, regardless of how difficult they are, will always be things a company is responsible for itself.


100% uptime is a fallacy, even for highly redundant infrastructures run by competent engineering staffs, so you need to plan for such failures. This might mean multiple outsourcing strategies or a hybrid approach to what is outsourced and what you keep in house. It might mean using multiple providers, or multiple regions within a single provider, to provide as much redundancy as possible.


I’ll say it again, because I don’t believe it can be said enough. You can outsource many things, but you cannot outsource responsibility. That ultimately is yours to own.

I remember the simpler days. Back when our infrastructure all lived in one place, usually just in one room, and monitoring it could be as simple as walking in to see if everything looked OK. Today’s environments are very different, with our infrastructure distributed all over the planet and much of it not even being something you can touch with your hands. So, with ever-increasing levels of abstraction introduced by virtualization, cloud infrastructure, and overlays, how do you really know that everything you’re running is performing the way you need it to? In networking this can be a big challenge, as we often solve technical challenges by abstracting the physical path from the routing and forwarding logic. Sometimes we do this multiple times, with overlays existing within overlays, all running over the same underlay. How do you maintain visibility when your network infrastructure is a bunch of abstractions? It’s definitely a difficult challenge, but I have a few tips that should help if you find yourself in this situation.


Know Your Underlay - While all the fancy and interesting stuff is happening in the overlay, your underlay acts much like the foundation of a house. If it isn’t solid, there is no hope for everything built on top of it to run the way you want it to. Traditionally this has been done with polling and traps, but the networking world is evolving, and newer systems are enabling real-time information gathering (streaming telemetry). Collecting both old and new styles of telemetry information and looking for anomalies will give you a picture of the performance of the individual components that comprise your physical infrastructure. Problems in the underlay affect everything, so this should be the first step you take, and the one you’re most likely familiar with, to ensure your operations run smoothly.
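
The "looking for anomalies" step can be as simple as a z-score check over recent counter samples. A hypothetical sketch (the error counts are made up; in practice they would come from SNMP polling or streaming telemetry):

```python
# Flag counter samples that deviate sharply from the rest of the window.
from statistics import mean, stdev

def anomalies(samples, threshold=2.0):
    """Return indices of samples more than `threshold` std devs above the mean."""
    mu, sigma = mean(samples), stdev(samples)
    if sigma == 0:
        return []  # perfectly flat series: nothing to flag
    return [i for i, s in enumerate(samples) if (s - mu) / sigma > threshold]

# Per-interval interface error counts, with one obvious spike.
error_counts = [2, 3, 1, 2, 4, 2, 3, 250, 2, 3]
print(anomalies(error_counts))  # [7]
```

Real telemetry pipelines use far better statistics than this, but the principle is the same: baseline the underlay, then alert on deviation rather than on raw numbers.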


Monitor Reality - Polling and traps are good tools, but they don’t tell us everything we really need to know. Discarded frames and interface errors may give us concrete proof of an issue, but they give no context to how that issue is impacting the services running on your network. Additionally, with more and more services moving to IaaS and SaaS, you don’t necessarily have access to operational data on third party devices. Synthetic transactions are the key here. While it may sound obvious to server administrators, it might be a bit foreign for network practitioners. Monitor the very things your users are trying to do. Are you supporting a web application?  Regularly send an HTTP request to the site and measure response time to completion. Measure the amount of data that is returned. Look for web server status codes and anomalies in that transaction. Do the same for database systems, and collaboration systems, and file servers… You get the idea. This is the proverbial canary in a coal mine and what lets you know something is up before the users end up standing next to your desk. The reality is that network problems ultimately manifest themselves as system issues to the end users, so you can’t ignore this component of your network.
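
A synthetic HTTP transaction needs nothing more than the standard library. This sketch times a GET, records the status code, and measures the payload size, exactly the checks described above; the local throwaway server exists only to keep the example self-contained, and in practice you would point `probe()` at the application your users actually depend on.

```python
import http.server
import threading
import time
import urllib.request

def probe(url, timeout=5):
    """Fetch a URL and report status code, payload size, and elapsed time."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
        return {
            "status": resp.status,
            "bytes": len(body),
            "seconds": round(time.monotonic() - start, 3),
        }

# Stand-in target so the example runs anywhere.
server = http.server.HTTPServer(("127.0.0.1", 0),
                                http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = probe(f"http://127.0.0.1:{server.server_address[1]}/")
print(result["status"], result["bytes"], result["seconds"])
server.shutdown()
```

Run this on a schedule, keep the history, and anomalies in status, size, or latency become your canary.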


Numbers Can Lie - One of the unwritten rules of visibility tools is to use the IP address, not the DNS name, in setting up pollers and monitoring. I mean, we’re networkers, right? IP addresses are the real source of truth when it comes to path selection and performance. While there is some level of wisdom to this, it omits part of the bigger picture and can lead you astray. Administrators may regularly use IP addresses to connect to and utilize the systems we run, but that is rarely true for our users, and DNS often is a contributing cause to outages and performance issues. Speaking again to services that reside far outside of our physical premises, the DNS picture can get even more complicated depending on the perspective and path you are using to access those services. Keep that in mind and use synthetic transactions to query your significant name entries, but also set up some pollers that use the DNS system to resolve the address of target hosts to ensure both name resolution and direct IP traffic are seeing similar performance characteristics.
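
Separating resolution time from connection time makes DNS-induced slowness visible on its own. A minimal sketch (localhost is used only so the example is self-contained; a real poller would time resolution of your significant service names):

```python
import socket
import time

def timed_resolve(host):
    """Resolve a name and report how long resolution itself took."""
    start = time.monotonic()
    addr = socket.gethostbyname(host)  # IPv4 resolution only
    return addr, time.monotonic() - start

addr, dns_seconds = timed_resolve("localhost")
print(addr, round(dns_seconds, 4))
```

Pairing a poller built on this with one that targets the IP directly gives you the comparison described above: if the name-based path is slow and the IP-based path isn’t, DNS is your suspect.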


Perspective Matters - It’s always been true, but where you test from is often just as important as what you test. Traditionally our polling systems are centrally located and close to the things they monitor. Proverbially they act as the administrator walking into the room to check on things, except they just live there all the time. This design makes a lot of sense in a hub style design, where many offices may come back to a handful of regional hubs for computing resources. But, yet again, cloud infrastructure is changing this in a big way. Many organizations offload Internet traffic at the branch, meaning access to some of your resources may be happening over the Internet and some may be happening over your WAN. If this is the case it makes way more sense to be monitoring from the user’s perspective, rather than from your data centers. There are some neat tools out there to place small and inexpensive sensors all over your network, giving you the opportunity to see the network through many different perspectives and giving you a broader view on network performance.


Final Thoughts


While the tactics and targets may change over time, the same rules that have always applied, still apply. Our visibility systems are only as good as the foresight we have into what could possibly go wrong. With virtualization, cloud, and abstraction playing larger roles in our network, it’s more important than ever to have a clear picture of what it is in our infrastructure that we should be looking for. Abstraction reduces the complexity presented to the end user but, in the end, all it is doing is hiding the complexity that’s always existed. Typically, abstraction actually increases overall system complexity. And as our networks and systems become ever more complex, it takes more thought and insight into “how things really work” to make sure we are looking for the right things, in the right places, to confidently know the state of the systems we are responsible for.

The title of this post raises an important question and one that seems to be on the mind of everyone who works in an infrastructure role these days. How are automation and orchestration going to transform my role as an infrastructure engineer? APIs seem to be all the rage, and vendors are tripping over themselves to integrate northbound APIs, southbound APIs, dynamic/distributed workloads, and abstraction layers anywhere they can. What does it all mean for you and the way you run your infrastructure?


My guess is that it probably won’t impact your role all that much.


I can see the wheels turning already. Some of you are vehemently disagreeing with me and want to stop reading now, because you see every infrastructure engineer only interacting with an IDE, scripting all changes/deployments. Others of you are looking for validation for holding on to the familiar processes and procedures that have been developed over the years. Unfortunately, I think both of those approaches are flawed. Here’s why:


Do you need to learn to code? To some degree, yes! You need to learn to script and automate those repeatable tasks where a script will save you time. The thing is, this isn’t anything new. If you want to be an excellent infrastructure engineer, you’ve always needed to know how to script and automate tasks. If anything, this newly minted attention being placed on automation should make it less of an effort to achieve (anyone who’s had to write expect scripts for multiple platforms should be nodding their head at this point). A focus on automation doesn’t mean that you just now need to learn how to use these tools. It means that vendors are finally realizing the value and making this process easier for the end-user. If you don’t know how to script, you should pick a commonly used language and start learning it. I might suggest Python or PowerShell if you aren’t familiar with any languages just yet.
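
A first scripting win for most engineers is templating repetitive configuration. A hypothetical Python sketch (the port names, peers, and VLANs are invented for the example): stamp out a consistent interface block for a list of ports instead of hand-typing each one.

```python
# Template for one interface stanza; fields are filled per entry below.
TEMPLATE = """interface {port}
 description uplink to {peer}
 switchport access vlan {vlan}
"""

# Invented example data; in practice this might come from a CSV or IPAM.
ports = [
    {"port": "Gi1/0/1", "peer": "core-sw-01", "vlan": 10},
    {"port": "Gi1/0/2", "peer": "core-sw-02", "vlan": 10},
]

def render(entries):
    """Render one config stanza per entry, joined into a single blob."""
    return "\n".join(TEMPLATE.format(**e) for e in entries)

print(render(ports))
```

Ten lines like this remove both the tedium and the typos, and the same pattern scales from two interfaces to two thousand.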


Do I need to re-tool and become a programmer?  Absolutely not! Programming is a skill in and of itself, and infrastructure engineers will not need to be full-fledged programmers as we move forward. By all means, if you want to shift careers, go for it. We need full-time programmers who understand how infrastructure really works. But, automation and orchestration aren’t going to demand that every engineer learn how to write their own compilers, optimize their code for obscure processors, or make their code operate across multiple platforms. If you are managing infrastructure through scripting, and you aren’t the size of Google, that level of optimization and reusability isn’t going to be necessary to see significant optimization of your processes. You won’t be building the platforms, just tweaking them to do your will.


Speaking of platforms, this is the main reason why I don’t think your job is really going to change that much. We’re in the early days of serious infrastructure automation. As the market matures, vendors are going to be offering more and more advanced orchestration platforms as part of their product catalog. You are likely going to interface with these platforms via a web front end or a CLI, not necessarily through scripts or APIs. Platforms will have easy-to-use front ends with an engine on the back end that does the scripting and API calls for you. Think about this in the terms of Amazon AWS. Their IaaS products are highly automated and orchestrated, but you primarily control that automation from a web control panel. Sure, you can dig in and start automating some of your own calls, but that isn’t really required by the large majority of organizations. This is going to be true for on-premises equipment moving forward as well.


Final Thoughts


Is life for the infrastructure engineer going to drastically change because of a push for automation? I don’t think so. That being said, scripting is a skill that you need in your toolbox if you want to be a serious infrastructure engineer. The nice thing about automation and scripting is that it requires predictability and standardization of your configurations, and this leads to stable and predictable systems. On the other hand, if scripting and automation sound like something you would enjoy doing as the primary function of your job, the market has never been better or had more opportunities to do it full time. We need people writing code who have infrastructure management experience.


Of course, I could be completely wrong about all of this, and I would love to hear your thoughts in the comments either way.

There’s no question that trends in IT change on a dime and have done so for as long as technology has been around. The hallmark of a truly talented IT professional is the ability to adapt to those ever-present changes and remain relevant, regardless of the direction that the winds of hype are pushing us this week. It’s challenging and daunting at times, but adaptation is just part of the gig in IT engineering.


Where are we headed?


Cloud (Public) - Organizations are adopting public cloud services in greater numbers than ever. Whether it be Platform, Software, or Infrastructure as a Service, the operational requirements within enterprises are being reduced by relying on third parties to run critical components of the infrastructure. To realize cost savings in this model, operational (aka employee) and capital (aka equipment) costs must be reduced for on-premises services.


Cloud (Private) - Due to the popularity of public cloud options, and the normalization of the dynamic/flexible infrastructure that they provide, organizations are demanding that their on-premises infrastructure operate in a similar fashion. Or in the case of hybrid cloud, operate in a coordinated fashion with public cloud resources. This means automation and orchestration are playing much larger roles in enterprise architectures. This also means that the traditional organizational structures of highly segmented skill specialties (systems, database, networking, etc.) are being consumed by engineers who have experience in multiple disciplines.


Commoditization - When I reference commoditization here, it isn’t about the ubiquity and standardization of hardware platforms. Instead, I’m talking about the way that enterprise C-level leadership is looking at technology within the organization. Fewer organizations are investing in true engineering/architecture resources, and instead are bringing those services in either via utilization of cloud infrastructure, or bringing this skill set on through consultation. The days of working your way from a help desk position up to a network architecture position within one organization are slowly fading away.


So what does all of this mean for you?

It’s time to skill up. Focusing on one specialty and mastering only that isn’t going to be as viable a career path as it once was. Breadth of knowledge across disciplines is going to help you stand out because organizations are starting to look for people who can help them manage their cloud initiatives. Take some time to learn how the large public cloud providers like AWS, Azure, and Google Compute operate and how to integrate organizations into them. Spend some time learning how hyperconverged platforms work and integrate into legacy infrastructures. Finally, learn how to script in an interpreted (non-compiled) programming language. Don’t take that as advice to change career paths and become a programmer.  That line of thinking is a bit overhyped in my opinion. However, you should be able to do simple automation tasks on your own, and modify other people’s code to do what you need. All of these skills are going to be highly sought after as enterprises move into more cloud-centric infrastructures.


Don’t forget a specialty. While a broad level of knowledge is going to be a prerequisite as we go forward, I still believe having a specialty in one or two specific areas will help from a career standpoint. We still need experts; we just need those experts to know more than their one little area of the infrastructure. Pick something you are good at and enjoy, and then learn it as deeply as you possibly can, all while keeping up with the infrastructure that touches/uses your specialty. Sounds easy, right?


Consider what your role will look like in 5-10 years. This speaks to the commoditization component of the trends listed above. If your aspiration is to work your way into an engineering or architecture-style role, the enterprise may not be the best place to do that as we move forward. My prediction is that we are going to see many of those types of roles move to cloud infrastructure companies, web scale organizations, resellers/consultants, and the technology vendors themselves. It’s going to get harder to find organizations that want to custom-design their infrastructure to match and enhance their business objectives, instead opting to keep administrative-level technicians on staff and leave the really fun work to outside entities. Keep this in mind when plotting your career trajectory.


Do nothing. This is bad advice, and not at all what I would recommend, but it is a path nonetheless. Organizations don’t turn on a dime (even though our tech likes to), so you probably have 5 to 10 years of coasting ahead. You might be able to eke out 15 if you can find an organization that is really change-averse and stubbornly attached to its own hardware. It won’t last forever, though, and if you aren’t retiring before the end of that coasting period, you’re likely going to find yourself in a very bad spot.


Final thoughts


I believe the general trend of enterprises viewing technology as a commodity, rather than a potential competitive advantage, is foolish and shortsighted. Technology has the ability to streamline, augment, and enhance the business processes that directly face a business’ customers. That being said, ignoring business trends is a good way to find yourself behind the curve, and recognizing reality doesn’t necessarily indicate that you agree with the direction. Be cognizant of the way that businesses are employing technology and craft a personal growth strategy that allows you to remain relevant, regardless of what those future decisions may be. Cloud skills are king in the new technology economy, so don’t be left without them. Focusing on automation and orchestration will help you stay relevant in the future, as well. Whatever it is that you choose to do, continue learning and challenging yourself and you should do just fine.

Network performance monitoring feels a bit like a moving target sometimes.  Just as we normalize processes and procedures for our monitoring platforms, some new technology comes around that turns things upside down again. The most recent change that seems to be forcing us to re-evaluate our monitoring platforms is cloud computing and dynamic workloads. Many years ago, a service lived on a single server, or multiple if it was really big. It may or may not have had redundant systems, but ultimately you could count on any traffic to/from that box to be related to that particular service.


That got turned on its head with the widespread adoption of virtualization. We started hosting many logical applications and services on one physical box. Network performance to and from that one server was no longer tied to a specific application, but generally speaking, these workloads remained in place unless something dramatic happened, so we had time to troubleshoot and remediate issues when they arose.


In comes the cloud computing model, DevOps, and the idea of an ephemeral workload. Rather than have one logical server (physical or virtual), large enough to handle peak workloads when they come up and highly underutilized otherwise, we are moving toward containerized applications that are horizontally scaled. This complicates things when we start looking at how to effectively monitor these environments.


So What Does This Mean For Network Performance Monitoring?


The old way of doing things simply will not work any longer. Assuming that a logical service can be directly associated with a piece of infrastructure is no longer possible. We’re going to have to create some new methods, as well as enhance some old ones, to extract the visibility we need out of the infrastructure.


What Might That Look Like?


Application Performance Monitoring

This is something we do today, and SolarWinds has an excellent suite of tools to make it happen. What needs to change is our perspective on the data these tools give us. In our legacy environments, we could poll an application every few minutes because not a lot changed between polling intervals. In the new model of system infrastructure, we have to assume that the application is scaled horizontally behind load balancers and that a poll only touched one of many deployed instances. Application polling and synthetic transactions will need to happen far more frequently to give us a broader picture of performance across all instances of that application.
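The idea of aggregating probe results across every instance, rather than trusting a single poll, can be sketched simply. Here is a hedged Python illustration (the threshold, instance names, and data shape are assumptions for the example, not any product's API) that averages synthetic-transaction latency per instance and flags the instances dragging the service down:

```python
# Sketch: summarizing synthetic-transaction latency across all
# instances of a horizontally scaled application. A single poll of one
# instance can't tell this story; the aggregate view can.
from statistics import mean

def summarize_probes(samples_by_instance, slow_ms=500.0):
    """samples_by_instance maps instance name -> list of latency
    samples in ms. Returns (overall average, sorted list of instances
    whose own average exceeds the slow_ms threshold)."""
    averages = {inst: mean(s) for inst, s in samples_by_instance.items()}
    overall = mean(averages.values())
    slow = sorted(i for i, avg in averages.items() if avg > slow_ms)
    return overall, slow
```

With frequent probing, a view like this shows that "the app is fine on average" can coexist with "instance b is in serious trouble" — exactly the distinction a single infrequent poll hides.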



Push-Based Telemetry

Rather than relying on polling to tell us about new configurations/instances/deployments on the network, we need the infrastructure to tell our monitoring systems about changes directly. Push rather than pull works much better when changes happen often and may be transient. We see a simple version of this in syslog today, but we need far better automated intelligence to help us correlate events across systems and analyze the data coming into the monitoring platform. This data will then need to be associated with our traditional polling infrastructure so we understand the impact of a piece of infrastructure going down or misbehaving. This will likely also include heuristic analysis to determine baseline operations and variations from that baseline. Manually reading logs every morning isn’t going to cut it as we move forward.
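The baseline-and-deviation idea mentioned above can be shown in a few lines. This is a deliberately naive sketch (real platforms use far more sophisticated heuristics, and the three-sigma threshold here is just a common starting assumption): flag a pushed measurement when it strays too far from the learned history.

```python
# Sketch: flag a new telemetry value that deviates more than k standard
# deviations from the historical mean. This is the crudest possible
# form of "baseline operations and variations from that baseline."
from statistics import mean, stdev

def deviates_from_baseline(history, value, k=3.0):
    """Return True if value is more than k stddevs from the mean of
    history. history needs at least two samples for a stddev."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma
```

A monitoring platform applies this kind of test per metric, per device, continuously — the human only gets involved when the deviation fires, instead of reading logs every morning.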


Traditional Monitoring

This doesn’t go away just because we’ve complicated things with a new form of application deployment. We will still need to monitor our infrastructure for up/down, throughput, errors/discards, CPU, etc.


Final Thoughts

Information Technology is an ever-changing field, so it makes sense that we’re going to have to adjust our methods over time. Some of these changes will be in how we implement the tools we have today, and some of them are going to require our vendors to give us better visibility into the infrastructure we’re deploying. Either way, these types of challenges are what make this work so much fun.

Malware has been an issue since shortly after the start of computing and isn't something that is going to go away anytime soon. Over the years, the motivations, sophistication, and appearance have changed, but the core tenets remain the same. The most recent iteration of malware is called ransomware. Ransomware takes control of the files on your computer, encrypts them with a key known only to the attacker, and then demands money (ransom) to unlock the files and return the system to normal.


Why is malware so successful? It’s all about trust. Users need to be trusted to some degree so that they can complete the work they need to do. Unfortunately, the more we entrust to the end user, the more ability a bad piece of software has to inflict damage on the local system and all the systems it’s attached to. Limiting how much of your systems/files/network can be modified by the end user can help mitigate this risk, but it has the side effect of inhibiting productivity and the ability to complete assigned work. It is often a difficult balancing act for businesses to determine how much security is enough, and malicious actors have been taking advantage of this to successfully implement their attacks. Now that these attacks have been systematically monetized, we're unlikely to see them diminish anytime soon.


So what can you do to move the balance back to your favor?


There are some well-established best practices that you should consider implementing in your systems if you haven't done so already. These practices are not foolproof, but if implemented well should mitigate all but the most determined of attackers and limit the scope of impact for those that do get through.


End-user Training: This has been recommended for ages and hasn't been the most effective tool in mitigating computer security risks. That being said, it still needs to be done. The safest way to mitigate the threat of malware is to avoid it altogether. Regularly training users to identify risky computing situations and how to avoid them is critical in minimizing risk to your systems.


Implement Thorough Filtering: This refers to both centralized and distributed filtering tools that automatically identify threats and stop users from making a mistake before any damage is done. Examples of centralized filtering are systems like web proxies, email spam/malware filtering, DNS filters, intrusion detection systems, and firewalls. Examples of local filtering include regularly updated anti-virus and anti-malware software. These filtering systems are only as good as the signatures they have, though, so regular definition updates are critical. Unfortunately, signatures can only be developed for known threats, so this too is not foolproof, but it’s a good way to ensure older, known variants aren't making it through to end users to be clicked on and run.
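To see why signature-based filtering both works and has limits, here is a stripped-down sketch of the core idea: hash the content and look it up in a set of known-bad signatures. Real products match on far richer indicators than whole-file hashes; the hashes and helper names here are illustrative only.

```python
# Sketch: the essence of signature-based filtering. A file whose hash
# appears in the known-bad set is blocked; anything else passes.
# This is exactly why a brand-new variant (new hash) sails through
# until definitions are updated.
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of the content, used as its 'signature'."""
    return hashlib.sha256(data).hexdigest()

def is_known_malware(data: bytes, signatures) -> bool:
    """True only if the content matches a known-bad signature."""
    return sha256_of(data) in signatures
```

One flipped byte in a sample produces a completely different digest, which is the structural reason regular definition updates are critical and why filtering alone can never be foolproof.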


The Principle of Least Privilege: This is exactly what it sounds like: easy to say, hard to implement, and the heart of the balance between security and usability. If a user has administrative access to anything, that account should never be used for day-to-day activities; the higher-privileged account should be used only when necessary. Users should only be granted write access to files and shares they actually need to modify; malware can't do much with files it can only read. Implementing software that either whitelists only specific applications, or blocks applications from running out of non-standard locations (temporary internet files, the downloads folder, etc.), can go a long way toward mitigating the threats that signature-based tools miss.
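The location-based blocking just described boils down to a simple path check. The sketch below illustrates the logic with made-up directory names; real application whitelisting is enforced by OS policy tooling, not a script, but the decision it makes looks like this:

```python
# Sketch: allow execution only from approved directories, so a binary
# dropped into a downloads or temp folder is refused even if no
# signature matches it. Directory list is hypothetical.
from pathlib import PurePosixPath

APPROVED_DIRS = [
    PurePosixPath("/usr/bin"),
    PurePosixPath("/opt/corp-apps"),
]

def execution_allowed(exe_path: str) -> bool:
    """True if exe_path lives under one of the approved directories."""
    path = PurePosixPath(exe_path)
    return any(
        parent == approved
        for approved in APPROVED_DIRS
        for parent in path.parents
    )
```

Note the complementary strength: the hash-based filter knows nothing about location, and this check knows nothing about content, which is why the two are layered rather than chosen between.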


Patch Your Systems: This is another very basic concept, but something that is often neglected. Many pieces of malware make use of vulnerabilities that are already patched by the vendor. Yes, patches sometimes break things. Yes, distributing patches on a large network can be cumbersome and time consuming. You simply don't have an option, though. It needs to be done.


Have Backups: If you do get infected with ransomware, and it is successful in encrypting local or networked files, backups are going to come to the rescue. You are doing backups regularly, right? You are testing restores of those backups, right? It sounds simple, but so many find out that their backup system isn't working when they need it the most. Don't make that mistake.
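Testing restores can be partially automated. Here is a hedged sketch (the manifest format and function names are invented for this example) of one basic check: record a checksum for every file at backup time, then verify the backup against that manifest before you ever need it in anger:

```python
# Sketch: verify a backup against a checksum manifest recorded at
# backup time. A file that is missing or whose hash has changed is
# reported, so you find out the backup is bad before ransomware does.
import hashlib
import json
import os

def write_manifest(backup_dir, manifest_path):
    """Record a sha256 digest for every file under backup_dir."""
    manifest = {}
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                rel = os.path.relpath(path, backup_dir)
                manifest[rel] = hashlib.sha256(fh.read()).hexdigest()
    with open(manifest_path, "w") as fh:
        json.dump(manifest, fh)

def verify_backup(backup_dir, manifest_path):
    """Return a sorted list of files that are missing or corrupted."""
    with open(manifest_path) as fh:
        manifest = json.load(fh)
    bad = []
    for rel, digest in manifest.items():
        path = os.path.join(backup_dir, rel)
        try:
            with open(path, "rb") as fh:
                if hashlib.sha256(fh.read()).hexdigest() != digest:
                    bad.append(rel)
        except FileNotFoundError:
            bad.append(rel)
    return sorted(bad)
```

A check like this is no substitute for a full restore test, but run on a schedule it catches the silent failures (truncated files, dead media) that otherwise only surface on restore day.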


Store Backups Offline: Backups that are stored online are at the same risk as the files they are backing up. Backups need to be written to removable media, and that media needs to be removed from the network and stored off-site. The more advanced ransomware variants look specifically to infect backup locations, as a functioning backup guarantees the attackers don't get paid. Don't let your last recourse become useless because you weren't diligent enough to move it offline and off-site.


Final Thoughts


For those of you who have been in this industry for any time (yes, I'm talking to you graybeards of the bunch), you'll recognize the above list of action items as a simple set of good practices for a secure environment.  However, I would be willing to bet you've worked in environments (yes, plural) that haven't followed one or more of these recommendations due to a lack of discipline or a lack of proper risk assessment skills. Regardless, these tried and true strategies still work because the problem hasn't changed. It still comes down to the blast radius of a malware attack being directly correlated with the amount of privilege you grant the end-users in the organizations you manage. Help your management understand this tradeoff and the tools you have in your arsenal to manage it, and you can find the sweet spot between usability and security.

The only constant in our industry is that technology is always changing. At times, it’s difficult to keep up with everything new being introduced while staying on top of your day-to-day duties. That challenge grows even harder if these new innovations diverge from the direction the company you work for is heading. Ignoring such change is a bad idea: failing to keep up with where the market is heading is a recipe for stagnation and eventual irrelevance. So how do you keep up when your employer doesn’t sponsor or encourage your education?


1) The first step is to come to the realization that you're going to need to spend some time outside of work learning new things. This can be difficult for a lot of reasons, but especially if you have a family or other outside obligations. Your career is a series of priorities though, and while it may/should not be the highest thing you prioritize, it has to at least be on the list.  Nobody is going to do the work for you, and if you don’t have the support of your organization, you’re going to have to carve out the time on your own.


2) Watch/listen/read/consume, a lot. Find people who are writing about the things you want to learn and read their blogs or books. Don’t just read their blogs, though. Add them to a program that harvests their RSS feeds so you are notified when they write new things. Find podcasts that address these new technologies and listen to them on your commute to/from work. Search YouTube to find people who are creating content around the things you want to learn. I have found the technology community to be very forthcoming with information about the things that they are working on. I’ve learned so much just from consuming the content that they create. These are very bright people sharing the things they are passionate about for free. The only thing it costs is your time. Some caution needs to be taken here though, as not everyone who creates content on the internet is right. Use the other resources to ask questions and validate the concepts learned from online sources.


3) Find others like you. The other thing that I have found about technology practitioners is that, contrary to the stereotype of awkward nerds, many love to be social and exist within an online community. There are people just like you hanging out on Twitter, in Slack groups, in forums, and other social places on the web. Engage with them and participate in the conversations. Part of the problem of new technology is that you don’t know what you don’t know. Something as simple as hearing an acronym/initialism that you haven’t heard before could lead you down a path of discovery and learning. Ask questions and see what comes back. Share your frustrations and see if others have found ways around them. The online community of technology practitioners is thriving. Don't miss the opportunity to join in and learn something from them.


4) Read vendor documentation. I know this one sounds dry, but it is often a good source of guidance on how a new technology is being implemented. It will often include the fundamental concepts you need to know in order to implement whatever it is you are learning about. Take terms you don’t understand and search for them. Look for key caveats in the way a vendor implements a technology; they will tell you a lot about its limitations. You do have to read between the lines a bit and filter out the vendor-specific material (unless you are looking to learn about a specific vendor), but this content is often free and incredibly comprehensive.


5) Pay for training. If all of the above doesn’t round out what you need to learn, you’re just going to have to invest in yourself and pay for some training. This can be daunting as week-long onsite courses can cost thousands of dollars. I wouldn’t recommend that route unless you absolutely need to. Take advantage of online computer-based training (CBT) from sites like CBT Nuggets, Pluralsight, and ITProTV. These sites typically have reasonable monthly or yearly subscription fees so you can consume as much content as your heart desires.


6) Practice, practice, practice. This is true for any learning type, but especially true when you’re going it alone. If at all possible, build a lab of what you’re trying to learn.  Utilize demo licenses and emulated equipment if you have to. Build virtual machines with free hypervisors like KVM so you can get hands-on experience with what you’re trying to learn. A lab is the only place where you are going to know for sure if you know your stuff or not. Build it, break it, fix it, and then do it all again. Try it from a different angle and test your assumptions. You can read all the content in the world, but if you can’t apply it, it isn’t going to help you much.


Final Thoughts


Independent learning can be time consuming and, at times, costly. It helps to realize that any investment of time or money is an investment in yourself and the skills you can bring to your next position or employer.  If done right, you’ll earn it back many times over by the salary increases you’ll see by bringing new and valuable skills to the table.  However, nobody is going to do it for you, so get out there and start finding the places where you can take those next steps.

In the first post of this series we took a look at the problems that current generation WANs don’t have great answers for.  In the second post of the series we looked at how SD-WAN is looking to solve some of the problems and add efficiencies to your WAN.


If you haven’t had a chance to do so already, I would recommend starting with the linked posts above before moving on to the content below.


In this third and final post of the series we are going to take a look at what pitfalls an SD-WAN implementation might introduce and what are some items you should be considering if you’re looking to implement SD-WAN in your networks.


Proprietary Technology


We've grown accustomed to deploying openly developed protocols in our networks, and SD-WAN takes a step backward when it comes to openness. Every vendor currently in the market has a significant level of lock-in with its technology. There is no interoperability between SD-WAN vendors, and nothing on the horizon suggests this will change. If you commit to Company X's solution, you will need to deploy Company X's product in every one of your offices if you want SD-WAN-level features available everywhere. Essentially, we are trading one type of lock-in (service-provider-run MPLS networks or private links) for another (the SD-WAN overlay provider). You will need to decide which lock-in is more limiting to your business and your budget. Which is more difficult to replace, the MPLS underlay or the proprietary overlay?


Cost Savings


The cost-savings argument is predicated on the idea that you will be willing to drop your expensive SLA-backed circuits and replace them with generic Internet bandwidth. What happens if you are unwilling to drop the SLA? The product isn't likely to produce any cost savings at all. There is no doubt that you will gain features you don't have now, but your organization will need to evaluate whether those features are worth the cost and lock-in that implementing SD-WAN incurs.


Vendor Survivability


We are approaching (and might be past at this point) 20 vendors claiming to provide SD-WAN solutions. There is no question that it is one of the hottest networking trends at the moment, and many vendors are looking to capitalize. Where will they be in a year? In five years? Will this fancy new solution you implemented be bought out by a competitor, only to be discarded a year or two down the line? How do you pick winners and losers in a market as contested as SD-WAN is today? I can't guarantee an answer here, but there are some clear leaders in the space and a handful of companies that haven't fully committed to the vision. If you are going to move forward with an SD-WAN deployment, you will need to factor in the organizational viability of the options you are considering. Unfortunately, not every technical decision gets to be made on the merit of the technical solution alone.


Scare Factor


SD-WAN is a brave new world with a lot of concepts that network engineering tradition tells us to be cautious of. Full automation and traffic re-routing have not been seamlessly implemented in previous iterations. Controller-based networks are a brand new concept on the wired side of the network. It's prudent for network engineers to take a hard look at the claims and verify the questionable ones before going all in. SD-WAN vendors by and large seem willing to provide proofs of concept and technical labs to back up their claims. Take advantage of these programs and put the tech through its paces before committing to an SD-WAN strategy.


It's New


Ultimately, it's a new approach, and nobody likes to play the role of guinea pig. The feature set is constantly evolving and improving. What you rely on today as a technical solution may not be available in future iterations of the product. The tools you have to solve a problem a couple of months from now may be wildly different from the tools you use today. These deployments also aren't as well tested as our traditional routing protocols. There is a lot about SD-WAN that is new and needs to be proven. Your tolerance for the risks of running new technology has to be taken into account when considering an SD-WAN deployment.


Final Thoughts


It’s undeniable that there are problems in our current generation of networks that traditional routing protocols haven’t effectively solved for us. The shift from localized decision making to a controller-based network design is significant enough to solve some of these long-standing and nagging issues. While the market is new, and a bit unpredictable, there is little doubt that controller-based networking is the direction things are moving, both in the data center and on the WAN. Also, if you look closely, you’ll find that these technologies don’t differ wildly from the controller-based wireless networks many organizations have been running for years. Because of this, I think it makes a lot of sense to pay close attention to what is happening in the SD-WAN space and consider what positive or negative impacts an implementation could bring to your organization.

This is the second installment of a three-part series discussing SD-WAN (Software Defined WAN), what current problems it may solve for your organization, and what new challenges it may introduce. Part 1 of the series, which discusses some of the drawbacks and challenges of our current WANs, can be found HERE.  If you haven’t already, I would recommend reading that post before proceeding.


Great!  Now that everyone has a common baseline on where we are now, the all-important question is…


Where are we going?


This is where SD-WAN comes into the picture. SD-WAN is a generic term for a controller-driven and orchestrated wide area network. I say it’s generic because there is no strict definition of what does and does not constitute an SD-WAN solution, and, as can be expected, every vendor approaches these challenges from its own unique perspective and strengths. While each approach has unique qualities, the reality is that they are all solving the same set of problems and consequently have converged on a set of similar solutions. Below we take a look at these “shared” SD-WAN concepts and how these changes in functionality can solve some of the challenges we’ve faced on the WAN for a long time.


Abstraction – This is at the heart of SD-WAN solutions, even though abstraction in and of itself isn't a solution to any particular problem. Think of abstraction the way you think about system virtualization. All the parts and pieces remain, but we separate the logic/processing (VM/OS) from the hardware (server). In the WAN scenario, we separate the logic (routing, path selection) from the underlying hardware (WAN links and traditional routing hardware).


The core benefit of abstraction is that it increases flexibility in routing decisions and reduces dependency on any one piece of underlying infrastructure. All of the topics below build upon this idea of separating the intelligence (overlay) from the devices responsible for forwarding the traffic (underlay). Additionally, abstraction reduces the impact of any one change in the underlay, again drawing parallels from the virtualization of systems architecture. Changing circuit providers or routing hardware in our current networks can be a time-consuming, costly, and challenging task. When those components exist as part of an underlay, migrating from one platform to another, or from one circuit provider to another, becomes a much simpler task.


Centralized Perspective - Unlike our current generation of WANs, SD-WAN networks almost universally utilize some sort of controller technology.  This centrally located controller is able to collect information on the entirety of the network and intelligently influence traffic based on analysis of the performance of all hardware and links.  These decisions then get pushed down to local routing devices to enforce the optimal routing policy determined by the controller.


This is a significant shift from what we are doing today, where each and every routing device makes decisions based on a very localized view of the network and is only aware of performance characteristics for the links it is directly connected to. By being able to see trouble many hops away from the source of the traffic, a centralized controller can route around it at the most opportune location, providing the best possible service level for the data flow.
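A minimal sketch of that centralized perspective, with a made-up topology and scoring function (not any vendor's actual algorithm), might look like this:

```python
# A controller sees telemetry for every link in the topology, not just
# locally attached ones, and picks the end-to-end path with the best score.

telemetry = {
    # (from, to): latency in ms, loss as a fraction
    ("branch", "hub1"): {"latency": 20, "loss": 0.00},
    ("branch", "hub2"): {"latency": 15, "loss": 0.00},
    ("hub1", "dc"):     {"latency": 10, "loss": 0.00},
    ("hub2", "dc"):     {"latency": 12, "loss": 0.05},  # trouble two hops away
}

paths = [
    [("branch", "hub1"), ("hub1", "dc")],
    [("branch", "hub2"), ("hub2", "dc")],
]

def score(path):
    # Lower is better: total latency plus a heavy penalty for packet loss.
    return sum(telemetry[hop]["latency"] + telemetry[hop]["loss"] * 1000
               for hop in path)

best = min(paths, key=score)
# The branch router alone would prefer hub2 (15 ms vs 20 ms on its local
# links), but the controller sees the loss on hub2->dc and steers via hub1.
print(best[0][1])
```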


Application Awareness - Application identification isn't exactly new to router platforms. What is new is the ability to make dynamic routing decisions based on specific applications, or even sub-components of those applications. Splitting traffic between links based on business criticality and ancillary business requirements has long been a request of both small and large shops alike. Implementing these policy-based routing decisions in current generation networks has almost always produced messy and unpredictable results.


Imagine being able to route SaaS traffic directly out to the internet (since we trust it and it doesn’t require additional security filtering), file sharing across your internet-based IPSec VPN (since performance isn’t as critical as for other applications), and voice/video across an MPLS line with an SLA (since performance, rather than overall bandwidth, is more important). Now add 5% packet loss on your MPLS link… SD-WAN solutions will dynamically shift your voice/video traffic to the IPSec VPN, since overall performance is better on that path. Application-centric routing, policy, and performance guarantees are significant advancements made possible by a centralized controller and abstraction.
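That voice/video scenario can be sketched as a tiny policy table; the application names, path labels, and loss thresholds below are illustrative only:

```python
# Each application has a preferred path, a fallback, and a loss threshold;
# when measured loss on the preferred path exceeds the threshold, traffic
# shifts to the fallback.

policies = {
    "saas":  {"preferred": "internet",  "fallback": "internet",  "max_loss": 1.00},
    "files": {"preferred": "ipsec-vpn", "fallback": "mpls",      "max_loss": 0.10},
    "voice": {"preferred": "mpls",      "fallback": "ipsec-vpn", "max_loss": 0.01},
}

# Telemetry shows 5% loss on the MPLS link.
measured_loss = {"internet": 0.00, "ipsec-vpn": 0.002, "mpls": 0.05}

def select_path(app):
    p = policies[app]
    if measured_loss[p["preferred"]] <= p["max_loss"]:
        return p["preferred"]
    return p["fallback"]

print(select_path("voice"))  # voice abandons the lossy MPLS link
```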


Real Time Error Detection/Telemetry – One of the most frustrating conditions to work around on today’s networks is a brown-out condition that doesn’t bring down a routing protocol neighbor relationship. While a look at the interface counters will tell you there is a problem, if the thresholds aren’t set correctly, manual intervention is required to route around it. Between the centralized visibility of both sides of the link and the collection/analysis of real-time telemetry data provided by a controller-based architecture, SD-WAN solutions have the ability to route around these brown-out conditions dynamically. Below are three different types of error conditions one might encounter on a network and how current networks and SD-WAN networks might react to them. This comparison assumes a branch with two unique uplink paths.


Black Out:  One link fully out of service.

Current Routers:  This is handled well by current equipment and protocols.  Traffic will fail over to the backup link and only return once service has been restored.

SD-WAN:  SD-WAN handles this in identical fashion.


Single Primary Link Brown Out:  Link degradation (packet loss or jitter) is occurring on only one of multiple links.

Current Routers: Traditional networks don't handle this condition well; until the packet loss is significant enough for routing protocols to fail over, all traffic will continue to use the degraded link, even with a non-degraded link available for use.

SD-WAN:  SD-WAN solutions have the advantage of centralized perspective and can detect these conditions without additional overhead of probe traffic.  Critical traffic can be moved to stable links, and if allowed in the policy, traffic more tolerant of brown out conditions can still use the degraded link.


Both Link Brown Out:  All available links are degraded.

Current Routers:  No remediation possible.  Traffic will traverse the best available link that can maintain a routing neighbor relationship.

SD-WAN:  Some SD-WAN solutions provide answers even for this condition.  Through a process commonly referred to as Forward Error Correction, traffic is duplicated and sent out all of your degraded links.  A small buffer is maintained on the receiving end and packets are re-ordered once they are received.  This can significantly improve application performance even across multiple degraded links.
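The duplicate-and-reorder behavior described above can be sketched roughly as follows. Real products add parity-based correction and time-bounded jitter buffers; this toy version uses deterministic drop patterns in place of random loss:

```python
# Sender copies every packet onto every degraded link; receiver uses
# sequence numbers to drop duplicates and restore ordering.

def send(packets, links, drop_every):
    """Duplicate every packet onto every link; each link drops some packets."""
    received = []
    for link in links:
        for seq, payload in packets:
            if seq % drop_every[link] != 0:  # deterministic stand-in for loss
                received.append((seq, payload))
    return received

def reassemble(received):
    """Drop duplicates and restore order using sequence numbers."""
    unique = {seq: payload for seq, payload in received}
    return [unique[seq] for seq in sorted(unique)]

packets = [(i, f"pkt{i}") for i in range(1, 11)]
# Link A drops every 3rd packet, link B every 4th; together they cover each
# other's gaps, so the application sees a clean, complete stream.
arrived = send(packets, ["a", "b"], {"a": 3, "b": 4})
print(reassemble(arrived))
```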


Regardless of the specific condition, the addition of a controller to the network gives a centralized perspective and the ability to identify and make routing decisions based on real-time performance data.


Efficient Use of Resources - This is the kicker, and I say that because all of the above solutions solve truly technical problems. This one hits home where most executives care the most. Due to the active/passive nature of current networks, companies that need redundancy are forced to purchase double their required bandwidth capacity and leave 50% of it idle when conditions are nominal. Current routing protocols just don't have the ability to easily utilize disparate WAN capacity and then fall back to a single link when necessary.


Is it better to pay for 200% of the capacity you need for the few occasions when you need it, or pay for 100% of what you need and deal with only 50% capacity when there is trouble?


To add to this argument, many SD-WAN providers are so confident in their solutions that they pitch dropping more expensive SLA-based circuits (MPLS/Direct) in favor of far cheaper generic internet bandwidth. If you are able to procure 10 times the bandwidth, split across three diverse providers, would your performance be better than a smaller circuit with guaranteed bandwidth, even with the anticipated oversubscription? These claims need to be proven out, but the intelligence that the controller-based overlay network gives you could very well negate the need to pay for provider-based performance promises.
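A back-of-the-envelope comparison makes the economics concrete. The prices below are entirely made up for illustration:

```python
# Hypothetical $/Mb/month figures -- actual pricing varies enormously.
mpls_price_per_mb = 10.0
internet_price_per_mb = 1.0

# Active/passive MPLS: buy 2 x 100 Mb, but only 100 Mb is usable at a time.
mpls_cost = 2 * 100 * mpls_price_per_mb
mpls_usable = 100

# Active/active internet: 3 x 200 Mb across diverse providers, all usable.
inet_cost = 3 * 200 * internet_price_per_mb
inet_usable = 3 * 200

print(f"MPLS:     ${mpls_cost:.0f}/mo for {mpls_usable} usable Mb")
print(f"Internet: ${inet_cost:.0f}/mo for {inet_usable} usable Mb")
```

Under these (invented) numbers, the internet design delivers six times the usable bandwidth at less than a third of the cost; the open question is whether the overlay's intelligence compensates for the missing SLA.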


Reading the above list could likely convince someone that SD-WAN is the WAN panacea we’ve all been waiting for. But, like all technological advancements, it’s never quite that easy. Controller-orchestrated WANs make a lot of sense in solving some of the more difficult problems we face with our current routing protocols, but no change comes without its own risks and challenges. Keep an eye out for the third and final installment in this series, where we will address the potential pitfalls associated with implementing an SD-WAN solution and discuss some ideas on how you might mitigate them.

In the world of networking, you would be hard-pressed to find a more pervasive and polarizing topic than SDN. The concept of controller-based, policy-driven, and application-focused networks has owned the headlines for several years as network vendors have attempted to create solutions that let everyone operate with the same optimization and automation as the large web-scale companies. The hype started in and around data center networks, but over the past year or so the focus has sharply shifted to the WAN, for good reason.


In this three-part series we are going to take a look at the challenges of current WAN technologies, what SD-WAN brings to the table, and what some drawbacks may be in pursuing an SD-WAN strategy for your network.


Where Are We Now?


In the first iteration of this series, we’re going to identify and discuss some of the limitations in and around WAN technology in today’s networks. The lists below are certainly not comprehensive, but speak to the general issues faced by network engineers when deploying, maintaining, and troubleshooting enterprise WANs.


Perspective – The core challenge in creating a policy-driven network is perspective. For the most part, routers in today's networks make decisions independent of the state of peer devices. While there certainly are protocols that share network state information (routing protocols being the primary example), actions based off of this exchanged information are exclusively determined through the lens of the router's localized perspective of the environment.


This can cause non-trivial challenges in coordinating desired traffic behavior, especially for patterns that don't follow the default/standard behavior a protocol would choose for you. Getting every router to make uniform decisions, each utilizing a different perspective, can be a difficult challenge and adds significant complexity depending on the policy being enforced.


Additionally, not every protocol shares every piece of information, so it is entirely possible that one router is making decisions off of considerably different information than what other routers may be using.


Application Awareness - Routing in current generation networks is remarkably simple. A router considers whether or not it is aware of the destination prefix and, if so, forwards the packet on to the next hop along the path. Information outside of the destination IP address is not considered when determining path selection. Deeper inspection of the packet payload is possible on most modern routers, but that information does not play into route selection decisions. Due to this limitation in how we identify forwarding paths, it is incredibly difficult to differentiate routing policy based on the application traffic being forwarded.
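A small example of destination-only forwarding, using a longest-prefix-match lookup (the prefixes and link names are hypothetical):

```python
# The lookup considers only the destination address, so voice and bulk file
# transfers headed to the same subnet take the same next hop.
import ipaddress

routing_table = {
    ipaddress.ip_network("10.0.0.0/8"): "wan1",
    ipaddress.ip_network("10.1.0.0/16"): "wan2",  # more specific route
}

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routing_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return routing_table[best]

# Same destination, same next hop -- regardless of which application sent it.
print(lookup("10.1.2.3"))  # wan2 for voice and for backups alike
```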


Error Detection/Failover – Error detection and failover in current generation routing protocols is a fairly binary process. Routers exchange information with their neighbors, and if they don’t hear from them within some pre-determined time window, they tear down the neighbor relationship and remove the information learned from that peer. Only at that point will a router choose what it considers to be an inferior path. This works well for black-out style conditions, but what happens when there is packet loss or significant jitter on the link? The answer is that current routing protocols do not take these conditions into consideration when choosing an optimal path. It is entirely possible for a link to have 10% packet loss, enough to significantly impact voice calls, and have the router plug along like everything is okay, since it never loses contact with its neighbor long enough to tear down the relationship and choose an alternate path. Meanwhile, a perfectly suitable alternative may sit idle, providing no value to the organization.
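A toy model of that binary behavior (the timers and loss patterns are illustrative, not tied to any specific protocol's defaults):

```python
# A neighbor is declared down only after the hold time expires with no
# hellos received; scattered loss resets the timer every time a hello
# does get through, so a lossy link stays "up".

HELLO_INTERVAL = 10  # seconds between hellos
HOLD_TIME = 40       # neighbor declared dead after this much silence

def adjacency_survives(hello_lost):
    """hello_lost: one boolean per hello interval (True = hello dropped)."""
    silence = 0
    for lost in hello_lost:
        silence = silence + HELLO_INTERVAL if lost else 0
        if silence >= HOLD_TIME:
            return False  # adjacency torn down, failover occurs
    return True           # link still considered healthy

# 10% loss spread across the window: drops are isolated, the hold timer
# never expires, and the router keeps using the degraded link.
lossy_but_up = [i % 10 == 0 for i in range(100)]  # every 10th hello lost
print(adjacency_survives(lossy_but_up))           # True -- link stays "up"

# Only a sustained outage (4+ consecutive lost hellos) triggers failover.
blackout = [False] * 10 + [True] * 4
print(adjacency_survives(blackout))               # False -- failover occurs
```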


Load Balancing/Efficiency - Also inherent in the way routing protocols choose links is the fact that all protocols look to identify the single best path (or paths, if they are of equal cost) and make it active, leaving all other paths passive until the active link(s) fail. EIGRP could be considered an exception to this rule, as it allows unequal-cost load balancing, but even that is less than ideal, since it won’t detect brown-out conditions on a primary link and move all traffic to the secondary. This means that organizations have to purchase far more bandwidth than necessary to ensure that each link, passive or active, can support all traffic at any point. Since routing protocols cannot load balance based on application characteristics, load balancing and failover are an all-or-nothing proposition.
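The active/passive behavior can be sketched as a flow hash spread over the equal-cost best paths only; names and costs here are invented:

```python
# Any higher-cost link sits idle no matter how much spare capacity it has.
import zlib

links = [
    {"name": "mpls-1", "cost": 10},
    {"name": "mpls-2", "cost": 10},
    {"name": "internet", "cost": 20},  # idle until both MPLS links fail
]

def active_paths(links):
    """Only the lowest-cost (equal-cost) links ever carry traffic."""
    best = min(link["cost"] for link in links)
    return [link["name"] for link in links if link["cost"] == best]

def pick_link(flow_id, links):
    """Per-flow hashing across the active set only."""
    candidates = active_paths(links)
    return candidates[zlib.crc32(flow_id.encode()) % len(candidates)]

print(active_paths(links))  # ['mpls-1', 'mpls-2'] -- internet carries nothing
```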


As stated previously, the above list is just a quick glance at some of the challenges faced in designing and managing the WAN in today’s enterprise network.  In the second part of this series we are going to take a look at what SD-WAN does that helps remediate many of the above challenges.  Also keep your eyes peeled for Part 3, which will close out the series by identifying some potential challenges surrounding SD-WAN solutions, and some final thoughts on how you might take your next step to improving your enterprise’s WAN.

Practitioners in nearly every technology field are facing revolutionary changes in the way systems and networks are built. Change, by itself, really isn't all that interesting. Those among us who have been doing this a while will recognize that technological change is one of the few reliable constants. What is interesting, however, is how things are changing.


Architects, engineers, and the vendors that produce gear for them have simply fallen in love with the concept of abstraction. The abstraction floodgates have swung open following the meteoric rise of the virtual machine in enterprise networks. As an industry, we have watched the abstraction of the operating system from the hardware it lives on give us an amazing amount of flexibility in the way we deploy and manage our systems. Now that the industry has fully embraced the concept of abstraction, we aim to implement it everywhere.


Breaking away from monolithic stack architecture


If we take a look at systems specifically, it used to be that the hardware, the operating system, and the application all existed as one logical entity.  If it was a large application, we might have components of the application split out across multiple hardware/OS combos, but generally speaking the stack was a unit. That single unit was something we could easily recognize and monitor as a whole. SNMP, while it has its limitations, has done a decent job of allowing operators to query the state of everything in that single stack.


Virtualization changed the game a bit as we decoupled the OS/application from the hardware. While it may not have been the most efficient way of doing it, we could still monitor the VM as we did when it was coupled with the hardware. This is because we hadn't really changed the architecture. Abstraction gave us some significant flexibility, but our applications still relied on the same components, arranged in a similar pattern to the bare-metal stacks we started with. The difference is that we now had two unique units where information collection was required: the hardware remained as it always had, and the OS/application became a secondary monitoring target. It took a little more configuration, but it didn't change the nature of the way we monitored the systems.


Cloud architecture changes everything


Then came the concept of cloud infrastructure. With it, developers began embracing the elastic nature of the cloud and started building their products to take advantage of it. Rather than sizing an application stack based off of guesstimates of the anticipated peak load, it can now be sized minimally and scaled out horizontally when needed by adding additional instances. Previously, just a handful of systems would have handled peak loads. Now those numbers could be dozens, or even hundreds of dynamically built systems scaled out based on demand. As the industry moves in this direction, our traditional means of monitoring simply do not provide enough information to let us know if our application is performing as expected.


The networking story is similar in a lot of ways. While networking has generally been resistant to change over the past couple of decades, the need for dynamic/elastic infrastructure is forcing networks to take several evolutionary steps rather quickly.  In order to support the cloud models that application developers have embraced, the networks of tomorrow will be built with application awareness, self-programmability, and moment-in-time best path selection as core components.


Much like in the systems world, abstraction is one of the primary keys to achieving this flexibility. Whether the new model of networks is built upon new protocols, or overlays of existing infrastructure, the traditional way of statically configuring networks is coming to an end. Rather than having statically assigned primary, secondary, and tertiary paths, networks will balance traffic based off of business policy, link performance, and application awareness. Fault awareness will be built in, and traffic flows will be dynamically routed around trouble points in the network. Knowing the status of the actual links themselves will become less important, much like physical hardware that applications use. Understanding network performance will require understanding the actual performance of the packet flows that are utilizing the infrastructure.


At the heart of the matter, the end goal appears to be ephemeral state of both network path selection as well as systems architecture.


So how does this change monitoring?


Abstraction inherently makes application and network performance harder to analyze. In the past, we could monitor hardware state, network link performance, CPU, memory, disk latency, logs, etc. and come up with a fairly accurate picture of what was going on with the applications using those resources. Distributed architectures negate the correlation between a single piece of underlying infrastructure and the applications that use it.  Instead, synthetic application transactions and real-time performance data will need to be used to determine what application performance really looks like. Telemetry is a necessary component for monitoring next generation system and network architectures.
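A synthetic transaction can be as simple as timing a probe against an application endpoint and classifying the result; the URL and threshold below are placeholders:

```python
# Rather than inferring health from host metrics, actively exercise the
# application path and measure what a user would experience.
import time
import urllib.request

def synthetic_check(url, timeout=5.0, slow_threshold=2.0):
    """Return (status, elapsed seconds) for a single application probe."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed = time.monotonic() - start
            if resp.status != 200:
                return ("error", elapsed)
            return ("slow" if elapsed > slow_threshold else "ok", elapsed)
    except OSError:
        # DNS failure, connection refused, timeout, HTTP error, etc.
        return ("down", time.monotonic() - start)

# Hypothetical health endpoint; a real deployment would probe from multiple
# vantage points and feed results into the monitoring system.
status, elapsed = synthetic_check("https://app.example.com/health")
print(f"{status} ({elapsed:.2f}s)")
```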


Does this mean that SNMP is going away?


While many practitioners wouldn't exactly shed a tear if they never needed to touch SNMP again, the answer is no. We still will have a need to monitor the underlying infrastructure even though it no longer gives us the holistic view that it once did. The widespread use of SNMP as the mechanism for monitoring infrastructure means it will remain a component of monitoring strategies for some time to come. Next generation monitoring systems will need to integrate the traditional SNMP methodologies with deeper levels of real-time application testing and awareness to ensure operators can remain aware of the environments they are responsible for managing.
