
The title of this post raises an important question and one that seems to be on the mind of everyone who works in an infrastructure role these days. How are automation and orchestration going to transform my role as an infrastructure engineer? APIs seem to be all the rage, and vendors are tripping over themselves to integrate northbound APIs, southbound APIs, dynamic/distributed workloads, and abstraction layers anywhere they can. What does it all mean for you and the way you run your infrastructure?

 

My guess is that it probably won’t impact your role all that much.

 

I can see the wheels turning already. Some of you are vehemently disagreeing with me and want to stop reading now, because you see every infrastructure engineer only interacting with an IDE, scripting all changes/deployments. Others of you are looking for validation for holding on to the familiar processes and procedures that have been developed over the years. Unfortunately, I think both of those approaches are flawed. Here’s why:

 

Do you need to learn to code? To some degree, yes! You need to learn to script and automate the repeatable tasks where a script will save you time. The thing is, this isn’t anything new. If you want to be an excellent infrastructure engineer, you’ve always needed to know how to script and automate tasks. If anything, the newly minted attention being paid to automation should make it less effort to achieve (anyone who’s had to write expect scripts for multiple platforms should be nodding their head at this point). A focus on automation doesn’t mean that you just now need to learn how to use these tools. It means that vendors are finally realizing the value and making the process easier for the end-user. If you don’t know how to script, you should pick a commonly used language and start learning it. I might suggest Python or PowerShell if you aren’t familiar with any languages just yet.
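
To make that concrete, here is a minimal sketch (in Python, standard library only) of the kind of repeatable task worth scripting: checking that a handful of services still answer on their usual ports. The hostnames and ports are placeholders for whatever you actually manage.

    #!/usr/bin/env python3
    """Quick reachability check for a list of services."""
    import socket

    # Hypothetical inventory: replace with the systems you actually manage.
    SERVICES = [
        ("core-sw1.example.com", 22),   # SSH to a switch
        ("intranet.example.com", 443),  # internal web app
        ("dns1.example.com", 53),       # DNS resolver (TCP check only)
    ]

    def is_reachable(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for host, port in SERVICES:
            status = "OK  " if is_reachable(host, port) else "FAIL"
            print(f"{status} {host}:{port}")

Drop something like this into cron or a scheduled task and you have automated a check you were probably doing by hand.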

 

Do you need to re-tool and become a programmer? Absolutely not! Programming is a skill in and of itself, and infrastructure engineers will not need to be full-fledged programmers as we move forward. By all means, if you want to shift careers, go for it. We need full-time programmers who understand how infrastructure really works. But automation and orchestration aren’t going to demand that every engineer learn how to write their own compilers, optimize their code for obscure processors, or make their code operate across multiple platforms. If you are managing infrastructure through scripting, and you aren’t the size of Google, that level of optimization and reusability isn’t necessary to see significant improvement in your processes. You won’t be building the platforms, just tweaking them to do your will.

 

Speaking of platforms, this is the main reason why I don’t think your job is really going to change that much. We’re in the early days of serious infrastructure automation. As the market matures, vendors are going to be offering more and more advanced orchestration platforms as part of their product catalog. You are likely going to interface with these platforms via a web front end or a CLI, not necessarily through scripts or APIs. Platforms will have easy-to-use front ends with an engine on the back end that does the scripting and API calls for you. Think about this in terms of Amazon AWS. Their IaaS products are highly automated and orchestrated, but you primarily control that automation from a web control panel. Sure, you can dig in and start automating some of your own calls, but that isn’t really required by the large majority of organizations. This is going to be true for on-premises equipment moving forward as well.

 

Final Thoughts

 

Is life for the infrastructure engineer going to drastically change because of a push for automation? I don’t think so. That being said, scripting is a skill that you need in your toolbox if you want to be a serious infrastructure engineer. The nice thing about automation and scripting is that it requires predictability and standardization of your configurations, and this leads to stable and predictable systems. On the other hand, if scripting and automation sound like something you would enjoy doing as the primary function of your job, the market has never been better or had more opportunities to do it full time. We need people writing code who have infrastructure management experience.

 

Of course, I could be completely wrong about all of this, and I would love to hear your thoughts in the comments either way.

There’s no question that trends in IT change on a dime and have done so for as long as technology has been around. The hallmark of a truly talented IT professional is the ability to adapt to those ever-present changes and remain relevant, regardless of the direction that the winds of hype are pushing us this week. It’s challenging and daunting at times, but adaptation is just part of the gig in IT engineering.

 

Where are we headed?

 

Cloud (Public) - Organizations are adopting public cloud services in greater numbers than ever. Whether it be Platform, Software, or Infrastructure as a Service, the operational requirements within enterprises are being reduced by relying on third parties to run critical components of the infrastructure. To realize cost savings in this model, operational (aka employee) and capital (aka equipment) costs must be reduced for on-premises services.

 

Cloud (Private) - Due to the popularity of public cloud options, and the normalization of the dynamic/flexible infrastructure that they provide, organizations are demanding that their on-premises infrastructure operate in a similar fashion. Or, in the case of hybrid cloud, operate in a coordinated fashion with public cloud resources. This means automation and orchestration are playing much larger roles in enterprise architectures. This also means that the traditional organizational structure of highly segmented skill specialties (systems, database, networking, etc.) is giving way to engineers who have experience across multiple disciplines.

 

Commoditization - When I reference commoditization here, it isn’t about the ubiquity and standardization of hardware platforms. Instead, I’m talking about the way that enterprise C-level leadership is looking at technology within the organization. Fewer organizations are investing in true engineering/architecture resources, and are instead sourcing those services either through cloud infrastructure or by bringing the skill set in through consultants. The days of working your way from a help desk position up to a network architecture position within one organization are slowly fading away.

 

So what does all of this mean for you?

It’s time to skill up. Focusing on one specialty and mastering only that isn’t going to be as viable a career path as it once was. Breadth of knowledge across disciplines is going to help you stand out because organizations are starting to look for people who can help them manage their cloud initiatives. Take some time to learn how the large public cloud providers like AWS, Azure, and Google Compute operate and how to integrate organizations into them. Spend some time learning how hyperconverged platforms work and integrate into legacy infrastructures. Finally, learn how to script in an interpreted (non-compiled) programming language. Don’t take that as advice to change career paths and become a programmer.  That line of thinking is a bit overhyped in my opinion. However, you should be able to do simple automation tasks on your own, and modify other people’s code to do what you need. All of these skills are going to be highly sought after as enterprises move into more cloud-centric infrastructures.
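
As a taste of what that kind of simple automation can look like, here is a short Python sketch that lists EC2 instances and their state using the AWS SDK (boto3). It assumes boto3 is installed and AWS credentials are already configured on the machine running it; the region is a placeholder.

    #!/usr/bin/env python3
    """List EC2 instances and their state in one region."""
    import boto3

    # Assumes credentials are already configured (e.g., ~/.aws/credentials).
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # describe_instances() returns reservations, each containing instances.
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            name = next(
                (t["Value"] for t in instance.get("Tags", []) if t["Key"] == "Name"),
                "(unnamed)",
            )
            print(instance["InstanceId"], instance["State"]["Name"], name)

Being able to read, run, and lightly modify something like this is the level of scripting skill worth aiming for.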

 

Don’t forget a specialty. While a broad level of knowledge is going to be a prerequisite as we go forward, I still believe having a specialty in one or two specific areas will help from a career standpoint. We still need experts; we just need those experts to know more than just their one little area of the infrastructure. Pick something you are good at and enjoy, and then learn it as deeply as you possibly can, all while keeping up with the infrastructure that touches/uses your specialty. Sounds easy, right?

 

Consider what your role will look like in 5-10 years. This speaks to the commoditization component of the trends listed above. If your aspiration is to work your way into an engineering or architecture-style role, the enterprise may not be the best place to do that as we move forward. My prediction is that we are going to see many of those types of roles move to cloud infrastructure companies, web-scale organizations, resellers/consultants, and the technology vendors themselves. It’s going to get harder to find organizations that want to custom-design their infrastructure to match and enhance their business objectives; most will instead opt to keep administrative-level technicians on staff and leave the really fun work to outside entities. Keep this in mind when plotting your career trajectory.

 

Do nothing. This is bad advice, and not at all what I would recommend, but it is a path you could take. Organizations don’t turn on a dime (even though our tech likes to), so you probably have 5 to 10 years of coasting ahead. You might be able to eke out 15 if you can find an organization that is really change-averse and stubbornly attached to their own hardware. It won’t last forever, though, and if you aren’t retiring before the end of that coasting period, you’re likely going to find yourself in a very bad spot.

 

Final thoughts

 

I believe the general trend of enterprises viewing technology as a commodity, rather than a potential competitive advantage, is foolish and shortsighted. Technology has the ability to streamline, augment, and enhance the business processes that directly face a business’ customers. That being said, ignoring business trends is a good way to find yourself behind the curve, and recognizing reality doesn’t necessarily indicate that you agree with the direction. Be cognizant of the way that businesses are employing technology and craft a personal growth strategy that allows you to remain relevant, regardless of what those future decisions may be. Cloud skills are king in the new technology economy, so don’t be left without them. Focusing on automation and orchestration will help you stay relevant in the future, as well. Whatever it is that you choose to do, continue learning and challenging yourself and you should do just fine.

Network performance monitoring feels a bit like a moving target sometimes.  Just as we normalize processes and procedures for our monitoring platforms, some new technology comes around that turns things upside down again. The most recent change that seems to be forcing us to re-evaluate our monitoring platforms is cloud computing and dynamic workloads. Many years ago, a service lived on a single server, or multiple if it was really big. It may or may not have had redundant systems, but ultimately you could count on any traffic to/from that box to be related to that particular service.

 

That got turned on its head with the widespread adoption of virtualization. We started hosting many logical applications and services on one physical box. Network performance to and from that one server was no longer tied to a specific application, but generally speaking, these workloads remained in place unless something dramatic happened, so we had time to troubleshoot and remediate issues when they arose.

 

Enter the cloud computing model, DevOps, and the idea of the ephemeral workload. Rather than having one logical server (physical or virtual) that is large enough to handle peak workloads but highly underutilized otherwise, we are moving toward containerized applications that are scaled horizontally. This complicates things when we start looking at how to effectively monitor these environments.

 

So What Does This Mean For Network Performance Monitoring?

 

The old way of doing things simply will not work any longer. Assuming that a logical service can be directly associated with a piece of infrastructure is no longer possible. We’re going to have to create some new methods, as well as enhance some old ones, to extract the visibility we need out of the infrastructure.

 

What Might That Look Like?

 

Application Performance Monitoring

This is something that we do today, and SolarWinds has an excellent suite of tools to make it happen. What needs to change is our perspective on the data that these tools are giving us. In our legacy environments, we could poll an application every few minutes because not a lot changes between polling intervals. In the new model of system infrastructure, we have to assume that the application is scaled horizontally behind load balancers and that any single poll only touched one of many deployed instances. Application polling and synthetic transactions will need to happen far more frequently to give us a broader picture of performance across all instances of that application.
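
As a rough illustration (not a replacement for a real monitoring suite), here is a Python sketch of a frequent synthetic probe against a load-balanced service. The URL, interval, and sample count are hypothetical; the point is that many short-interval probes, each potentially landing on a different instance, tell you more than one poll every few minutes.

    #!/usr/bin/env python3
    """Frequent synthetic probes against a load-balanced service."""
    import time
    import urllib.request

    URL = "https://app.example.com/healthz"   # placeholder endpoint
    INTERVAL_SECONDS = 10
    SAMPLES = 30

    for _ in range(SAMPLES):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=5) as response:
                elapsed_ms = (time.monotonic() - start) * 1000
                print(f"HTTP {response.status} in {elapsed_ms:.1f} ms")
        except OSError as err:
            print(f"probe failed: {err}")
        time.sleep(INTERVAL_SECONDS)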

 

Telemetry

Rather than relying on polling to tell us about new configurations/instances/deployments on the network, we need the infrastructure to tell our monitoring systems about changes directly. Push rather than pull works much better when changes happen often and may be transient. We see a simple version of this in syslog today, but we need far better automated intelligence to help us correlate events across systems and analyze the data coming into the monitoring platform. This data will then need to be associated with our traditional polling data to understand the impact of a piece of infrastructure going down or misbehaving. This will likely also include heuristic analysis to determine baseline operations and variations from that baseline. Manually reading logs every morning isn’t going to cut it as we move forward.
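
To illustrate the kind of baseline analysis described above, here is a toy Python sketch that flags samples deviating from a trailing baseline. The data and thresholds are invented, and a real telemetry pipeline would receive streamed or pushed data rather than a hard-coded list.

    #!/usr/bin/env python3
    """Flag metric samples that deviate from a learned baseline."""
    import statistics

    def deviations(samples, window=20, threshold=3.0):
        """Yield (index, value) pairs that sit more than `threshold`
        standard deviations away from the trailing-window baseline."""
        for i in range(window, len(samples)):
            baseline = samples[i - window:i]
            mean = statistics.mean(baseline)
            stdev = statistics.pstdev(baseline) or 1e-9  # avoid divide-by-zero
            if abs(samples[i] - mean) / stdev > threshold:
                yield i, samples[i]

    if __name__ == "__main__":
        # Interface utilization (%) with an obvious spike at the end.
        utilization = [42, 40, 43, 41, 44, 42, 43, 40, 41, 42,
                       43, 41, 42, 44, 40, 42, 43, 41, 42, 43, 97]
        for index, value in deviations(utilization):
            print(f"sample {index}: {value}% is outside the baseline")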

 

Traditional Monitoring

This doesn’t go away just because we’ve complicated things with a new form of application deployment. We still will need to keep monitoring our infrastructure for up/down, throughput, errors/discards, CPU, etc.

 

Final Thoughts

Information Technology is an ever-changing field, so it makes sense that we’re going to have to adjust our methods over time. Some of these changes will be in how we implement the tools we have today, and some of them are going to require our vendors to give us better visibility into the infrastructure we’re deploying. Either way, these types of challenges are what make this work so much fun.

Malware is an issue that has been around since shortly after the start of computing and isn't something that is going to go away anytime soon. Over the years, the motivations, sophistication, and appearance have changed, but the core tenets remain the same. The most recent iteration of malware is called ransomware. Ransomware is software that takes control of the files on your computer, encrypts them with a key known only to the attacker, and then demands money (ransom) in order to unlock the files and return the system to normal.

 

Why is malware so successful? It’s all about trust. Users need to be trusted to some degree so that they can complete the work that they need to do. Unfortunately, the more we entrust to the end-user, the more ability a bad piece of software has to inflict damage to the local system and all the systems it’s attached to. Limiting how much of your systems/files/network can be modified by the end-user can help mitigate this risk, but it has the side effect of inhibiting productivity and the ability to complete assigned work. Often it’s a catch-22 for businesses to determine how much security is enough, and malicious actors have been taking advantage of this balancing act to successfully implement their attacks. Now that these attacks have been systematically monetized, we're unlikely to see them diminish anytime soon.

 

So what can you do to move the balance back to your favor?

 

There are some well-established best practices that you should consider implementing in your systems if you haven't done so already. These practices are not foolproof, but if implemented well should mitigate all but the most determined of attackers and limit the scope of impact for those that do get through.

 

End-user Training: This has been recommended for ages and hasn't been the most effective tool in mitigating computer security risks. That being said, it still needs to be done. The safest way to mitigate the threat of malware is to avoid it altogether. Regularly training users to identify risky computing situations and how to avoid them is critical in minimizing risk to your systems.

 

Implement Thorough Filtering: This references both centralized and distributed filtering tools that are put in place to automatically identify threats and stop users from making a mistake before they can cause any damage. Examples of centralized filtering would be systems like web proxies, email spam/malware filtering, DNS filters, intrusion detection systems, and firewalls. Examples of local filtering include regularly updated anti-virus and anti-malware software. These filtering systems are only as good as the signatures they have, though, so regular definition updates are critical. Unfortunately, signatures can only be developed for known threats, so this isn't foolproof either, but it’s a good tool to help ensure older, known variants aren't making it through to end-users to be clicked on and run.

 

The Principle of Least Privilege: This is exactly what it sounds like: easy to say, hard to implement, and at the heart of the balance between security and usability. If a user has administrative access to anything, they should never be logged in for day-to-day activities with that account and should be using the higher-privileged account only when necessary. Users should only be granted write access to files and shares that they need write access to. Malware can't do anything with files it can only read. Implementing software that either whitelists only specific applications, or blacklists applications run from non-standard locations (temporary internet files, the downloads folder, etc.), can go a long way in mitigating the threats that signature-based tools miss.

 

Patch Your Systems: This is another very basic concept, but something that is often neglected. Many pieces of malware make use of vulnerabilities that are already patched by the vendor. Yes, patches sometimes break things. Yes, distributing patches on a large network can be cumbersome and time consuming. You simply don't have an option, though. It needs to be done.

 

Have Backups: If you do get infected with ransomware, and it is successful in encrypting local or networked files, backups are going to come to the rescue. You are doing backups regularly, right? You are testing restores of those backups, right? It sounds simple, but so many find out that their backup system isn't working when they need it the most. Don't make that mistake.
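
If you want a starting point for restore testing, here is a minimal Python sketch that spot-checks a test restore against the original files by comparing SHA-256 hashes. The directory paths are placeholders, and a real verification job would check far more than file contents.

    #!/usr/bin/env python3
    """Spot-check a test restore against the original files."""
    import hashlib
    from pathlib import Path

    ORIGINAL = Path("/data/finance")            # hypothetical source share
    RESTORED = Path("/restore-test/finance")    # hypothetical test restore

    def sha256(path):
        """Return the SHA-256 digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    mismatches = 0
    for source in ORIGINAL.rglob("*"):
        if source.is_file():
            copy = RESTORED / source.relative_to(ORIGINAL)
            if not copy.is_file() or sha256(source) != sha256(copy):
                mismatches += 1
                print(f"MISMATCH: {source}")

    print(f"{mismatches} file(s) failed verification")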

 

Store Backups Offline: Backups that are stored online are at the same risk as the files they are backing up. Backups need to be stored on removable media, and that media then needs to be removed from the network and stored off-site. The more advanced ransomware variants look specifically to infect backup locations, as a functioning backup guarantees the attackers don't get paid. Don't let your last recourse become useless because you weren't diligent enough to move your backups offline and off-site.

 

Final Thoughts

 

For those of you who have been in this industry for any length of time (yes, I'm talking to you, graybeards of the bunch), you'll recognize the above list of action items as a simple set of good practices for a secure environment.  However, I would be willing to bet you've worked in environments (yes, plural) that haven't followed one or more of these recommendations due to a lack of discipline or a lack of proper risk assessment skills. Regardless, these tried-and-true strategies still work because the problem hasn't changed. It still comes down to the blast radius of a malware attack being directly correlated with the amount of privilege you grant the end-users in the organizations you manage. Help your management understand this tradeoff and the tools you have in your arsenal to manage it, and you can find the sweet spot between usability and security.

The only constant truth in our industry is that technology is always changing.  At times, it’s difficult to keep up with everything new that is being introduced while staying on top of your day-to-day duties.  That challenge grows even harder if these new innovations diverge from the direction that the company you work for is heading.  Ignoring such change is a bad idea. Failing to keep up with where the market is heading is a recipe for stagnation and eventual irrelevance. So how do you keep up with these things when your employer doesn’t sponsor or encourage your education?

 

1) The first step is to come to the realization that you're going to need to spend some time outside of work learning new things. This can be difficult for a lot of reasons, but especially if you have a family or other outside obligations. Your career is a series of priorities, though, and while it may not (and probably should not) be the highest thing you prioritize, it has to at least be on the list.  Nobody is going to do the work for you, and if you don’t have the support of your organization, you’re going to have to carve out the time on your own.

 

2) Watch/listen/read/consume, a lot. Find people who are writing about the things you want to learn and read their blogs or books. Don’t just read their blogs, though. Add them to a program that harvests their RSS feeds so you are notified when they write new things. Find podcasts that address these new technologies and listen to them on your commute to/from work. Search YouTube to find people who are creating content around the things you want to learn. I have found the technology community to be very forthcoming with information about the things that they are working on. I’ve learned so much just from consuming the content that they create. These are very bright people sharing the things they are passionate about for free. The only thing it costs is your time. Some caution needs to be taken here though, as not everyone who creates content on the internet is right. Use the other resources to ask questions and validate the concepts learned from online sources.

 

3) Find others like you. The other thing that I have found about technology practitioners is that, contrary to the stereotype of awkward nerds, many love to be social and exist within an online community. There are people just like you hanging out on Twitter, in Slack groups, in forums, and other social places on the web. Engage with them and participate in the conversations. Part of the problem of new technology is that you don’t know what you don’t know. Something as simple as hearing an acronym/initialism that you haven’t heard before could lead you down a path of discovery and learning. Ask questions and see what comes back. Share your frustrations and see if others have found ways around them. The online community of technology practitioners is thriving. Don't miss the opportunity to join in and learn something from them.

 

4) Read vendor documentation. I know this one sounds dry, but it is often a good source for guidance on how a new technology is being implemented. Often it will include the fundamental concepts you need to know in order to implement whatever it is that you are learning about. Take terms that you don’t understand and search for them.  Look for the caveats in the way a vendor implements a technology; they will tell you a lot about its limitations. You do have to read between the lines a bit, and filter out the vendor-specific stuff (unless you are looking to learn about a specific vendor), but this content is often free and incredibly comprehensive.

 

5) Pay for training. If all of the above doesn’t round out what you need to learn, you’re just going to have to invest in yourself and pay for some training. This can be daunting as week-long onsite courses can cost thousands of dollars. I wouldn’t recommend that route unless you absolutely need to. Take advantage of online computer-based training (CBT) from sites like CBT Nuggets, Pluralsight, and ITProTV. These sites typically have reasonable monthly or yearly subscription fees so you can consume as much content as your heart desires.

 

6) Practice, practice, practice. This is true for any learning type, but especially true when you’re going it alone. If at all possible, build a lab of what you’re trying to learn.  Utilize demo licenses and emulated equipment if you have to. Build virtual machines with free hypervisors like KVM so you can get hands-on experience with what you’re trying to learn. A lab is the only place where you are going to know for sure if you know your stuff or not. Build it, break it, fix it, and then do it all again. Try it from a different angle and test your assumptions. You can read all the content in the world, but if you can’t apply it, it isn’t going to help you much.
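
As one small example of treating the lab itself as something you can script, here is a Python sketch that lists the virtual machines on a local KVM host. It assumes the libvirt Python bindings are installed and that a libvirt/KVM hypervisor is reachable at qemu:///system; adjust the URI for your own lab.

    #!/usr/bin/env python3
    """List the virtual machines in a local KVM lab."""
    import libvirt

    # A read-only connection is enough for listing; adjust the URI as needed.
    conn = libvirt.openReadOnly("qemu:///system")
    try:
        for domain in conn.listAllDomains():
            state = "running" if domain.isActive() else "shut off"
            print(f"{domain.name():20s} {state}")
    finally:
        conn.close()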

 

Final Thoughts

 

Independent learning can be time consuming and, at times, costly. It helps to realize that any investment of time or money is an investment in yourself and the skills you can bring to your next position or employer.  If done right, you’ll earn it back many times over by the salary increases you’ll see by bringing new and valuable skills to the table.  However, nobody is going to do it for you, so get out there and start finding the places where you can take those next steps.

In the first post of this series we took a look at the problems that current generation WANs don’t have great answers for.  In the second post of the series we looked at how SD-WAN is looking to solve some of the problems and add efficiencies to your WAN.

 

If you haven’t had a chance to do so already, I would recommend starting with the linked posts above before moving on to the content below.

 

In this third and final post of the series we are going to take a look at what pitfalls an SD-WAN implementation might introduce and what are some items you should be considering if you’re looking to implement SD-WAN in your networks.

 

Proprietary Technology

 

We've grown accustomed to being able to deploy openly developed protocols in our networks, and SD-WAN takes a step backwards when it comes to openness.  Every vendor currently in the market has a significant level of lock-in when it comes to their technology.  There is no interoperability between SD-WAN vendors and nothing on the horizon that looks like this fact will change.  If you commit to Company X's solution, you will need to implement the Company X product in every one of your offices if you want SD-WAN-level features available.  Essentially, we are trading one type of lock-in (service-provider-run MPLS networks or private links) for another (the SD-WAN overlay provider). You will need to make a decision about which lock-in is more limiting to your business and your budget.  Which lock-in is more difficult to replace, the MPLS underlay or the proprietary overlay?

 

Cost Savings

 

The cost savings argument is predicated on the idea that you will be willing to drop your expensive SLA-backed circuits and replace them with generic Internet bandwidth.  What happens if you are unwilling to drop the SLA? Well, the product isn't likely to deliver any cost savings at all.  There is no doubt that you will have access to features that you don't have now, but your organization will need to evaluate whether those features are worth the cost and lock-in that implementing SD-WAN incurs.

 

Vendor Survivability

 

We are approaching (and might be past at this point) 20 vendors claiming to provide SD-WAN solutions. There is no question that it is one of the hottest networking trends at the moment, and many vendors are looking to cash in.  Where will they be in a year?  In five years? Will this fancy new solution that you implemented be bought out by a competitor, only to be discarded a year or two down the line?  How do you pick winners and losers in a market as contested as SD-WAN currently is?  I can't guarantee an answer here, but there are some clear leaders in the space and a handful of companies that haven't fully committed to the vision.  If you are going to move forward with an SD-WAN deployment, you will need to factor in the organizational viability of the options you are considering.  Unfortunately, not every technical decision gets to be made on the merit of the technical solution alone.

 

Scare Factor

 

SD-WAN is a brave new world with a lot of concepts that network engineering tradition tells us to be cautious of.  Full automation and traffic re-routing have not been seamlessly implemented in previous iterations.  Controller-based networks are a brand new concept on the wired side of the network. It's prudent for network engineers to take a hard look at the claims and verify the questionable ones before going all in.  SD-WAN vendors by and large seem willing to provide proofs of concept and technical labs to back up their claims.  Take advantage of these programs and put the tech through its paces before committing to an SD-WAN strategy.

 

It's New

 

Ultimately, it's a new approach, and nobody likes to play the role of guinea pig.  The feature set is constantly evolving and improving.  What you rely on today as a technical solution may not be available in future iterations of the product.  The tools you have to solve a problem a couple of months from now may be wildly different from the tools you use today.  These deployments also aren't as well tested as our traditional routing protocols.  There is a lot about SD-WAN that is new and needs to be proven.  Your tolerance for the risks of running new technology has to be taken into account when considering an SD-WAN deployment.

 

Final Thoughts

 

It’s undeniable that there are problems in our current generation of networks that traditional routing protocols haven’t effectively solved for us.  The shift from a localized perspective on decision making to a controller-based network design is significant enough to be able to solve some of these long-standing and nagging issues.  While the market is new, and a bit unpredictable, there is little doubt that controller-based networking is the direction things are moving, both in the data center and the WAN.  Also, if you look closely enough, you’ll find that these technologies don’t differ wildly from the controller-based wireless networks many organizations have been running for years.  Because of this, I think it makes a lot of sense to pay close attention to what is happening in the SD-WAN space and consider what positive or negative impacts an implementation could bring to your organization.

This is the second installment of a three-part series discussing SD-WAN (Software Defined WAN), what current problems it may solve for your organization, and what new challenges it may introduce. Part 1 of the series, which discusses some of the drawbacks and challenges of our current WANs, can be found HERE.  If you haven’t already, I would recommend reading that post before proceeding.

 

Great!  Now that everyone has a common baseline on where we are now, the all-important question is…

 

Where are we going?

 

This is where SD-WAN comes into the picture.  SD-WAN is a generic term for a controller-driven and orchestrated wide area network.  I say it’s generic because there is no strict definition of what does and does not constitute an SD-WAN solution, and, as you would expect, every vendor approaches these challenges from their own unique perspective and strengths.  While the approaches do have unique qualities about them, the reality is that they are all solving the same set of problems and have consequently converged on a set of similar solutions.  Below, we are going to take a look at these “shared” SD-WAN concepts and how these changes in functionality can solve some of the challenges we’ve been facing on the WAN for a long time.

 

Abstraction – This is at the heart of SD-WAN solutions, even though abstraction in and of itself isn't a solution to any particular problem. Think of abstraction the way you think about system virtualization.  All the parts and pieces remain, but we separate the logic/processing (VM/OS) from the hardware (server).  In the WAN scenario, we are separating the logic (routing, path selection) from the underlying hardware (WAN links and traditional routing hardware).

 

The core benefit of abstraction is that it increases flexibility in route decisions and reduces dependency on any one piece of underlying infrastructure.  All of the topics below build upon this idea of separating the intelligence (overlay) from the devices responsible for forwarding that traffic (underlay).  Additionally, abstraction reduces the impact of any one change in the underlay, again drawing parallels from the virtualization of systems architecture.  Changing circuit providers or routing hardware in our current networks can be a time-consuming, costly, and challenging task.  When those components exist as part of an underlay, migrating from one platform to another, or from one circuit provider to another, becomes a much simpler task.

 

Centralized Perspective - Unlike our current generation of WANs, SD-WAN networks almost universally utilize some sort of controller technology.  This centrally located controller is able to collect information on the entirety of the network and intelligently influence traffic based on analysis of the performance of all hardware and links.  These decisions then get pushed down to local routing devices to enforce the optimal routing policy determined by the controller.

 

This is a significant shift from what we are doing today, where each and every routing device makes decisions off of a very localized view of the network and is only aware of performance characteristics for the links it is directly connected to.  By being able to see trouble many hops away from the source of the traffic, a centralized controller can route around it at the most opportune location, providing the best possible service level for the data flow.

 

Application Awareness - Application identification isn't exactly new to router platforms.  What is new is the ability to make dynamic routing decisions based off of specific applications, or even sub-components of those applications.  Splitting traffic between links based off of business criticality and ancillary business requirements has long been a request of small and large shops alike.  Implementing these policy-based routing decisions in current-generation networks has almost always produced messy and unpredictable results.

 

Imagine being able to route SaaS traffic directly out to the internet (since we trust it and it doesn’t require additional security filtering), file sharing across your internet-based IPSec VPN (since performance isn’t as critical as for other applications), and voice/video across an MPLS line with an SLA (since performance, rather than overall bandwidth, is more important).  Now add 5% packet loss on your MPLS link… SD-WAN solutions will be able to dynamically shift your voice/video traffic to the IPSec VPN, since overall performance is better on that path.  Application-centric routing, policy, and performance guarantees are significant advancements made possible by a centralized controller and abstraction.
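
To make the idea a bit more concrete, here is a toy Python model of application-aware path selection. Real SD-WAN controllers implement this as vendor-specific policy rather than a script, and the paths, metrics, and thresholds below are invented, but it shows the logic: each application class has requirements and a preference order, and flows are steered to the first path that still meets them.

    #!/usr/bin/env python3
    """Toy model of application-aware path selection."""

    # Per-path telemetry a controller might hold (hypothetical numbers).
    PATHS = {
        "mpls":  {"loss_pct": 5.0, "jitter_ms": 4,  "latency_ms": 35},
        "ipsec": {"loss_pct": 0.2, "jitter_ms": 9,  "latency_ms": 48},
        "inet":  {"loss_pct": 0.5, "jitter_ms": 15, "latency_ms": 35},
    }

    # Per-application requirements and preference order (also hypothetical).
    POLICY = {
        "voice": {"max_loss_pct": 1.0, "max_jitter_ms": 30,  "prefer": ["mpls", "ipsec"]},
        "files": {"max_loss_pct": 5.0, "max_jitter_ms": 100, "prefer": ["ipsec", "inet"]},
        "saas":  {"max_loss_pct": 2.0, "max_jitter_ms": 50,  "prefer": ["inet", "ipsec"]},
    }

    def select_path(app):
        """Return the first preferred path that meets the app's requirements."""
        rules = POLICY[app]
        for name in rules["prefer"]:
            metrics = PATHS[name]
            if (metrics["loss_pct"] <= rules["max_loss_pct"]
                    and metrics["jitter_ms"] <= rules["max_jitter_ms"]):
                return name
        return rules["prefer"][-1]  # nothing healthy; fall back to the last resort

    for app in POLICY:
        print(f"{app:6s} -> {select_path(app)}")

With 5% loss on the MPLS path, the voice class fails its loss requirement there and lands on the IPSec VPN, exactly the behavior described above.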

 

Real-Time Error Detection/Telemetry – One of the most frustrating conditions to work around on today’s networks is a brown-out condition that doesn’t bring down a routing protocol neighbor relationship.  A look at the interfaces will tell you there is a problem, but if the thresholds aren’t set correctly, manual intervention is required to route around it.  Between the centralized visibility of both sides of the link and the collection and analysis of real-time telemetry data provided by a controller-based architecture, SD-WAN solutions have the ability to route around these brown-out conditions dynamically.  Below are three different types of error conditions one might encounter on a network, along with how current networks and SD-WAN networks might react to them.  This comparison assumes a branch with two distinct uplink paths.

 

Black Out:  One link fully out of service.

Current Routers:  This is handled well by current equipment and protocols.  Traffic will fail over to the backup link and only return once service has been restored.

SD-WAN:  SD-WAN handles this in identical fashion.

 

Single Primary Link Brown Out:  Link degradation (packet loss or jitter) is occurring on only one of multiple links.

Current Routers: Traditional networks don't handle this condition well until the packet loss is significant enough for routing protocols to fail over.  All traffic will continue to use the degraded link, even with a non-degraded link available for use.

SD-WAN:  SD-WAN solutions have the advantage of centralized perspective and can detect these conditions without additional overhead of probe traffic.  Critical traffic can be moved to stable links, and if allowed in the policy, traffic more tolerant of brown out conditions can still use the degraded link.

 

Both Link Brown Out:  All available links are degraded.

Current Routers:  No remediation possible.  Traffic will traverse the best available link that can maintain a routing neighbor relationship.

SD-WAN:  Some SD-WAN solutions provide answers even for this condition.  Through a process commonly referred to as Forward Error Correction, traffic is duplicated and sent out all of your degraded links.  A small buffer is maintained on the receiving end and packets are re-ordered once they are received.  This can significantly improve application performance even across multiple degraded links.

 

Regardless of the specific condition, the addition of a controller to the network gives a centralized perspective and the ability to identify and make routing decisions based on real-time performance data.
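
Here is a toy sketch of that difference in perspective: a traditional routing protocol effectively sees only "neighbor up" or "neighbor down," while a controller with per-link telemetry can also recognize the brown-out in between. The sample data and thresholds are invented for illustration.

    #!/usr/bin/env python3
    """Classify link health from telemetry, the way a controller might."""

    def classify(loss_pct, jitter_ms, neighbor_up):
        """Return 'black-out', 'brown-out', or 'healthy' for one link."""
        if not neighbor_up:
            return "black-out"
        if loss_pct > 2.0 or jitter_ms > 30:
            return "brown-out"
        return "healthy"

    # Hypothetical probe results for a branch with multiple uplinks.
    LINKS = {
        "mpls":  {"loss_pct": 6.0, "jitter_ms": 5,  "neighbor_up": True},
        "inet1": {"loss_pct": 0.1, "jitter_ms": 12, "neighbor_up": True},
        "inet2": {"loss_pct": 0.0, "jitter_ms": 2,  "neighbor_up": False},
    }

    for name, sample in LINKS.items():
        print(f"{name:6s} {classify(**sample)}")

A routing protocol would happily keep using the MPLS link in this example; a controller watching loss and jitter would not.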

 

Efficient Use of Resources - This is the kicker, and I say that because all of the above solutions solve truly technical problems.  This one hits home where most executives care the most.  Due to the active/passive nature of current networks, companies that need redundancy are forced to purchase double their required bandwidth capacity and leave 50% of it idle when conditions are nominal.  Current routing protocols just don't have the ability to easily utilize disparate WAN capacity and then fall back to a single link when necessary.

 

Is it better to pay for 200% of the capacity you need for the few occasions when you need it, or pay for 100% of what you need and deal with only 50% capacity when there is trouble?

 

To add to this argument, many SD-WAN providers are so confident in their solutions that they pitch dropping more expensive SLA-based circuits (MPLS/direct) in favor of far cheaper generic internet bandwidth.  If you are able to procure 10 times the bandwidth, split across three diverse providers, would your performance be better than on a smaller circuit with guaranteed bandwidth, even with the anticipated oversubscription?  These claims need to be proven out, but the intelligence that a controller-based overlay network gives you could very well negate the need to pay for provider-based performance promises.

 

Reading the above list could likely convince someone that SD-WAN is the WAN panacea we’ve all been waiting for.  But, like all technological advancements, it’s never quite that easy.  Controller-orchestrated WANs make a lot of sense in solving some of the more difficult problems we face with our current routing protocols, but no change comes without its own risks and challenges.  Keep an eye out for the third and final installment in this series, where we will address the potential pitfalls associated with implementing an SD-WAN solution and discuss some ideas on how you might mitigate them.

In the world of networking, you would be hard-pressed to find a more pervasive and polarizing topic than that of SDN. The concept of controller-based, policy-driven, and application-focused networks has owned the headlines for several years as network vendors have attempted to create solutions that allow everyone to operate with the same optimization and automation as the large web-scale companies. The hype started in and around data center networks, but over the past year or so, the focus has sharply shifted to the WAN, for good reason.

 

In this three-part series we are going to take a look at the challenges of current WAN technologies, what SD-WAN brings to the table, and what some drawbacks may be in pursuing an SD-WAN strategy for your network.

 

Where Are We Now?

 

In the first iteration of this series, we’re going to identify and discuss some of the limitations in and around WAN technology in today’s networks. The lists below are certainly not comprehensive, but speak to the general issues faced by network engineers when deploying, maintaining, and troubleshooting enterprise WANs.

 

Perspective – The core challenge in creating a policy-driven network is perspective. For the most part, routers in today's networks make decisions independent of the state of peer devices. While there certainly are protocols that share network state information (routing protocols being the primary example), actions based off of this exchanged information are exclusively determined through the lens of the router's localized perspective of the environment.

 

This can cause non-trivial challenges in the coordination of desired traffic behavior, especially for patterns that may not follow the default/standard behavior that a protocol may choose for you. Getting every router to make uniform decisions, each from a different perspective, can be a difficult challenge and can add significant complexity depending on the policy being enforced.

 

Additionally, not every protocol shares every piece of information, so it is entirely possible that one router is making decisions off of considerably different information than what other routers may be using.

 

Application Awareness - Routing in current-generation networks is remarkably simple. A router considers whether or not it is aware of the destination prefix, and if so, forwards the packet on to the next hop along the path. Information outside of the destination IP address is not considered when determining path selection.  Deeper inspection of the packet payload is possible on most modern routers, but that information does not play into route selection decisions. Due to this limitation in how we identify forwarding paths, it is incredibly difficult to differentiate routing policy based off of the application traffic being forwarded.

 

Error Detection/Failover – Error detection and failover in current-generation routing protocols is a fairly binary process. Routers exchange information with their neighbors, and if they don’t hear from them in some sort of pre-determined time window, they tear down the neighbor relationship and remove the information learned from that peer. Only at that point will a router choose to take what it considers to be an inferior path. This solution works well for black-out-style conditions, but what happens when there is packet loss or significant jitter on the link? The answer is that current routing protocols do not take these conditions into consideration when choosing an optimal path. It is entirely possible for a link to have 10% packet loss, which significantly impacts voice calls, while the router plugs along like everything is okay because it never loses contact with its neighbor long enough to tear down the relationship and choose an alternate path. Meanwhile, a perfectly suitable alternative may be sitting idle, providing no value to the organization.

 

Load Balancing/Efficiency - Also inherent in the way routing protocols choose links is the fact that all protocols are looking to identify the single best path (or paths, if they are equal cost) and make it active, leaving all other paths passive until the active link(s) fail. EIGRP could be considered an exception to this rule as it allows for unequal cost load balancing, but even that is less than ideal since it won’t detect brown-out conditions on a primary link and move all traffic to the secondary. This means that organizations have to purchase far more bandwidth than necessary to ensure each link, passive or active, has the ability to support all traffic at any point. Since routing protocols do not have the ability to load balance based off of application characteristics, load balancing and failover is an all or nothing proposition.

 

As stated previously, the above list is just a quick glance at some of the challenges faced in designing and managing the WAN in today’s enterprise network.  In the second part of this series we are going to take a look at what SD-WAN does that helps remediate many of the above challenges.  Also keep your eyes peeled for Part 3, which will close out the series by identifying some potential challenges surrounding SD-WAN solutions, and some final thoughts on how you might take your next step to improving your enterprise’s WAN.

Practitioners in nearly every technology field are facing revolutionary changes in the way systems and networks are built. Change, by itself, really isn't all that interesting. Those among us who have been doing this a while will recognize that technological change is one of the few reliable constants. What is interesting, however, is how things are changing.

 

Architects, engineers, and the vendors that produce gear for them have simply fallen in love with the concept of abstraction. The abstraction floodgates have been flung open following the meteoric rise of the virtual machine in enterprise networks. As an industry, we have watched the abstraction of the operating system from the hardware it lives on give us an amazing amount of flexibility in the way we deploy and manage our systems.  Now that the industry has fully embraced the concept of abstraction, we aim to implement it everywhere.

 

Breaking away from monolithic stack architecture

 

If we take a look at systems specifically, it used to be that the hardware, the operating system, and the application all existed as one logical entity.  If it was a large application, we might have components of the application split out across multiple hardware/OS combos, but generally speaking the stack was a unit. That single unit was something we could easily recognize and monitor as a whole. SNMP, while it has its limitations, has done a decent job of allowing operators to query the state of everything in that single stack.

 

Virtualization changed the game a bit as we decoupled the OS/application from the hardware. While it may not have been the most efficient way of doing it, we could still monitor the VM like we used to when it was coupled with the hardware.  This is because we hadn't really changed the architecture.  Abstraction gave us some significant flexibility, but our applications still relied on the same components, arranged in a similar pattern to the bare-metal stacks we started with.  The difference is that we now had two unique units where information collection was required: the hardware remained as it always had, and the OS/application became a secondary monitoring target.  It took a little more configuration, but it didn't change the nature of the way we monitored our systems.

 

Cloud architecture changes everything

 

Then came the concept of cloud infrastructure. With it, developers began embracing the elastic nature of the cloud and started building their products to take advantage of it. Rather than sizing an application stack based off of guesstimates of the anticipated peak load, it can now be sized minimally and scaled out horizontally when needed by adding additional instances. Previously, just a handful of systems would have handled peak loads. Now those numbers could be dozens, or even hundreds of dynamically built systems scaled out based on demand. As the industry moves in this direction, our traditional means of monitoring simply do not provide enough information to let us know if our application is performing as expected.

 

The networking story is similar in a lot of ways. While networking has generally been resistant to change over the past couple of decades, the need for dynamic/elastic infrastructure is forcing networks to take several evolutionary steps rather quickly.  In order to support the cloud models that application developers have embraced, the networks of tomorrow will be built with application awareness, self-programmability, and moment-in-time best path selection as core components.

 

Much like in the systems world, abstraction is one of the primary keys to achieving this flexibility. Whether the new model of networks is built upon new protocols, or overlays of existing infrastructure, the traditional way of statically configuring networks is coming to an end. Rather than having statically assigned primary, secondary, and tertiary paths, networks will balance traffic based off of business policy, link performance, and application awareness. Fault awareness will be built in, and traffic flows will be dynamically routed around trouble points in the network. Knowing the status of the actual links themselves will become less important, much like physical hardware that applications use. Understanding network performance will require understanding the actual performance of the packet flows that are utilizing the infrastructure.

 

At the heart of the matter, the end goal appears to be ephemeral state of both network path selection as well as systems architecture.

 

So how does this change monitoring?

 

Abstraction inherently makes application and network performance harder to analyze. In the past, we could monitor hardware state, network link performance, CPU, memory, disk latency, logs, etc. and come up with a fairly accurate picture of what was going on with the applications using those resources. Distributed architectures negate the correlation between a single piece of underlying infrastructure and the applications that use it.  Instead, synthetic application transactions and real-time performance data will need to be used to determine what application performance really looks like. Telemetry is a necessary component for monitoring next generation system and network architectures.

 

Does this mean that SNMP is going away?

 

While many practitioners wouldn't exactly shed a tear if they never needed to touch SNMP again, the answer is no. We still will have a need to monitor the underlying infrastructure even though it no longer gives us the holistic view that it once did. The widespread use of SNMP as the mechanism for monitoring infrastructure means it will remain a component of monitoring strategies for some time to come. Next generation monitoring systems will need to integrate the traditional SNMP methodologies with deeper levels of real-time application testing and awareness to ensure operators can remain aware of the environments they are responsible for managing.
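
As a rough sketch of that blended approach, the following Python example pairs a traditional SNMP poll with a simple synthetic application probe. It assumes the classic pysnmp synchronous API (the 4.x hlapi), SNMPv2c with the "public" community string, and placeholder device and application names; a real monitoring platform would do this continuously, at scale, and store the results.

    #!/usr/bin/env python3
    """Pair a traditional SNMP poll with a synthetic application probe."""
    import time
    import urllib.request

    from pysnmp.hlapi import (
        getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
        ContextData, ObjectType, ObjectIdentity,
    )

    DEVICE = "edge-rtr1.example.com"             # hypothetical router
    APP_URL = "https://app.example.com/healthz"  # hypothetical service

    # 1) Infrastructure view: poll sysUpTime over SNMP.
    error_indication, error_status, _, var_binds = next(
        getCmd(
            SnmpEngine(),
            CommunityData("public", mpModel=1),                 # SNMPv2c
            UdpTransportTarget((DEVICE, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),    # sysUpTime.0
        )
    )
    if error_indication or error_status:
        print(f"SNMP poll failed: {error_indication or error_status}")
    else:
        for name, value in var_binds:
            print(f"{DEVICE} {name.prettyPrint()} = {value.prettyPrint()}")

    # 2) Application view: time a synthetic transaction against the service.
    start = time.monotonic()
    try:
        with urllib.request.urlopen(APP_URL, timeout=5) as response:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{APP_URL} -> HTTP {response.status} in {elapsed_ms:.1f} ms")
    except OSError as err:
        print(f"application probe failed: {err}")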
