
Geek Speak


The Actuator - May 18th

Posted by sqlrockstar Employee May 18, 2016

I am in Redmond this week to take part in the SQL Server® 2016 Reviewer's Workshop. Microsoft® gathers a handful of folks into a room and we review details of the upcoming release of SQL Server (available June 1st). I'm fortunate to be on the list so I make a point of attending when asked. I'll have more details to share later, but for now let's focus on the things I find amusing from around the internet...

 

How do you dispose of three Petabytes of disk?

I thought having to do a few of these for friends and family was a pain; I can't imagine having to destroy this many disks. BTW, this might be a good time to remind everyone that data can never be created or destroyed, but it most certainly can be lost or stolen.

 

The Top Five Reasons Your Application Will Fail

Not a bad list, but the author forgot to list "crappy code, pushed out in a hurry, because agile is an excuse to be sloppy". No, I'm not bitter.

 

Audit: IT problems with TSA airport screening equipment persist

"The TSA's lack of server updates and poor oversight caused a plethora of IT security problems". Fortunately no one has any idea how many problems are in a plethora. Also? I know a company that makes tools to help fix such issues.

 

AWS Discovery Service Aims To Ease Legacy Migration Pain

Something tells me this tool is going to cause more pain when companies start to see just how much work needs to be done to migrate anything.

 

Bill Gates’ open letter

Wonderful article on how much the software industry has changed over the past 40 years. It will keep changing, too. I see the Cloud as a way for the software industry to change their licensing model from feature driven (Enterprise, Standard, etc.) to one driven by scalability and performance.

 

How to Reuse Waste Heat from Data Centers Intelligently

While this might sound good to someone, the reality is the majority of companies in the world do not have the luxury of building a data center from scratch, or even renovating existing ones. Still, it's interesting to understand just how much electricity data centers consume, and understand that the power has to come from somewhere.

 

How Much Does the Xbox One’s “Energy Saving” Mode Really Save?

Since we're talking about power usage, here's a nice example to help us understand how much extra it costs us to keep our Xbox always on. If it seems cheap to you then you'll understand how the cost of a data center may seem cheap to a company.

 

This week marks the fifth anniversary of my seeing the final launch of Endeavour, so I wanted to share something related to STS-134:

LaRockLaunch.jpg

 

Lastly, if you've been enjoying The Actuator please like, share, and/or comment. Thanks!

In the past few years, there has been a lot of conversation around the “hypervisor becoming a commodity." It has been said that the underlying virtualization engines, whether they be ESXi, Hyper-V, KVM etc. are essentially insignificant, stressing the importance of the management and automation tools that sit on top of them.

 

These statements do hold some truth: in its basic form, the hypervisor simply runs a virtual machine. As long as end users have the performance they need, there's nothing else to worry about. The three major hypervisors on the market today (ESXi, Hyper-V, KVM) all do this, and they do it well, so I can see how the “hypervisor becoming a commodity” argument works in these cases. But SysAdmins, the people managing everything behind the VM, don't buy the commoditized-hypervisor theory quite so easily.

 

When we think about the word commodity in terms of IT, it’s usually defined as a product or service that is indistinguishable from its competitors, except maybe for price. With that said, if hypervisors were a commodity, we shouldn’t care what hypervisor our applications are running on. We should see no difference between the VMs that are sitting inside an ESXi cluster or a Hyper-V cluster. In fact, in order to be a commodity, these VMs should be able to migrate between hypervisors. The fact is that VMs today are not interchangeable between hypervisors, at least not without changing their underlying anatomy. While it is possible to migrate between hypervisors, there is a process we have to follow that involves converting configurations, disks, etc. The files that make up a VM are all proprietary to the hypervisor they are running on and cannot simply be migrated and run by another hypervisor in their native forms.

 

Also, we stressed earlier the importance of the management tools that sit above the hypervisor, and how the hypervisor didn’t matter as much as those tools did. This is partly true. The management and automation tools put in place are the heart of our virtual infrastructures, but the problem is that these tools often create a divide in the features they support on different hypervisors. Take, for instance, a storage array providing support for VVOLs, VMware’s answer to per-VM, policy-based storage provisioning. This is a standard that allows us to completely change the way we deploy storage, eliminating LUNs and making VMs and their disks first-class citizens on their respective storage arrays. That said, these are storage arrays that are connected to ESXi hosts, not Hyper-V hosts. Another example, this time in favor of Microsoft, is in the hybrid cloud space. With Azure Stack coming down the pipe, organizations will be able to easily deploy and deliver services from their own data centers, but with Azure-like agility. The similar VMware solution, involving vCloud Air and vCloud Connector, is simply not at the same level as Azure when it comes to simplicity, in my opinion. These are two very different feature sets that are only available on their respective hypervisors.

 

So with all that, is the hypervisor a commodity? My take: No! While all the major hypervisors on the market today do one thing (virtualize x86 instructions and provide abstraction to the VMs running on top of them), there are simply too many discrepancies between the compatible third-party tools, features, and products that manage these hypervisors for me to call them commoditized. So I’ll leave you with a few questions. Do you think the hypervisor is a commodity? When/if the hypervisor fully becomes a commodity, what do you foresee our virtual environments looking like? Single or multi-hypervisor? Looking forward to your comments.

The other day we were discussing the fine points of running an IT organization and the influence of People, Process, and Technology on systems management and administration, and someone brought up one of their experiences. Management was frustrated that it took days to get snapshots from their storage and virtualization platform, and was looking to replace the storage platform to solve the problem. Since this was clearly a technology problem, they sought out a solution that would address the technology needs of their organization! Chances are one or more of us have been in this situation before, so they did the proper thing and looked at the solutions! Vendors were brought in, solutions spec’d, technical requirements established, and features vetted. Every vendor was given the hard and fast requirement of “must be able to take snapshots in seconds and present to the operating system to use in a writable fashion.” Once all of the options were reviewed, confirmed, demo’d, and validated, they had a solid solution!

 

Months followed as they migrated off of their existing storage platform onto the new one. The light at the end of the tunnel was there; the panacea to all of their problems was in sight! And finally, they were done. The old storage system was decommissioned and the new storage system was put in place. Management patted themselves on the back and went about dealing with their next project. First and foremost on that list was the instantiation of a new Dev environment based off of their production SAP data. This being a pretty reasonable request, they followed their standard protocol to get it stood up, with snapshots taken and presented. Several days later, the snapshot was presented as requested to the SAP team in order to stand up this Dev landscape. And management was up in arms!

 

What exactly went wrong here? Clearly a technology problem had existed for the organization, and a technology solution was delivered to act on those requirements. Yet had they taken a step back for a moment and looked at the problem for its cause and not its symptoms, they would have noticed that their internal SLAs and processes were really what was at fault, not the choice of technology. Don’t get me wrong, some technology truly is at fault and a new technology can solve it, but to say that is the answer to every problem would be untrue, and some issues need to be looked at in the big picture. The true cause of their problem (their original storage platform COULD have met the requirements) was that their ticketing process required multiple sign-offs for Change Advisory Board management, approval, and authorization, and the SLAs given to the storage team involved a 48-hour response time. In this particular scenario, the Storage Admins were actually pretty excited to present the snapshot, so instead of waiting until the 48th hour to deliver, they provided it within seconds of the ticket making it into their queue.

 

Does this story sound familiar to you or your organization? Feel free to share some of your own personal experiences where one aspect of People, Process, or Technology was blamed for the lack of agility in an organization, and how you (hopefully) were able to overcome it. I’ll do my best to share some other examples, stories, and morals over these coming weeks!

 

I look forward to hearing your stories!

It was all about the network

 

In the past, when we thought about IT, we primarily thought about the network. When we couldn’t get email or access the Internet, we’d blame the network. We would talk about network complexity and look at influencers such as the number of devices, the number of routes data could take, or the available bandwidth.

 

As a result of this thinking, a myriad of monitoring tools were developed to help the network engineer keep an eye on the availability and performance of their networks and they provided basic network monitoring.

 

It’s now all about the service

 

Today, federal agencies cannot function without their IT systems being operational. It’s about providing critical services that will improve productivity, efficiency, and accuracy in decision making and mission execution. IT needs to ensure the performance and delivery of the application or service, and understand the application delivery chain.

 

Advanced monitoring tools for servers, storage, databases, applications, and virtualization are widely available to help diagnose and troubleshoot the performance of these services, but one fact remains: the delivery of these services relies on the performance and availability of the network. And without these critical IT services, the agency’s mission is at risk.

 

Essential monitoring for today’s complex IT infrastructure

 

Users expect to be able to connect anywhere and from anything. Add to that, IT needs to manage legacy physical servers, new virtual servers, and cloud infrastructure as well as cloud-based applications and services, and it is easy to see why basic monitoring simply isn’t enough. This growing complexity requires advanced monitoring capabilities that every IT organization should invest in.

 

Application-aware network performance monitoring provides visibility into the performance of applications and services as a result of network performance by tapping into the data provided by deep packet inspection and analysis.

 

With proactive capacity forecasting, alerting, and reporting, IT pros can easily plan for future needs, making sure that forecasting is based on dynamic baselines and actual usage instead of guesses.

 

Intelligent topology-aware alerts with downstream alert suppression will dramatically reduce the noise and accelerate troubleshooting.
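
To make the idea of downstream alert suppression concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not how any particular monitoring product works: the topology is modeled as a simple tree in which each device has at most one upstream parent, and the device names and the `root_cause_alerts` function are hypothetical.

```python
from typing import Dict, Optional, Set


def root_cause_alerts(parent: Dict[str, Optional[str]], down: Set[str]) -> Set[str]:
    """Return only the down devices whose entire upstream path is still up.

    `parent` maps each device to its upstream device (None for the device
    closest to the monitoring station). A down device is suppressed when
    any device above it is also down, since the outage upstream already
    explains why it is unreachable.
    """
    alerts = set()
    for node in down:
        upstream = parent[node]
        suppressed = False
        while upstream is not None:
            if upstream in down:
                suppressed = True
                break
            upstream = parent[upstream]
        if not suppressed:
            alerts.add(node)
    return alerts


# Hypothetical topology: core-rtr -> dist-sw -> {access-sw1, access-sw2}
topology = {
    "core-rtr": None,
    "dist-sw": "core-rtr",
    "access-sw1": "dist-sw",
    "access-sw2": "dist-sw",
}

# The distribution switch fails, so both access switches look down too.
down_nodes = {"dist-sw", "access-sw1", "access-sw2"}
print(root_cause_alerts(topology, down_nodes))  # {'dist-sw'}
```

Three devices are unreachable, but only one alert is raised, and it points at the device that actually needs attention. Real networks have redundant paths, so a production implementation would need a graph rather than a tree.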

 

Dynamic real-time maps provide a visual representation of a network with performance metrics and link utilization. And with the prevalence of wireless networks, adding wireless network heat maps is an absolute must to understand wireless coverage and ensure that employees can reach critical information wherever they are.

 

Current and detailed information about the network’s availability and performance should be a top priority for IT pros across the government. However, federal IT pros and the networks that they manage are responsible for delivering services and data that ensure that critical missions around the world are successful and that services are available to all citizens whenever they need them. This is no small task. Each network monitoring technique I discussed provides a wealth of data that federal IT pros can use to detect, diagnose, and resolve network performance problems and outages before they impact missions and services that are vital to the country.

 

Find the full article on our partner DLT’s blog, TechnicallySpeaking.

The increasing rate of change in applications and its amplitude footprint are causing a lot of consternation within IT organizations. It’s no coincidence, either, since everything revolves around the application, which is innovation personified. It’s the revenue-generating, value-added differentiation, and it's potentially an industry game changer. Think Uber, Facebook, Netflix, Airbnb, Amazon, and Alibaba.

 

Accordingly, the rate and scale of change are products of the application lifecycle. For instance, applications deployed in a virtualization stack will live for months or years, while applications deployed in a cloud stack will live for hours or weeks. Applications deployed in containers or with microservices will live for microseconds or milliseconds.

AppLifeCycle.png

From my Interop 2016 DART Framework presentation.

 

For IT professionals, it’s good to know where job security is. As such, I’ve been keeping monthly tabs of the number of jobs with the key words virtualization, cloud, or (containers AND microservices), on dice.com. In the past year, since June 2015, the number of jobs with the key word "virtualization" has remained flat with around 2600 job openings. In that same time frame, the number of cloud jobs has increased by over 30% to 8900 job openings, while the number of container/microservices jobs has more than doubled, reflecting almost 600 job openings.

 

These trends re-affirm the hybrid IT paradigm and the need to deal efficiently and effectively with change in their application ecosystem. Let me know what you think in the comment section below.

The vast majority of my customers are highly virtualized, and quite potentially using Amazon or Azure in a shadow IT kind of approach. Some groups within the organization have deployed workloads into these large public provider spaces. It’s simply due to these groups having the need to gain access to resources and deploy them as rapidly as possible.

 

Certainly Development and Testing groups have been building systems, and destroying them as testing moves forward toward production. But also, marketing, and other groups may find that the IT team is less than agile in providing these services on a timely basis. Thus, a credit card is swiped, and development occurs. The first indication that these things are taking place is when the bills come.

 

Often, the best solution is a shared environment: certain workloads are deployed into AWS, Azure, or even SoftLayer, or into peer data centers for a shared but less public option, whichever provides ideal circumstances for the organization.

 

Certainly these services are quite valuable to organizations. But are they secure, or do they potentially expose the company to data vulnerabilities and/or an entrée into the corporate network? Are there compliance issues? How about the costs? If your organization could provide these services in a way that would satisfy the user community, would that be a more efficient, cost-effective, compliant, and consistent platform?

 

These are really significant questions. The answers, though, are rarely simple. Today, there are applications, such as Cloudgenera, which will analyze the new workload and advise the analyst as to whether any of these issues are significant. It’ll also advise as to current cost models to prove out the costs over time. Having that knowledge prior to deployment could be the difference between agility and vulnerability.

 

Another issue to be addressed with opening your environment up to a hybrid or public workload is the learning curve of adopting a new paradigm within your IT group. This can be daunting. To address these kinds of shifts in approach, a new world of public ecosystem partners has emerged. Their tools create workload deployment methodologies that bridge the gap between your internal virtual environment and the public cloud, and ease or even facilitate that transition. Platform9, for example, offers what is essentially a software tool that allows the administrator to decide, from within vCenter’s Platform9 panel, where to deploy a workload. Deploying the tool is as simple as downloading an OVF and deploying it into your vCenter. Platform9 leverages the VMware APIs and the AWS APIs to integrate seamlessly into both worlds. It is simple and elegant, and the learning curve is minimal.

 

There are other avenues to be addressed, of course. For example, what about latencies to the community? Are there storage latencies? Network latencies? How about security concerns?

 

Well, analytics against these workloads as well as those within your virtual environment will no longer be a nice-to-have, but actually a must-have.

 

Lately, I’ve become particularly enthralled with the sheer level of log detail provided by Splunk. There are many SIEM (Security Information and Event Management) tools out there, but in my experience, no other tool is as functional as Splunk. To be sure, other tools, like SolarWinds, provide this level of analytics as well, and do so with aplomb. Splunk is unparalleled as a data collector, but beyond that, it lets you tailor your dashboards to show the trends, analytics, and pertinent data against all of that volume in a functional, at-a-glance way. The tool can stretch itself to cover all your workloads, security, thresholds, etc., and present it all in such a way that the monitor panel or dashboard shows you simply where your issues and anomalies lie.

 

There is a large OpenSource community of SIEM software as well. Tools such as OSSIM, Snort, OpenVAS and BackTrack are all viable options, but remember, as OpenSource, they rarely provide the robust dashboards that SolarWinds or Splunk do. They will, as OpenSource, cost far less, but may require much more hand-holding, and support will likely be far less functional.

 

When I was starting out in the pre-sales world, we began talking of the Journey to the Cloud. It became a trope.  We’re still on that journey. The thing is, the ecosystem that surrounds the public cloud is becoming as robust as the ecosystem that exists surrounding standard, on-prem workloads.

interop.logo.2.jpg

I'm flying home after another incredible Interop experience. It’s the perfect time to capture the conversations, ideas, and feelings I experienced this week in the desert, before they fade like the tan lines I got while waiting ten minutes outside for an Uber.

 

100Gbps (The summary)

 

If money was no object, I would honestly say that this should be on our MUST ATTEND list every year. Even as a conference newbie who probably missed a ton of opportunities along the way, Interop generated an incredibly diverse set of interactions, stories, and ideas.

 

Even if money is an object (which happens to be true for most people and organizations), I would still say that making Interop a priority would reap rewards that totally justify the expense.

 

While vendors are certainly present at Interop, the overall tone is refreshingly agnostic compared to events like Cisco Live, Microsoft Ignite, and VMworld. That means sessions are more focused on the real shortcomings of products and solutions, which allows for conversations about work-arounds, alternatives, and comprehensive solutions.

 

It's not hard to guess what the big stories were at the show this year: cloud, security, and SDN all had places in the sun. More surprising was the level to which the DevOps narrative bled into conversations that were once considered pure networking.

 

Fat Pipe (The details)

One example of that DevOps/NetOps transition was a talk by Jason Edelman about using Ansible to perform configuration backups on legacy (meaning SSH-connected, command-line driven) network devices. While it might sound strange to the THWACK® community, familiar as we are with tools like NCM, it represents an extension of existing skills and technology to teams that are used to using Ansible to deploy and manage cloud- and hybrid-cloud based environments.

 

There were also a few deep-dive sessions on building and leveraging coding skills, such as Python™, for network outcomes, mostly in relation to SDN, NFV, and the like.

 

This, in turn, led to an ongoing dialogue between speakers and attendees in several sessions on the best ways for network professionals to identify, acquire, and develop new skills that will allow them to make the leap to the new age of networking.

 

All of this built up to a narrative that was best championed during Martin Casado's keynote. In one of the best comparisons I've heard to date, Casado compared the current movement away from traditional data centers, networking, servers, and storage to the evolution from in-car navigation systems to running Waze on your phone.

 

He pointed out that every layer of the data center that once featured specialized hardware-based solutions is now completely contained at the software layer.

 

This overall shift is leading to the “rise of the developer,” as Casado put it. This means no silo will be safe from hardware being optimized by a software solution. It also means developers will have more influence over choosing operational frameworks, i.e., the solutions that run the business.

 

Developers, Casado pointed out, care little for Gartner®, or vendor-specific certifications that tie IT pros to specific solutions, or sales relationships, or the vagaries of bureaucratic procurement cycles.

 

The result is that this shift in software-as-infrastructure has the potential to disrupt everything we used to know about the business of IT. 

 

Packet Footer (Summary)

Were you at Interop and saw/heard/discussed something I missed? Do you have a different take than mine? Do you want to hear more on a specific topic? Let me know in the comments below!

 

All of this and more (I haven't even gotten into the discussions about IoT, SDN, or IPv6 that I was able to participate in), made this one of the best conferences I have attended in a very long time.

 

It got me even more excited for conferences to come. Next up is CiscoLive in Las Vegas, July 10-14. I hope to see you there!

Interop 2016 kicked off the week with two days of IT summits that covered an amazing range of topics, including cloud, containers and microservices, IT leadership, and cybersecurity, plus hands-on hacking tutorials. The following three days included the Expo floor opening as well as the session tracks.

 

Since the IT Leadership Summit was sold out, I decided to join the Dark Reading Cyber Security Summit Day 1. I was only planning on attending Day 1, but the content was so good that I eschewed Container Summit and attended Dark Reading's Day 2. To kick things off, the editors at Dark Reading shared some interesting insights followed by industry thought leaders.

DevOps-Sec.png

DevOps - SecOps Relational image via @petecheslock and his Austin DevOps Days 2015 presentation.

 

My top 10 takeaways from the Dark Reading Cybersecurity Summit Days are below.

  1. $71.1B was spent on cybersecurity last year.
  2. Security pros spend most of their time patching legacy stuff and fixing vulnerabilities rather than addressing targeted, sophisticated attacks, which are their primary security concern. Their number two concern is phishing and social engineering attacks.
  3. Security is one of the most important priorities and one of the least resourced by IT organizations. Security pros make policy decisions, but non-security people make purchasing decisions.
  4. The weakest link is the end user; end users make up the surface area of vulnerability.
  5. There are not enough skilled security ops people. 500K to 2M more security pros are needed by 2020.
  6. The most talented security pros are hackers.
  7. The average time to detect an intrusion is 6-7 months.
  8. 92% of the intrusions, incidents, and attacks of the past 10 years fall into nine distinct patterns, which can be further reduced down to three.
  9. The cost of a breach is roughly $254 per record for breaches involving 100 records, but $0.09 per record for breaches involving 100M records. Note that the cost is a multi-variable function with many dimensions to factor in.
  10. Only 40% of attacks are malware, so stopping malware is not enough.
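
To put the per-record breach figures from item 9 in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it multiplies record count by the quoted per-record cost and ignores the many other variables the summit speakers said factor into the real number. The `breach_cost` helper is hypothetical.

```python
from decimal import Decimal


def breach_cost(records: int, cost_per_record: str) -> Decimal:
    """Total cost as records x per-record cost (a deliberately naive model)."""
    return Decimal(cost_per_record) * records


# Per-record figures as quoted at the summit.
small = breach_cost(100, "254")
large = breach_cost(100_000_000, "0.09")

print(f"100 records:  ${small:,.2f}")   # $25,400.00
print(f"100M records: ${large:,.2f}")   # $9,000,000.00
```

The per-record cost drops by a factor of several thousand at scale, but the total still climbs from tens of thousands of dollars into the millions.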

 

Attached below is my DART IT Skills Framework presentation from my Interop IT Leadership speaking session. One of the CIO's SLAs is security, so the Cybersecurity Summit was timely.

 

Let me know what you think of the security insights, as well as my presentation below, in the comment section. I would be happy to present my DART session to our community if there is enough interest, so let me know and I will make it so.


The Actuator - May 11th

Posted by sqlrockstar Employee May 11, 2016

I'm back from Liverpool and SQLBits. It was a brilliant event, as always. If you were there I hope you came by to say hello.

 

Here's this week's Actuator, filled with things I find amusing from around the Internet...

 

What is ransomware and how can I protect myself?

You recover from backups. If you don't have backups then you are hosed.

 

Ivy League economist ethnically profiled, interrogated for doing math on American Airlines flight

To be fair, he is a member of the al-Gebra movement, and was carrying weapons of math instruction.

 

The Year That Music Died

Wonderful interactive display of the top five songs every day since 1958. Imagine if you had this kind of interaction with your monitoring data, with some machine learning on top.

 

Apple Stole My Music. No, Seriously.

Since we are talking about music, here's yet another reason why reading the fine print is important.

 

Apple's Revenue Declines For The First Time In 13 Years

I am certain it has *nothing* to do with the issues inherent in their software and services like Apple Music. None.

 

The Formula One Approach to Security

This article marks the first time I have seen the phrase "security intelligence" and now I'm thinking it will be one of the next big buzzwords. Still a great read and intro to NetFlow for those that haven't heard about that yet.

 

Study: Containers Are Great, but Skilled Admins Are Scarce

I wonder how long they spent studying this. I believe it's always been the case that skilled admins are scarce, which is why we have so many accidental admins in the world. There's more tech work available than tech people available.

 

My secret to avoiding jet lag for events revealed:

NDQE7187 copy.jpg

In the world of networking, you would be hard pressed to find a more pervasive and polarizing topic than SDN. The concept of controller-based, policy-driven, and application-focused networks has owned the headlines for several years as network vendors have attempted to create solutions that allow everyone to operate with the same optimization and automation as the large Web-scale companies. The hype started in and around data center networks, but over the past year or so, the focus has sharply shifted to the WAN, for good reason.

 

In this three-part series we are going to take a look at the challenges of current WAN technologies, what SD-WAN brings to the table, and what some drawbacks may be in pursuing an SD-WAN strategy for your network.

 

Where Are We Now?

 

In the first iteration of this series, we’re going to identify and discuss some of the limitations in and around WAN technology in today’s networks. The lists below are certainly not comprehensive, but speak to the general issues faced by network engineers when deploying, maintaining, and troubleshooting enterprise WANs.

 

Perspective – The core challenge in creating a policy-driven network is perspective. For the most part, routers in today's networks make decisions independent of the state of peer devices. While there certainly are protocols that share network state information (routing protocols being the primary example), actions based off of this exchanged information are exclusively determined through the lens of the router's localized perspective of the environment.

 

This can cause non-trivial challenges in the coordination of desired traffic behavior, especially for patterns that may not follow the default/standard behavior that a protocol may choose for you. Getting every router to make uniform decisions, each utilizing a different perspective, can be a difficult challenge and add significant complexity depending on the policy trying to be enforced.

 

Additionally, not every protocol shares every piece of information, so it is entirely possible that one router is making decisions off of considerably different information than what other routers may be using.

 

Application Awareness - Routing in current-generation networks is remarkably simple. A router considers whether or not it is aware of the destination prefix, and if so, forwards the packet on to the next hop along the path. Information outside of the destination IP address is not considered when determining path selection. Deeper inspection of the packet payload is possible on most modern routers, but that information does not play into route selection decisions. Due to this limitation in how we identify forwarding paths, it is incredibly difficult to differentiate routing policy based off of the application traffic being forwarded.
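
The destination-only lookup described above can be sketched in a few lines of Python using the standard `ipaddress` module. This is a simplified illustration, not a model of any vendor's forwarding plane: the routing table, its next-hop addresses, and the `next_hop` function are hypothetical, and the lookup uses the standard longest-prefix-match rule.

```python
import ipaddress
from typing import Optional

# Hypothetical routing table: destination prefix -> next hop.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "198.51.100.1",   # default route
    ipaddress.ip_network("10.0.0.0/8"): "10.255.0.1",
    ipaddress.ip_network("10.20.0.0/16"): "10.255.0.2",
}


def next_hop(dst: str) -> Optional[str]:
    """Pick the next hop using only the destination address.

    Among all prefixes containing the destination, the most specific
    (longest) one wins. Returns None when no prefix matches.
    """
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


print(next_hop("10.20.30.40"))  # 10.255.0.2
print(next_hop("10.9.9.9"))     # 10.255.0.1
print(next_hop("192.0.2.7"))    # 198.51.100.1
```

Notice what `next_hop` does not accept: there is no port, protocol, or payload argument. A voice call and a bulk backup headed for the same destination get the same answer, which is exactly the limitation being described.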

 

Error Detection/Failover – Error detection and failover in current generation routing protocols is a fairly binary process. Routers exchange information with their neighbors, and if they don’t hear from them in some sort of pre-determined time window, they tear down the neighbor relationship and remove the information learned from that peer. Only at that point will a router choose to take what it considers to be an inferior path. This solution works well for black-out style conditions, but what happens when there is packet loss or significant jitter on the link? The answer is that current routing protocols do not take these conditions into consideration when choosing an optimal path. It is entirely possible for a link to have 10% packet loss, which significantly impacts voice calls, and have the router plug along like everything is okay since it never loses connection with its neighbor long enough to tear down the connection and choose an alternate path. Meanwhile, a perfectly suitable alternative may be sitting idle, providing no value to the organization.
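
A small sketch shows why a lossy link can survive this kind of binary check. The assumptions here are mine, not taken from any specific protocol: the neighbor is declared down only after three consecutive hellos are missed (`hold_multiplier=3`), and the 10% loss is modeled as every tenth hello being dropped. Real loss is random rather than periodic, and real timers vary by protocol and configuration.

```python
from typing import Iterable


def neighbor_stays_up(hellos_received: Iterable[bool], hold_multiplier: int = 3) -> bool:
    """Return True if the adjacency survives the whole sequence.

    The neighbor is torn down as soon as `hold_multiplier` consecutive
    hellos are missed; any received hello resets the counter.
    """
    missed = 0
    for received in hellos_received:
        missed = 0 if received else missed + 1
        if missed >= hold_multiplier:
            return False
    return True


# Brown-out: every tenth hello is dropped (10% loss), for 1000 hellos.
brown_out = [i % 10 != 9 for i in range(1000)]

# Black-out: the link works for 50 hellos, then goes completely dark.
black_out = [True] * 50 + [False] * 10

print(neighbor_stays_up(brown_out))  # True
print(neighbor_stays_up(black_out))  # False
```

The brown-out link is dropping one packet in ten, yet the adjacency never flaps, so traffic stays on it. Even with random, independent loss at 10%, the chance that any given run of three hellos is lost is about 0.1%, so a teardown would be a rare event.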

 

Load Balancing/Efficiency - Also inherent in the way routing protocols choose links is the fact that all protocols are looking to identify the single best path (or paths, if they are equal cost) and make it active, leaving all other paths passive until the active link(s) fail. EIGRP could be considered an exception to this rule as it allows for unequal cost load balancing, but even that is less than ideal since it won’t detect brown-out conditions on a primary link and move all traffic to the secondary. This means that organizations have to purchase far more bandwidth than necessary to ensure each link, passive or active, has the ability to support all traffic at any point. Since routing protocols do not have the ability to load balance based off of application characteristics, load balancing and failover are an all-or-nothing proposition.

 

As stated previously, the above list is just a quick glance at some of the challenges faced in designing and managing the WAN in today’s enterprise network.  In the second part of this series we are going to take a look at what SD-WAN does that helps remediate many of the above challenges.  Also keep your eyes peeled for Part 3, which will close out the series by identifying some potential challenges surrounding SD-WAN solutions, and some final thoughts on how you might take your next step to improving your enterprise’s WAN.

Did the title of this blog entry scare you and make you think, "Why in the world would I do that?"  If so, then there is no need to read further.  The point of this blog post is not to tell you why you should be doing so, only why some have chosen to do so, and what issues they find themselves dealing with after having done so. If you still think that the idea of moving any of your data center to the cloud is simply ludicrous, you may go back to your regularly scheduled programming.

 

If the demand on your company's IT resources is consistent throughout the week and year, then the biggest reason for moving to the cloud really doesn't apply to you.  Consider how Amazon Web Services (AWS) got built. They discovered that most of the demand on their company's IT resources came from a few days of the year: Black Friday, Mother's Day, Christmas, etc. The rest of the year, the bulk of their IT resources were going unused. They asked themselves whether there might be other people who had the need for their IT resources when they weren't using them, and AWS was born. It has, of course, grown well beyond the simple desire to sell excess capacity into one of their most profitable business lines.


If your company's IT systems have a demand curve like that, then the public cloud might be for you. Why pay for servers to sit there for an entire year when you can rent them when demand is high and give them back when demand is low?  In fact, some companies even rent extra computing capacity by the hour when the demand is high. Imagine being able to scale the capabilities of your data center within minutes in order to meet the increased demand created by a Slashdot article or a viral video. This is the reason to go to the cloud. Then, once the demand goes down, simply give that capacity back.
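Here is a small Python sketch of that trade-off. Every number in it is hypothetical, and it assumes (unrealistically) that an owned server and a rented one cost the same per hour, so treat it as an illustration of the shape of the math rather than a pricing model.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_cost_fixed(peak_servers, hourly_cost):
    """Cost of keeping enough servers for peak demand running all year."""
    return peak_servers * hourly_cost * HOURS_PER_YEAR


def yearly_cost_elastic(baseline_servers, peak_servers, peak_hours, hourly_cost):
    """Cost of a year-round baseline plus extra servers only during peak hours."""
    baseline = baseline_servers * hourly_cost * HOURS_PER_YEAR
    burst = (peak_servers - baseline_servers) * hourly_cost * peak_hours
    return baseline + burst


# Hypothetical shop: needs 100 servers on five peak days, 10 the rest of the year.
fixed = yearly_cost_fixed(peak_servers=100, hourly_cost=1.0)
elastic = yearly_cost_elastic(baseline_servers=10, peak_servers=100,
                              peak_hours=5 * 24, hourly_cost=1.0)
```

The spikier the demand curve, the bigger the gap between `fixed` and `elastic`. With flat demand the two numbers are the same.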

 

The challenge for IT people looking to replace portions of their data center with the public cloud is automating it, and making sure that what they automate fits within the budget.  While a public cloud vendor can typically scale to whatever demand level you find yourself with, the bill will automatically scale as well. Unless the huge spike in demand is directly related to a huge spike in sales, your CFO might not take kindly to an enormous bill when your video goes viral. Make sure you plan for that ahead of time so you don't end up having to pay a huge and unexpected cost. Perhaps the decision will be made to just let things get slow for a while. After all, that ends up in the news, too. And if you believe all publicity is good publicity, then maybe it wouldn't be such a bad thing.
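One way to plan for that ahead of time is to put a spending cap into the scaling decision itself. The Python sketch below is a hypothetical helper, not any cloud vendor's API. The function name, the parameters, and the prices are all assumptions made for illustration.

```python
import math


def instances_to_run(requests_per_second, capacity_per_instance,
                     hourly_price, hourly_budget, minimum=1):
    """Return (instances, demand_fully_met) for the coming hour.

    Scales to meet demand, but never beyond what the hourly budget allows.
    """
    wanted = max(minimum, math.ceil(requests_per_second / capacity_per_instance))
    affordable = max(minimum, int(hourly_budget // hourly_price))
    return min(wanted, affordable), wanted <= affordable


# Normal day: demand fits comfortably inside the budget.
normal = instances_to_run(950, 100, hourly_price=0.5, hourly_budget=20)

# Viral day: demand exceeds the budget, so the site gets slow instead of expensive.
viral = instances_to_run(9500, 100, hourly_price=0.5, hourly_budget=20)
```

When the second value comes back `False`, that is the "let things get slow for a while" decision being made by policy rather than by surprise.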

 

There are plenty of companies that have replaced all their data centers with the cloud. Netflix is perhaps the most famous company that runs their entire infrastructure in AWS.  But they argue that the constant changes in demand for their videos make them a perfect match for such a setup. Make sure the way your customers use your services is consistent with the way the public cloud works, and make sure that your CFO is ready for the bill if and when it happens. That's how to move things into the cloud.

As an avid cloud user, I'm always amused by people who suggest that moving things to the cloud means you don't have to manage them.  And, of course, when I say "amused," what I really mean is I feel like Inigo Montoya in The Princess Bride.  "You keep using that word.  I do not think it means what you think it means."

 

Why do I say this?  Because I am an avid cloud user and I manage my cloud assets all the time.  So where do we get this idea?  I'd say it starts with the idea that you don't have to manage the hardware.  Push a few buttons and a "server" magically appears in your web browser.  This is so much easier than creating a real server, which actually works similarly these days.  Push a few buttons on the right web site, and an actual server shows up at your front door in a few days.  All you have to do is plug it in, load the appropriate OS and application stack and you're ready to go.  The cloud VM is a little bit easier.  It appears in minutes and comes preloaded with the OS and application stack that you specified during the build process.

 

I think what most people think when they say their cloud resources don't need to be managed is that they don't have to worry about the hardware.  They know that the VM is running on highly resilient hardware that is being managed for them.  They don't have to worry about a failed disk drive, network controller, PCI card, etc.  It just manages itself. But anyone who thinks this is all that needs to be managed for a server must never have actually managed any servers.

 

There are all sorts of things that must be managed on a server that have nothing to do with hardware.  What about the filesystems?  When you create the VM, you create it with a volume of a certain size.  You need to make sure that volume doesn't fill up and take your server down with it.  You need to monitor the things that would fill it up for no reason, such as web logs, error logs, database transaction logs, etc.  These need to be monitored and managed.  Speaking of logs, what about those error logs?  Is anyone looking at them? Are they scanning them for errors that need to be addressed?  Somebody should be, of course.
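As a minimal sketch of what that monitoring might look like, the Python below checks how full a volume is and pulls error lines out of a log. The keywords and any alert threshold you pair it with are assumptions; real log formats vary.

```python
import shutil


def volume_percent_used(path="/"):
    """Return how full the volume containing path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def error_lines(lines, keywords=("ERROR", "FATAL")):
    """Return the log lines that contain any of the given keywords."""
    return [line for line in lines if any(word in line for word in keywords)]


sample_log = [
    "2016-05-18 10:00:01 INFO request served",
    "2016-05-18 10:00:02 ERROR disk quota exceeded",
]
problems = error_lines(sample_log)
```

Something, or someone, still has to run checks like these on a schedule and act on the results, whether the server is in your rack or in the cloud.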

 

Another thing that can fill up a filesystem is an excessive number of snapshots.  They need to be managed as well.  Older snapshots need to be deleted, and certain snapshots may need to be kept for longer periods of time or archived off to a different medium. Snapshots do not manage themselves.
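A retention policy like that can be sketched in a few lines of Python. The snapshot records, the 14-day window, and the "monthly" tag below are hypothetical. How you actually list and delete snapshots depends on your platform.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, keep_days=14, keep_tags=("monthly",)):
    """Return names of snapshots older than keep_days that are not tagged for keeping.

    snapshots: list of dicts with 'name', 'created' (datetime), and 'tag'.
    """
    cutoff = now - timedelta(days=keep_days)
    return [snap["name"] for snap in snapshots
            if snap["created"] < cutoff and snap["tag"] not in keep_tags]


now = datetime(2016, 5, 18)
snapshots = [
    {"name": "snap-old", "created": datetime(2016, 4, 1), "tag": "daily"},
    {"name": "snap-april", "created": datetime(2016, 4, 1), "tag": "monthly"},
    {"name": "snap-recent", "created": datetime(2016, 5, 17), "tag": "daily"},
]
expired = snapshots_to_delete(snapshots, now)
```

Here only `snap-old` is selected: the monthly snapshot is kept for longer, and the recent one is still inside the retention window.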

 

What about my favorite topic of backups?  Is that VM getting backed up?  Does it need to be?  If you configured it to be backed up, is it backing up?  Is anyone looking at those error logs?  One of the biggest challenges is figuring out when a backup didn't run. It's relatively easy to figure out when a backup ran but failed; however, if someone configured the backup to not run at all, there's no log of that.  Is someone looking for backups that just magically disappeared?
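The trick to catching a backup that never ran is to compare the log against a list of what should have run, rather than only reading the log. A minimal Python sketch, with made-up host names and a made-up log format:

```python
def missing_backups(expected_hosts, backup_log):
    """Return (never_ran, failed) host lists for one backup window.

    backup_log: iterable of (host, status) tuples; status is 'success' or anything else.
    """
    entries = list(backup_log)
    seen = {host for host, _ in entries}
    succeeded = {host for host, status in entries if status == "success"}
    never_ran = sorted(set(expected_hosts) - seen)
    failed = sorted(seen - succeeded)
    return never_ran, failed


expected = ["web1", "db1", "app1"]
log = [("web1", "success"), ("db1", "failed")]
never_ran, failed = missing_backups(expected, log)
```

The failed backup for `db1` shows up in the log, but `app1` only shows up because it was on the expected list. Keeping that list accurate is, of course, one more thing to manage.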


Suffice it to say that the cloud doesn't remove the need for management.  It just moves it to a different place.  Some of these things may be able to be offloaded to the cloud vendor, of course.  But even if that's the case, someone needs to watch the watcher.  There is no such thing as a free lunch and there is no such thing as a server that manages itself.

Network variation is hurting us

Network devices like switches, routers, firewalls and load-balancers ship with many powerful features. These features can be configured by each engineer to fit the unique needs of every network. This flexibility is extremely useful and, in many ways, it's what makes networking cool. But there comes a point at which this flexibility starts to backfire and become a source of pain for network engineers.

Variation creeps up on you.  It can start with harmless requests for some non-standard connectivity, but I've seen those requests grow to the point where servers were plugging straight into the network core routers.  In time, these one-off solutions start to accumulate and you can lose sight of what the network ‘should’ look like.  Every part of the network becomes its own special snowflake.

I’m not judging here. I've managed quite a few networks and all of them ended up with high degrees of variation and technical debt. In fact, it takes considerable effort to fight the storm of snowflakes. But if you want a stable and useful network you need to drive out variation. Of course you still need to meet the demands of the business, but only up to a point. If you're too flexible you will end up hurting your business by creating a brittle network which cannot handle changes.

Your network becomes easier and faster to deploy, monitor, map, audit, understand and fix if you limit your network to a subset of standard components. Of course there are great monitoring tools to help you manage messy networks, but you’ll get greater value from your tools when you point them towards a simple structured network.

What’s so bad about variety?

Before we can start simplifying our networks we have to see the value in driving out that variability. Here are some thoughts on how highly variable (or heterogeneous) networks can make our lives harder as network engineers:

  • Change control - Making safe network change is extremely difficult without standard topologies or configurations. Making a change safely requires a deep understanding of the current traffic flows - and this will take a lot of time. Documentation makes this easier, but a simple standardized topology is best. The most frustrating thing is that when you do eventually cause an outage, the lessons learned from your failed change cannot be applied to other dissimilar parts of your network.
  • Discovery time can be high. How do you learn the topology of your network in advance of problems occurring? A topology mapping tool can be really helpful to reduce the pain here, but most people have just an outdated Visio diagram to rely on.
  • Operations can be a nightmare in snowflake networks.  Every problem will be a new one, but probably one that could have been avoided - it's likely that you'll go slowly mad. Often you'll start troubleshooting a problem and then realize, ‘oh yeah, I caused this outage with the shortcut I took last week. Oops’.  By the way, it’s a really good sign when you start to see the same problems repeatedly. Operations should be boring. It means you can re-orient your Ops time towards 80/20 analysis of issues, rather than spending your days firefighting.
  • Stagnation -  You won't be able to improve your network until you simplify and standardize your network. Runbooks are fantastic tools for your Ops and Deployment teams, but the runbook will be useless if the steps are different for every switch in your network. Think about documenting a simple task...if network Y do step1, except if feature Z enabled then do something else, except if it’s raining or if it's a leap year.  You get the message.
  • No-Automation - If your process is too complicated to capture in a runbook you shouldn't automate it. Simplify your network, then your process, then automate.
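If you want a rough feel for how many snowflakes you have, one approach is to group devices by a fingerprint of their configuration. The Python sketch below is an illustration under heavy assumptions: it ignores only `hostname` lines, whereas a real comparison would also have to mask other per-device values such as IP addresses. The device names and config text are made up.

```python
import hashlib
from collections import defaultdict


def config_variants(configs):
    """Group devices whose configs are identical once hostname lines are ignored.

    configs: dict of device name -> config text.
    Returns a list of device groups, largest group first.
    """
    groups = defaultdict(list)
    for device, text in configs.items():
        lines = [line.strip() for line in text.splitlines()]
        kept = [line for line in lines if line and not line.startswith("hostname")]
        digest = hashlib.sha256("\n".join(kept).encode()).hexdigest()
        groups[digest].append(device)
    return sorted((sorted(devices) for devices in groups.values()),
                  key=len, reverse=True)


configs = {
    "sw1": "hostname sw1\nvlan 10\n",
    "sw2": "hostname sw2\nvlan 10\n",
    "sw3": "hostname sw3\nvlan 10\nvlan 99\n",
}
variants = config_variants(configs)
```

A standardized network produces a few large groups. A snowflake network produces nearly as many groups as devices, which is a hint that a runbook will not apply cleanly.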

 

Summary

Network variation can be a real source of pain for us engineers. In this post we looked at the pain it causes and why we need to simplify and standardize our networks. In Part 2 we'll look at the root causes for these complicated, heterogeneous networks and how we can begin tackling the problem.

Data center consolidations have been a priority for years, with the objectives of combatting server sprawl, centralizing and standardizing storage, streamlining application management, and establishing shared services across multiple agencies.

 

But, consolidation has created challenges for federal IT professionals, including:

  • Managing the consolidation without an increase in IT staff
  • Adapting to new best practices like shared services and cloud computing
  • Shifting focus to optimizing IT through more efficient computing platforms

 

Whether agencies have finished their consolidation or not, federal IT pros have definitely felt the impact of the change. But how do the remaining administrators manage the growing infrastructure and issues while meeting SLAs?

 

One way data center administrators can stay on top of all the change is to modernize their monitoring system, with the objective of improving visibility and troubleshooting.

 

The Value of Implementing Holistic Monitoring

 

A holistic approach to monitoring provides visibility into how each individual component is running and impacting the environment as a whole. It can bridge the gap that exists between the IT team and the program groups through connected visibility.

 

Responsibility

Who is responsible for what? Shared services can be hard to navigate.

 

Even though the data center team now owns the infrastructure and application operations, the application owners still need to ensure application performance. Both teams require visibility into performance with a single point of truth, which streamlines communication and eases the transition to shared services.

 

Application Performance

Application performance is critical to executing agency missions, so when users provide feedback that an application is slow, it is up to data center administrators to find the problem and fix it—or escalate it—quickly.

 

Individually checking each component of the IT infrastructure—the application, servers, storage, database or a virtualized environment—can be tedious, time consuming and difficult. End-to-end visibility into how each component is performing allows for quick identification and remediation of the issues.

 

Virtualization

Virtualization can introduce complexities and management challenges. In a virtual environment, virtual machines can be cloned and moved around so easily and often that the impact on the entire environment can be missed, especially in a dynamically changing infrastructure.

 

Consolidated monitoring and comprehensive awareness of the end-to-end virtual environment is the answer to effective change management in the virtualized environment.

 

Efficiency

Efficiency was a key driver behind consolidations, but this can seem near impossible for the remaining data centers. But with integrated monitoring that provides end-to-end visibility, data center administrators can troubleshoot issues in seconds instead of hours or days and proactively manage their IT. With the right tools, administrators can provide end-users with high service levels.

 

Consolidation is part of the new reality for data center administrators. Holistic, integrated monitoring and management of the dynamically changing IT environment will help to refine the new responsibilities of being a shared service, ensure mission-critical applications are optimized and improve visibility into virtualized environments.

 

Find the full article on Signal.

Practitioners in nearly every technology field are facing revolutionary changes in the way systems and networks are built. Change, by itself, really isn't all that interesting. Those among us who have been doing this a while will recognize that technological change is one of the few reliable constants. What is interesting, however, is how things are changing.

 

Architects, engineers, and the vendors that produce gear for them have simply fallen in love with the concept of abstraction. The abstraction floodgates have metaphorically been flung open following the meteoric rise of the virtual machine in enterprise networks. As an industry, we have watched the abstraction of the operating system -- from the hardware it lives on -- give us an amazing amount of flexibility in the way we deploy and manage our systems.  Now that the industry has fully embraced the concept of abstraction, we aim to implement it everywhere.

 

Breaking away from monolithic stack architecture

 

If we take a look at systems specifically, it used to be that the hardware, the operating system, and the application all existed as one logical entity.  If it was a large application, we might have components of the application split out across multiple hardware/OS combos, but generally speaking the stack was a unit. That single unit was something we could easily recognize and monitor as a whole. SNMP, while it has its limitations, has done a decent job of allowing operators to query the state of everything in that single stack.

 

Virtualization changed the game a bit as we decoupled the OS/Application from the hardware. While it may not have been the most efficient way of doing it, we could still monitor the VM like we used to when it was coupled with the hardware.  This is because we hadn't really changed the architecture.  Abstraction gave us some significant flexibility but our applications still relied on the same components, arranged in a similar pattern to the bare-metal stacks we started with.  The difference is that we now had two unique units where information collection was required, the hardware remained as it always had and the OS/Application became a secondary monitoring target.  It took a little more configuration but it didn't change the nature of the way we monitored the systems.

 

Cloud architecture changes everything

 

Then came the concept of cloud infrastructure. With it, developers began embracing the elastic nature of the cloud and started building their products to take advantage of it. Rather than sizing an application stack based off of guesstimates of the anticipated peak load, it can now be sized minimally and scaled out horizontally when needed by adding additional instances. Previously, just a handful of systems would have handled peak loads. Now those numbers could be dozens, or even hundreds of dynamically built systems scaled out based on demand. As the industry moves in this direction, our traditional means of monitoring simply do not provide enough information to let us know if our application is performing as expected.

 

The networking story is similar in a lot of ways. While networking has generally been resistant to change over the past couple of decades, the need for dynamic/elastic infrastructure is forcing networks to take several evolutionary steps rather quickly.  In order to support the cloud models that application developers have embraced, the networks of tomorrow will be built with application awareness, self-programmability, and moment-in-time best path selection as core components.

 

Much like in the systems world, abstraction is one of the primary keys to achieving this flexibility. Whether the new model of networks is built upon new protocols, or overlays of existing infrastructure, the traditional way of statically configuring networks is coming to an end. Rather than having statically assigned primary, secondary, and tertiary paths, networks will balance traffic based off of business policy, link performance, and application awareness. Fault awareness will be built in, and traffic flows will be dynamically routed around trouble points in the network. Knowing the status of the actual links themselves will become less important, much like physical hardware that applications use. Understanding network performance will require understanding the actual performance of the packet flows that are utilizing the infrastructure.
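As a sketch of what policy-driven path selection could look like, the Python below picks a path per application based on measured loss, latency, and jitter. The policy thresholds, path names, and measurements are all hypothetical, and real products will make this decision in their own ways.

```python
# Hypothetical per-application requirements (illustrative thresholds only).
POLICIES = {
    "voice": {"max_loss_pct": 1.0, "max_latency_ms": 150, "max_jitter_ms": 30},
    "bulk": {"max_loss_pct": 5.0, "max_latency_ms": 500, "max_jitter_ms": 200},
}


def pick_path(app, paths, policies=POLICIES):
    """Return the lowest-latency path that meets the app's policy, or None.

    paths: dict of path name -> {'loss_pct', 'latency_ms', 'jitter_ms'} measurements.
    """
    policy = policies[app]
    eligible = [
        name for name, measured in paths.items()
        if measured["loss_pct"] <= policy["max_loss_pct"]
        and measured["latency_ms"] <= policy["max_latency_ms"]
        and measured["jitter_ms"] <= policy["max_jitter_ms"]
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda name: paths[name]["latency_ms"])


# The primary link is up but lossy; the secondary is slower but clean.
paths = {
    "mpls": {"loss_pct": 3.0, "latency_ms": 20, "jitter_ms": 5},
    "internet": {"loss_pct": 0.2, "latency_ms": 40, "jitter_ms": 10},
}
voice_path = pick_path("voice", paths)
bulk_path = pick_path("bulk", paths)
```

Voice is steered to the clean link while bulk traffic keeps using the faster, lossy one, which is exactly the kind of decision a destination-only routing table cannot make.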

 

At the heart of the matter, the end goal appears to be an ephemeral state for both network path selection and systems architecture.

 

So how does this change monitoring?

 

Abstraction inherently makes application and network performance harder to analyze. In the past, we could monitor hardware state, network link performance, CPU, memory, disk latency, logs, etc. and come up with a fairly accurate picture of what was going on with the applications using those resources. Distributed architectures negate the correlation between a single piece of underlying infrastructure and the applications that use it.  Instead, synthetic application transactions and real-time performance data will need to be used to determine what application performance really looks like. Telemetry is a necessary component for monitoring next generation system and network architectures.
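A synthetic transaction can be as simple as running a representative operation on a timer and recording how long it took. The Python sketch below is a minimal illustration. The `transaction` callable stands in for whatever request or query represents your application, and the threshold is an assumption you would tune.

```python
import time


def synthetic_check(transaction, threshold_seconds):
    """Run one synthetic transaction and report success, duration, and slowness."""
    start = time.perf_counter()
    try:
        transaction()
        ok = True
    except Exception:
        # Any exception counts as a failed transaction.
        ok = False
    elapsed = time.perf_counter() - start
    return {"ok": ok, "seconds": elapsed, "slow": elapsed > threshold_seconds}


def fake_login():
    """Placeholder for a real application request."""
    return "logged in"


result = synthetic_check(fake_login, threshold_seconds=1.0)
```

Because the check exercises the application the way a user would, it keeps working no matter how many instances are behind it or which path the packets took.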

 

Does this mean that SNMP is going away?

 

While many practitioners wouldn't exactly shed a tear if they never needed to touch SNMP again, the answer is no. We still will have a need to monitor the underlying infrastructure even though it no longer gives us the holistic view that it once did. The widespread use of SNMP as the mechanism for monitoring infrastructure means it will remain a component of monitoring strategies for some time to come. Next generation monitoring systems will need to integrate the traditional SNMP methodologies with deeper levels of real-time application testing and awareness to ensure operators can remain aware of the environments they are responsible for managing.
