
Geek Speak


This is the second installment of a three-part series discussing SD-WAN (Software Defined WAN), what current problems it may solve for your organization, and what new challenges it may introduce. Part 1 of the series, which discusses some of the drawbacks and challenges of our current WANs, can be found HERE.  If you haven’t already, I would recommend reading that post before proceeding.

 

Great!  Now that everyone has a common baseline on where we are now, the all-important question is…

 

Where are we going?

 

This is where SD-WAN comes into the picture.  SD-WAN is a generic term for a controller-driven and orchestrated Wide Area Network.  I say it's generic because there is no definition of what does and does not constitute an SD-WAN solution, and, as can be expected, every vendor approaches these challenges from their own unique perspectives and strengths.  While the approaches do have unique qualities about them, the reality is that they are all solving for the same set of problems and consequently have been converging on a set of similar solutions.  Below we are going to take a look at these "shared" SD-WAN concepts and how these changes in functionality can solve some of the challenges we've been facing on the WAN for a long time.

 

Abstraction – This is at the heart of SD-WAN solutions, even though abstraction in and of itself isn't a solution to any particular problem.  Think of abstraction like you think about system virtualization.  All the parts and pieces remain, but we separate the logic/processing (VM/OS) from the hardware (server).  In the WAN scenario, though, we are separating the logic (routing, path selection) from the underlying hardware (WAN links and traditional routing hardware).

 

The core benefit of abstraction is that it increases flexibility in route decisions and reduces dependency on any one piece of underlying infrastructure.  All of the topics below build upon this idea of separating the intelligence (overlay) from the devices responsible for forwarding that traffic (underlay).  Additionally, abstraction reduces the impact of any one change in the underlay, again drawing parallels from the virtualization of systems architecture.  Changing circuit providers or routing hardware in our current networks can be a time-consuming, costly, and challenging task.  When those components exist as part of an underlay, migration from one platform to another, or one circuit provider to another, becomes a much simpler task.

 

Centralized Perspective - Unlike our current generation of WANs, SD-WAN networks almost universally utilize some sort of controller technology.  This centrally located controller is able to collect information on the entirety of the network and intelligently influence traffic based on analysis of the performance of all hardware and links.  These decisions then get pushed down to local routing devices to enforce the optimal routing policy determined by the controller.

 

This is a significant shift from what we are doing today, where each and every routing device is making decisions off of a very localized view of the network and is only aware of performance characteristics for the links it is directly connected to.  By being able to see trouble many hops away from the source of the traffic, a centralized controller can route around it at the most opportune location, providing the best possible service level for the data flow.

 

Application Awareness - Application identification isn't exactly new to router platforms.  What is new is the ability to make dynamic routing decisions based on specific applications, or even sub-components of those applications.  Splitting traffic between links based on business criticality and ancillary business requirements has long been a request of both small and large shops alike.  Implementing these policy-based routing decisions in current-generation networks has almost always produced messy and unpredictable results.

 

Imagine being able to route SaaS traffic directly out to the internet (since we trust it and it doesn’t require additional security filtering), file sharing across your internet-based IPSec VPN (since performance isn’t as critical as other applications), and voice/video across an MPLS line with an SLA (since performance, rather than overall bandwidth, is what matters most).  Now add 5% packet loss on your MPLS link… SD-WAN solutions will be able to dynamically shift your voice/video traffic to the IPSec VPN since overall performance is better on that path.  Application-centric routing, policy, and performance guarantees are significant advancements made possible with a centralized controller and abstraction.
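To make that policy idea concrete, here is a minimal, hypothetical sketch of the kind of decision logic a controller might apply. The application names, loss thresholds, and link metrics are invented for the example and are not any vendor's actual policy language.

    # Hypothetical illustration of application-aware path selection.
    # Link metrics, app classes, and thresholds are made up for the example.

    LINKS = {
        "mpls":  {"loss_pct": 5.0, "latency_ms": 20},   # degraded MPLS circuit
        "ipsec": {"loss_pct": 0.1, "latency_ms": 35},   # internet-based IPSec VPN
        "inet":  {"loss_pct": 0.3, "latency_ms": 30},   # direct internet
    }

    POLICY = {
        # app class: (preferred path, max acceptable loss %)
        "saas":        ("inet",  2.0),
        "file_share":  ("ipsec", 5.0),
        "voice_video": ("mpls",  1.0),
    }

    def choose_path(app):
        preferred, max_loss = POLICY[app]
        if LINKS[preferred]["loss_pct"] <= max_loss:
            return preferred
        # Preferred path is out of tolerance: pick the healthiest remaining link.
        return min(
            (name for name in LINKS if name != preferred),
            key=lambda name: (LINKS[name]["loss_pct"], LINKS[name]["latency_ms"]),
        )

    for app in POLICY:
        print(app, "->", choose_path(app))
    # With 5% loss on MPLS, voice_video shifts to the IPSec path.

A real controller works from continuous telemetry rather than static numbers, but the shape of the decision is the same: match the application, check the preferred path against its tolerance, and move the flow when the path falls out of spec.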

 

Real Time Error Detection/Telemetry – One of the most frustrating conditions to work around on today’s networks is a brownout-type condition that doesn’t bring down a routing protocol neighbor relationship.  While a look at the interfaces will tell you there is a problem, if the thresholds aren’t set correctly, manual intervention is required to route around it.  Between the centralized visibility of both sides of the link and the collection/analysis of real-time telemetry data provided by a controller-based architecture, SD-WAN solutions have the ability to route around these brownout conditions dynamically.  Below are three different types of error conditions one might encounter on a network and how current networks and SD-WAN networks might react to them.  This comparison is based on a branch with two unique uplink paths.

 

Black Out:  One link fully out of service.

Current Routers:  This is handled well by current equipment and protocols.  Traffic will fail over to the backup link and only return once service has been restored.

SD-WAN:  SD-WAN handles this in identical fashion.

 

Single Primary Link Brown Out:  Link degradation (packet loss or jitter) is occurring on only one of multiple links.

Current Routers: Traditional networks don't handle this condition well.  Until the packet loss is significant enough for routing protocols to fail over, all traffic will continue to use the degraded link, even with a non-degraded link available for use.

SD-WAN:  SD-WAN solutions have the advantage of centralized perspective and can detect these conditions without additional overhead of probe traffic.  Critical traffic can be moved to stable links, and if allowed in the policy, traffic more tolerant of brown out conditions can still use the degraded link.

 

Both Link Brown Out:  All available links are degraded.

Current Routers:  No remediation possible.  Traffic will traverse the best available link that can maintain a routing neighbor relationship.

SD-WAN:  Some SD-WAN solutions provide answers even for this condition.  Through a process commonly referred to as Forward Error Correction, traffic is duplicated and sent out all of your degraded links.  A small buffer is maintained on the receiving end and packets are re-ordered once they are received.  This can significantly improve application performance even across multiple degraded links.
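As a rough illustration of the receive side of that duplicate-and-reorder technique, here is a simplified sketch. Real implementations work in the packet-forwarding path (and some vendors use true FEC parity packets rather than straight duplication); the sequence numbers, payloads, and tiny jitter buffer here are purely illustrative.

    # Simplified sketch: de-duplicate and re-order packets that arrived
    # over two degraded links, delivering each payload exactly once.

    import heapq

    def reorder(received, first_seq=0):
        """received: iterable of (seq, payload) from all links, any order,
        possibly duplicated. Yields each payload once, in sequence order."""
        buffer, seen, next_seq = [], set(), first_seq
        for seq, payload in received:
            if seq < next_seq or seq in seen:
                continue                          # duplicate or already delivered
            seen.add(seq)
            heapq.heappush(buffer, (seq, payload))
            while buffer and buffer[0][0] == next_seq:
                yield heapq.heappop(buffer)[1]
                next_seq += 1

    # Each packet was sent down both links; link A lost seq 1, link B delivered it.
    link_a = [(0, "p0"), (2, "p2"), (3, "p3")]
    link_b = [(1, "p1"), (0, "p0"), (3, "p3"), (2, "p2")]
    print(list(reorder(link_a + link_b)))         # ['p0', 'p1', 'p2', 'p3']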

 

Regardless of the specific condition, the addition of a controller to the network gives a centralized perspective and the ability to identify and make routing decisions based on real-time performance data.

 

Efficient Use of Resources - This is the kicker, and I say that because all of the above solutions solve truly technical problems.  This one hits home where most executives care the most.  Due to the active/passive nature of current networks, companies who need redundancy are forced to purchase double their required bandwidth capacity and leave 50% of it idle when conditions are nominal.  Current routing protocols just don't have the ability to easily utilize disparate WAN capacity and then fall back to a single link when necessary.

 

Is it better to pay for 200% of the capacity you need for the few occasions when you need it, or pay for 100% of what you need and deal with only 50% capacity when there is trouble?

 

To add to this argument, many SD-WAN providers are so confident in their solutions that they pitch being able to drop more expensive SLA-based circuits (MPLS/Direct) in favor of far cheaper generic internet bandwidth.  If you are able to procure 10 times the bandwidth, split across 3 diverse providers, would your performance be better than a smaller circuit with guaranteed bandwidth, even with the anticipated oversubscription?  These claims need to be proven out, but the intelligence that the controller-based overlay network gives you could very well negate the need to pay for provider-based performance promises.
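A quick back-of-the-envelope comparison shows why the pitch is tempting. The prices below are entirely made up for illustration; substitute your own circuit quotes before drawing any conclusions.

    # Hypothetical monthly pricing -- the point is the capacity-per-dollar
    # gap, not the specific numbers.

    mpls = {"mbps": 100, "monthly_cost": 2000}            # SLA-backed circuit
    inet = {"mbps": 500, "monthly_cost": 500}             # one commodity circuit
    diverse_inet = {"mbps": 3 * inet["mbps"],             # three diverse providers
                    "monthly_cost": 3 * inet["monthly_cost"]}

    for name, link in [("MPLS", mpls), ("3x internet", diverse_inet)]:
        per_dollar = link["mbps"] / link["monthly_cost"]
        print(f'{name}: {link["mbps"]} Mbps for ${link["monthly_cost"]}/mo '
              f'({per_dollar:.2f} Mbps per dollar)')

Whether the extra raw capacity actually beats a guaranteed circuit still depends on the overlay's ability to steer around bad internet paths, which is exactly the claim that needs proving out.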

 

Reading the above list could likely convince someone that SD-WAN is the WAN panacea we’ve all been waiting for.  But, like all technological advancement, it’s never quite that easy.  Controller-orchestrated WANs make a lot of sense in solving some of the more difficult questions we face with our current routing protocols, but no change comes without its own risks and challenges.  Keep a lookout for the third and final installment in this series, where we will address the potential pitfalls associated with implementing an SD-WAN solution and discuss some ideas on how you might mitigate them.

Another important tip from your friendly neighborhood dev

By Corey Adler, Professional Software Developer

 

Greetings again, Thwack! It is I, your friendly neighborhood dev, back again with another tip to help you with your code education. I appreciate all of the wonderful feedback that I’ve seen so far on my previous posts (So You’re Learning to Code? Let’s Talk, Still Learning to Code? Here’s a Huge Tip, and MOAR CODING TIP! AND LOTS OF CAPS!), and am grateful to be able to help you out in any way that I can. Although my previous posts have typically dealt with actual coding, this post is going to deal with something else. You see, I’ve been here trying to help you now for 3 posts, and I’d like to think that I’m doing fairly well on that score, but you should know that in programming, much like in other fields, there are people who mean well when they want to help you but will fail miserably when push comes to shove. The kinds of people who shouldn’t be allowed to teach at all—even when they are motivated and want to do so. When these people do come to your aid you’ll typically end up the worse for it. And so, to that end:

 

                6) STAY AWAY FROM BROGRAMMERS. VERY, VERY, VERY FAR AWAY.

 

So what is a brogrammer, you ask? According to Urban Dictionary, a brogrammer is defined as: “A programmer who breaks the usual expectations of quiet nerdiness and opts instead for the usual trappings of a frat-boy: popped collars, bad beer, and calling everybody ‘bro’. Despised by everyone, especially other programmers.”

 

Now, before we get any further on this I want to emphasize that I am not singling out specific people. This is not a personal attack on anyone (in case any of the ones that I’ve met for some reason end up on Thwack, which I highly doubt). This is entirely a professional rebuke of the whole brogrammer mindset and attitude. It’s become more and more commonplace amongst entry-level software developers, and it can be highly toxic to the projects and people associated with it.

 

There are, traditionally, 2 problem areas that a brogrammer fits right into:

  1. Sexism
  2. Bad Craft

 

Let’s tackle these in reverse order.

 

What do I mean by “bad craft”? Oftentimes with brogrammers you will find an unwillingness to spend the extra time and attention to detail needed to create good code. Brogrammers tend to care about coming in each day, doing their 9-to-5 (not a minute more), and leaving to go have fun with their friends. They tend not to try and better their craft unless it’s absolutely necessary for them to keep their jobs (and beer money). Let me illustrate with an example: There exists a well-known company (name withheld for probable legal reasons) about which a former employee wrote the following:

 

“[Company] is run on a tangled mess of homegrown tools, horrendously fragile code and the worst engineering practices I've ever seen from any company. There is no QA, code reviews aren't taken seriously, anyone can commit to master and push their code to production at any time…Brogramming is real and [Company] exemplifies it. It was the norm for bros to knowingly push buggy, incomplete, untested code into production after a few rounds of drinks then leave the problems for others while they moved onto another project.”

 

Or how about from this company:

 

“Once you're in, you'll be subject to one of the most cliquish offices I've had the "pleasure" to work in (I've been in IT for 8 years). The best word I can use to describe the existing staff: brogrammers. Definitely male-dominated. Every guy is an alpha male who probably rather go and lift weights, instead of writing solid code. The code base is a mess. Barely documented, sloppy, and basically changed on an ad hoc basis (if you ask about requirements documents you'll be laughed at). Everyone seems far too busy to do any kind of peer-review and coaching amounts to an irritated developer grudgingly taking a moment to help, and the help amounts to the developer doing it and basically saying, "there" and leaving.”

 

As someone who is starting out and trying to learn the basics of programming, it is incumbent on you to find the best role models, who will help you learn and grow in the ways of the Force (i.e. programming). Brogrammers are the complete opposite of what you need. They don’t want to use good programming practices or have clean, understandable code. In such circumstances you would most likely be worse off after asking them for help.

 

Having said all that about “bad craft”, let me just mention that this problem is a drop in the bucket compared to the other problem—that of rampant sexism. Now, there is no possible way for me to fully cover all of the horror that is sharing an industry with these “people” in a short space such as this, but let me at least give you a taste of what I’m talking about. You see, the above stuff that I wrote about bad craft? I wrote that 6 months ago. This next part? It’s taken a while. It’s been difficult to put my feelings into words that I can type without dry heaving into the wastebasket next to me at the thought that these “people” use the same tools that I do, use the same languages that I do, and pretend that their jobs are the same as mine. Let’s see some of the horror stories that I found in my research, both online and talking to my female co-workers.

 

  • Take, for example, a very famous social media company who decided to have an employee appreciation party. They even, it being an employee appreciation event, decided to let the employees vote and decide what type of party it was going to be. Not a bad idea, right? Pretty innocuous…until they voted to have it in the style of a frat party. The real cherry on top? Said company was embroiled in a lawsuit brought by a female former engineer for discrimination against women. How inclusive of them.
  • Or those times that my friend C, a fellow developer, would try to make suggestions to a client about the project she was working on and have them be rejected…only for said client to accept it once one of her male coworkers suggested it.
  • Or, as Leon sent me on Twitter, that time that someone spoke at SQL Saturday and received an evaluation form from someone that suggested that she could improve her speaking by wearing a negligee. (http://bit.ly/1TmgQ5a)
  • Or the engineer who was asked whether her job at the company was to “photocopy [stuff]” and that she was “too pretty to code.” (http://bit.ly/1CHmRqx)
  • Or the whole ball of putrid, stinking mess that is Gamergate.
  • Or the women who have gone to conferences…and been groped. (http://read.bi/1YGqcMI)
  • Or the developer conference in which a company sponsored a Women in Games lunch…and then had an after-party featuring scantily clad women dancing (http://bit.ly/1VqXgcl)
  • Or any of the tons of other stories that I found online while feeling my heart sink deeper and deeper, wondering how the heck we in programming and technology as a whole fell this low this fast.

 

The industry of Grace Hopper. Ada Lovelace. The ENIAC Programmers. Adele Goldberg. So much of the foundations for what we do every single working day of our lives we owe to these and other noble women who labored to provide us, their future, with this amazing gift. And now? We get excited that big tech companies like Google and eBay have 17% of their developer force as women! 17%. Since when has 17% been a big number? 17% is only a big number when you’re a 3rd-party candidate running for public office. For myself, I went through college having only once had a computer science major class with more than 3 women in it, and that was only because it was cross-listed with the art department—which 2 of the women were from. We once could proudly say that nearly 40% of our graduates were women. Now? It’s less than half of that. Sure, there are tons of factors that can play into that, but none of them hurts more, none of them is quite as much a punch to the gut as brogrammers.

 

Because this is not who we are. This is not who we are supposed to be. We can and should be a hell of a lot better than this! We are the industry of outsiders. We are the ones who played D&D and Magic: The Gathering in high school instead of trying to be like everyone else. We are the ones who talked about science fiction until people’s ears fell off. We are the ones who, when other industries demand suits, ties, and other more “professional” attire, said “No, thank you.” We were going to show all of them how we do things—our way. And yet we’ve fallen victim to this just like everyone else has. We let this happen to ourselves. We have no one else to blame. No one else that we should blame except ourselves.

 

The first step in solving a problem is recognizing that there is one. Brogrammers are a dollar store mustard stain on our industry, and one that seemingly keeps on growing. Their sexism, as well as their lack of caring for the craft that we hold dear to our hearts, hurts all of us who strive for better every day. Who think that these attitudes are obtuse, thick-headed, and generally uncouth. But even you, the programming novice, can help out. Even you can do just one tiny little thing to help us fight this scourge of human garbage. Stay very far away from them. Don’t give them, with their pathetic, tiny little brains, the dignity of even acknowledging that they have some skill. And, if it’s not too much trouble, be sure to tell them all of this if they ever ask you why you don’t ask them for help. Because we don’t want them here. Don’t be quiet. Don’t be afraid to stand up against them. And don’t ever make the mistake of thinking that they have anything that we should learn from.

 

</rant>

 

Let me know how you feel in the comments below, and please feel free to share your stories about these and other tech troglodytes either here or by messaging me on Thwack. Until next time, I wish you good coding.

 

https://thwack.solarwinds.com/community/solarwinds-community/geek-speak_tht/blog/2015/12/01/moar-coding-tip-and-lots-of-caps

More often than not, application owners look to their vendors to provide a list of requirements for a new project, and the vendor forwards specifications that were developed around maximum load, back in the age of physical servers.  These requirements eventually make their way onto the virtualization administrator's desk: 8 Intel CPUs 2GHz or higher, 16 GB memory.  The virtualization administrator is left to fight the good fight with both the app owner and the vendors.  Now, we all cringe when we see requirements such as the above – we’ve worked hard to build out our virtualization clusters, pooling CPU and memory to present back to our organizations, and we constantly monitor our environments to ensure that resources are available to our applications when they need them – and then a list of requirements like the above comes across through a server build form or email.

 

So what's the big deal?

 

We have lots of resources, right?!? Why not just give the people what they want?  Before you just start giving away resources, let’s take a look at both CPU and memory and see just how VMware handles its scheduling and overcommitment of both.

 

CPU

 

One of the biggest selling points of virtualization is the fact that we can have more vCPUs attached to VMs than we have physical CPUs on our hosts and let the ESXi scheduler take care of scheduling the VMs' time on CPU.  So why don’t we just go ahead and give every VM 4 or 8 vCPUs?  You would think granting a VM more vCPUs would increase its performance – which it most certainly could – but the problem is you can actually hurt performance as well – not just on that single VM, but on the other VMs running on the host.  Since the physical CPUs are shared, there may be times when the scheduler has to place CPU instructions on hold and wait for physical cores to become available.  For instance, a VM with 4 vCPUs has to wait until 4 physical cores are available before the scheduler can execute its instructions, whereas a VM with 1 vCPU only has to wait for 1 logical core.  As you can tell, having multiple VMs, each containing multiple vCPUs, can end up causing a lot of queuing, waiting, and CPU ready time on the host, resulting in a significant impact to performance.  Although VMware has made strides in CPU scheduling by implementing "relaxed co-scheduling," it still only allows a certain time drift between the execution of instructions across cores and does not completely solve the issues around scheduling and CPU ready.  It's always a best practice to right-size your VMs in terms of number of vCPUs to avoid as many scheduling conflicts as possible.
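A quick sanity check you can run against your own inventory is the vCPU-to-physical-core ratio, plus a flag for unusually wide VMs. The sketch below uses made-up numbers and arbitrary thresholds; pull the real figures from vCenter or your monitoring tool and pick ratios your workloads actually tolerate.

    # Rough sketch: flag aggressive vCPU oversubscription and "wide" VMs
    # that are the most likely to accumulate CPU ready time.

    HOST_PCORES = 24                                   # physical cores on the host
    vms = {"db01": 8, "app01": 4, "app02": 4, "web01": 2, "web02": 2}  # vCPUs per VM

    total_vcpus = sum(vms.values())
    ratio = total_vcpus / HOST_PCORES
    print(f"vCPU:pCore ratio = {ratio:.1f}:1")

    if ratio > 3:                                      # threshold is a judgment call
        print("Oversubscription is getting aggressive; watch CPU ready closely.")

    for name, vcpus in vms.items():
        if vcpus >= HOST_PCORES / 4:                   # arbitrary 'wide VM' rule of thumb
            print(f"{name} has {vcpus} vCPUs -- confirm it really needs them.")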

 

Memory

 

vSphere employs many techniques when managing our virtual machine memory.  VMs can share memory pages with each other, eliminating redundant copies of the same memory pages.  vSphere can also compress memory, as well as use ballooning techniques that essentially let the hypervisor reclaim idle memory from one VM and give it to another.  This built-in intelligence almost masks any performance issues we might see from overcommitting RAM to our virtual machines.  That said, memory is often one of the resources we run out of first, and we should still take precautions and right-size our VMs to prevent waste.  The first thing to consider is overhead – by assigning additional, unneeded memory to our VMs we increase the amount of overhead memory used by the hypervisor to run the virtual machine, which in turn takes memory from the pool available to other VMs.  The amount of overhead is determined by the amount of assigned memory as well as the number of vCPUs on the VM, and although this number is small (roughly 150MB for a 16GB RAM/2vCPU VM), it can begin to add up as our consolidation ratios increase.  Aside from memory waste, oversized VMs also cause unnecessary waste on the storage end of things.  Each time a VM is powered on, a swap file is created on disk equal in size to the allocated memory.  Again, this may not seem like a lot of wasted space at the time, but as we create more and more VMs it can certainly add up to quite a bit of capacity.  Keep in mind that if there is not enough free space available to create this swap file, the VM will not be able to power on.
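To put some rough numbers on it, here is a small back-of-the-envelope sketch using the ~150MB overhead figure quoted above and the 1:1 swap file created at power-on. The VM counts and sizes are invented, and actual overhead varies by vSphere version and VM configuration.

    # Back-of-the-envelope cost of oversizing: idle RAM, hypervisor overhead,
    # and swap files on the datastore. All inputs are illustrative.

    vm_count           = 100
    assigned_gb        = 16      # memory granted to each VM
    actually_used_gb   = 8       # what the workload really needs
    overhead_gb_per_vm = 0.15    # ~150MB for a 16GB / 2 vCPU VM (see above)

    wasted_ram_gb  = vm_count * (assigned_gb - actually_used_gb)
    extra_overhead = vm_count * overhead_gb_per_vm
    swap_on_disk   = vm_count * assigned_gb            # swap file equals assigned RAM

    print(f"RAM assigned but idle:       {wasted_ram_gb} GB")
    print(f"Extra hypervisor overhead:   {extra_overhead:.0f} GB")
    print(f"Swap files on the datastore: {swap_on_disk} GB")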

 

Certainly these are not the only impacts that oversized virtual machines have on our environment.  They can also affect features such as HA, vMotion times, and DRS actions, but the ones above are some of the bigger issues.  Right-sizing is not something that's done once, either – it's important to constantly monitor your infrastructure and go back and forth with application owners as things change from day to day and month to month.  Certainly there are a lot of applications and monitoring systems out there that can perform this analysis for us, so use them!

 

All that said, discovering the over- and under-sized VMs within our infrastructure is probably the easiest leg of the journey of reclamation.  Once we have some solid numbers and metrics in hand, we need to somehow present them to other business units and application owners to try and claw back resources – this is where the real challenge begins.  Placing a dollar figure on everything and utilizing features such as showback and chargeback may help, but again, it's hard to take something back after it's been given.  So my questions to leave you with this time are as follows.  First up, how do you ensure your VMs are right-sized?  Do you use a monitoring solution available today to do so?  And if so, how often are you evaluating your infrastructure for right-sized VMs (monthly, yearly, etc.)?  Secondly, what do you foresee as the biggest challenges in trying to claw back resources from your business?  Do you find that it's more of a political challenge or simply an educational challenge?

Does anyone else remember January 7th, 2000? I do.  That was when Microsoft announced the Secure Windows Initiative: an initiative dedicated to making Microsoft products secure.  Secure from malicious attack.  Where secure code was at the forefront of all design practices, so that practitioners like us wouldn't have to worry about patching our systems every Tuesday, and wouldn't need to worry about an onslaught of viruses and malware and you name it!  That is not to say that prior to 2000 they were intentionally writing bad code, but it is to say that they made a hearty and conscious decision to make sure that code IS written securely.  So was born the era of SecOps.

 

16 years have passed, and not a year has gone by in that time where I haven't heard organizations (Microsoft included) say, "We need to write our applications securely!" like it is some new idea that they've discovered for the first time.  Does this sound familiar to your organization, to the businesses and processes you have to work with?  Buzzwords come out in the marketplace and make it into a magazine.  Perhaps new leadership or new individuals come in and say, "We need to be doing xyzOps! Time to change everything up!"

 

But to what end do we go through that?  There was a time when people adopted good, consistent, solid practices; educated and trained their employees; maintained well-refined processes that aligned with the business; and ran technology that wasn't 10 years out of date, so they could handle and manage their requirements.  But we could toss all of that out the window so we can adopt the flavor of the week, right?

 

That said, some organizations, businesses, or processes receive nothing but the highest accolades and benefits by adopting a different or stricter regime for how they handle things.  DevOps, for the right applications or organization, may be the missing piece of the puzzle that enables agility or capabilities which truly were foreign prior to that point.

 

If you can, share what tools you found that brought about your success, or that were less than useful, in realizing the dream ideal state of xyzOps.  I've personally found that having buy-in and commitment throughout the organization was the first step to success when it came to adopting anything which touches every element of a transformation.

 

What are some of your experiences with organizational shifts, waxing and waning across technologies?  Where was it successful, and where was it fraught with failure, like adopting ITIL in full without considering what it takes to be successful?  Your experiences are never more important than now to show others what the pitfalls are, how to overcome challenges, and also where things tend to work out well and where they fall short.

 

I look forward to reading about your successes and failures so that we can all learn together!

In part one of this series we looked at the pain that network variation causes. In this second and final post we’ll explore how the network begins to drift and how you can regain control. 

 

How does the network drift?

It’s very hard to provide a lasting solution to a problem without knowing how the problem occurred in the first instance. Before we look at our defenses we should examine the primary causes of highly variable networks.

 

  • Time – The number one reason for shortcuts is that it takes too long to do it the ‘right way’.
  • Budget – Sure it’s an unmanaged switch. That means low maintenance, right?
  • Capacity – Sometimes you run out of switch ports at the correct layer, so new stuff is connected to the wrong layer. It happens.
  • No design or standards – The time, budget and capacity problems are exacerbated by a lack of designs or standards.

 

Let’s walk through an example scenario. You have a de-facto standard of using layer-2 access switches, and an L3 aggregation pair of chassis switches. You’ve just found out there’s a new fifth-floor office expansion happening in two weeks, with 40 new GigE ports required.

 

You hadn’t noticed that your aggregation switch pair is out of ports so you can’t easily add a new access-switch. You try valiantly to defend your design standards, but you don’t yet have a design for an expanded aggregation-layer, you have no budget for new chassis and you’re out of time. 

 

So, you reluctantly daisy chain a single switch off an existing L2 access switch using a single 1Gbps uplink. You don’t need redundancy; it’s only temporary. Skip forward a few months: you’ve moved on to the next crisis, and you’re getting complaints of the dreaded ‘slow internet’ from the users on the fifth floor. Erm..

 

The defense against drift

Your first defense is knowing this situation will arise. It’s inevitable. Don’t waste your time trying to eliminate variation; your primary role is to manage the variation and limit the drift. Basic capacity planning can be really helpful in this regard.

 

Another solution is to use ‘generations’ of designs. The network is in constant flux but you can control it by trying to migrate from one standard design to the next. You can use naming schemes to distinguish between the different architectures, and use t-shirt sizes for different sized sites: S, M, L, XL. 

 

At any given time, you would ideally have two architectures in place, legacy and next-gen. Of course the ultimate challenge is to age-out old designs, but capacity and end-of-life drivers can help you build the business case to justify the next gen design.

 

But how do you regain control of that beast you created on the fifth floor? It’s useful to have documentation of negative user feedback, but if you can map and measure the performance of this network and show that impact, then you’ve got a really solid business case.

 

A report from a network performance tool showing loss, latency and user pain, coupled with a solid network design makes for a solid argument and strong justification for an upgrade investment.

networkautobahn

Logging and beyond

Posted by networkautobahn Jun 2, 2016

Compared to logging, monitoring is nice and clean. In monitoring you are used to looking at data that is already normalized, so you basically have the same look and feel for statistics from different sources, e.g. switches, routers, and servers. Of course you will have different services and checks across the different device types, but some of these interface statistics can be compared easily with each other. Here you are always looking at normalized data. In the logging world you are facing very different log types and formats. An interface that is down will look identical in the monitoring no matter whether it is on the switch or on the connected server. If you then want to find the error message for that downed interface in the logs of the switch and the server, you will find two completely different outputs. Even how you access the logs is different: on a switch it is usually an SSH connection and a show command, while on a Windows-based server it is probably an RDP session and the "Event Viewer". It is mostly a manual and time-consuming process to compare the different logs with each other and find the root cause of the problem. Another problem is that many devices have only limited storage for logs or, even worse, lose all stored logs after a reboot. Sometimes, after an unexpected reboot of a device, you end up with nothing in your hands to figure out what caused the reboot.

 

We can do better by sending all the logs to a centralized logging server. It stores all log data independently of its origin, which also reduces the time needed for information gathering. Often, once all the logs are concentrated at one point, you will see that many devices have different time stamps on their log messages. To make the logs easily consumable, it is important that all log sources use the same time source and point to a synchronized NTP server. Once the centralization problem is solved, the biggest benefit comes from the normalization of the log data into logical fields that are searchable. This is something that is often done by a SIEM solution implemented to address the security aspect of logging, but I have seen a lot of SIEM projects where the centralized logging and normalization approach also improves troubleshooting capabilities significantly. With all the logs in the same place and format, you can find dependencies that are not visible in the monitoring. For example, I was facing periodic reboots on a series of modular routers. In the monitoring, all the performance graphs looked normal and the router answered all SNMP- and ICMP-based checks as expected, until it rebooted without any warning. So I looked into the log data and found that 24 hours before each reboot, a "memory error message" showed up on all of the routers.

Because the vendor needed some time to deliver a bug fix release that addressed this issue, we needed proper alarming in the meantime. Every time the centralized logging server captured the "memory error message" that preceded the reboot, we created an alarm, so that we could at least prepare a scheduled reboot, manually initiated in a time frame when it affected fewer users. That was a blind spot in the monitoring system, and sometimes we can improve the alarming with a combination of logging and active checks. So, after you have found the root cause of your problem, ask yourself how you can prevent it from causing an outage the next time. When there is a possibility to achieve that with logging, it is worth the effort. You can start small and add more log messages over time that trigger events that are important to you.
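As a toy illustration of that last point, here is a minimal sketch of a log watcher that raises an alarm when the tell-tale message shows up. The log path, the pattern, and the alert hook are placeholders rather than any particular product's API; a real SIEM or log management tool would do this with a saved search or alert rule.

    # Minimal log watcher: tail a centralized syslog file and alert when the
    # message that precedes the reboot appears.

    import re
    import time

    PATTERN = re.compile(r"memory error", re.IGNORECASE)
    LOGFILE = "/var/log/network/routers.log"      # centralized syslog destination

    def raise_alert(line):
        # Replace with your real notification path (email, ticket, SNMP trap...).
        print("ALERT: possible reboot within 24h ->", line.strip())

    def follow(path):
        with open(path) as f:
            f.seek(0, 2)                          # start at the end, like tail -f
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line

    for line in follow(LOGFILE):
        if PATTERN.search(line):
            raise_alert(line)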


I've been working with SQL Server since what seems like forever ++1. The truth is I haven't been a production DBA in more than 6 years (I work in marketing now, in case you didn't know). That means I will soon hit a point in my life where I will have been an ex-DBA for the same period of time as I was a production DBA (about seven years). I am fortunate that I still work with SQL Server daily, and am consulted from time to time on various projects and performance troubleshooting. It helps keep my skills sharp. I also get to continue to build content as part of my current role, which is a wonderful thing because one of the best ways to learn something is to try to teach it to others. All of this means that over the years I've been able to compile a list of issues that I would consider to be common with SQL Server (and other database platforms like Oracle; no platform is immune to such issues). These are the issues that are often avoidable but not always easy to fix once they have become a problem. The trick for senior administrators such as myself is to help teams understand the costs, benefits, and risks of their application design options so as to avoid these common problems. So, here is my list of the top 5 most common problems with SQL Server.

 

Indexes

Indexes are the number one cause of problems with SQL Server. That doesn't mean SQL Server doesn't do indexes well. These days SQL Server does indexing quite well, actually. No, the issue with indexes and SQL Server has to do with how easy it is for users to make mistakes with regard to indexing. Missing indexes, wrong indexes, too many indexes, outdated statistics, or a lack of index maintenance are all common issues for users with little to no experience (what we lovingly call 'accidental DBAs'). I know, this area covers a LOT of ground. The truth is that with a little bit of regular maintenance a lot of these issues disappear. Keep in mind that your end-users don't get alerted that the issue is with indexing. They just know that their queries are taking too long, and that's when your phone rings. It's up to you to know and understand how indexing works and how to design proper maintenance.
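If you want a starting point for that regular maintenance, here is a hedged sketch that lists the most fragmented indexes using the standard sys.dm_db_index_physical_stats DMV. The connection string, database, and the 30%/1000-page thresholds are placeholders; adjust them to your own maintenance standards.

    # Sketch of a recurring index health check against one database.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;"
        "DATABASE=SalesDB;Trusted_Connection=yes;"
    )

    QUERY = """
    SELECT  OBJECT_NAME(ips.object_id) AS table_name,
            i.name                     AS index_name,
            ips.avg_fragmentation_in_percent
    FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
            ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE   ips.index_id > 0
            AND ips.avg_fragmentation_in_percent > 30
            AND ips.page_count > 1000
    ORDER BY ips.avg_fragmentation_in_percent DESC;
    """

    for table, index, frag in conn.cursor().execute(QUERY):
        print(f"{table}.{index}: {frag:.1f}% fragmented -- rebuild/reorganize candidate")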

 

Poor design decisions

Everyone agrees that great database performance starts with great database design. Yet we still have issues with poor datatype choices, the use of nested views, lack of data archiving, and relational databases with no primary or foreign keys defined. Seriously. No keys defined. At all. You might as well have a bunch of Excel spreadsheets tied together with PowerShell, deploy them to a bunch of cluster nodes with flash drives and terabytes of RAM, and then market that as PowerNoSQL. You're welcome. It can be quite difficult to make changes to a system once it has been deployed to production, making poor design choices something that can linger for years. And that bad design often forces developers to make decisions that end up with...

 

Bad Code

Of course saying 'bad code' is subjective. Each of us has a different definition of bad. To me the phrase 'bad code' covers examples such as unnecessary cursors, incorrect WHERE clauses, and a reliance on user-defined functions (because T-SQL should work similar to C++, apparently). Bad code on top of bad design will lead to concurrency issues, resulting in things like blocking, locking, and deadlocks. Because of the combination of bad code on top of poor design there has been a significant push to make the querying of a database something that can be automated to some degree. The end result has been a rise in the use of...

 

ORMs

Object-Relational Mapping (ORM) tools have been around for a while now. I often refer to such tools as code-first generators. When used properly they can work well. Unfortunately they often are not used properly, with the result being bad performance and wasted resources. ORMs are so frequent a problem that it has become easy to identify that they are the culprit. It's like instead of wiping their fingerprints from a crime scene the ORM will instead find a way to leave fingerprints, hair, and blood behind, just to be certain we know it is them. You can find lots of blog entries on the internet regarding performance problems with ORMs. One of my favorites is this one, which provides a summary of all the ways something can go wrong with an ORM deployment.

 

Default configurations

Because it's easy to click 'Next, Next, OK' and install SQL Server without any understanding about the default configuration options. This is also true for folks that have virtualized instances of SQL Server, because there's a good chance the server admins also choose some default options that may not be best for SQL Server. Things like MAXDOP, tempdb configuration, transaction log placement and sizing, and default filegrowth are all examples of options that you can configure before turning over the server to your end users.
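A simple way to catch a 'Next, Next, OK' install after the fact is to compare the running configuration against your own build standard. The sketch below reads sys.configurations; the expected values are examples of the pattern, not recommendations, so fill in whatever your standard actually says.

    # Compare a handful of instance settings against a build standard.

    import pyodbc

    EXPECTED = {
        "max degree of parallelism":      4,   # example values only --
        "cost threshold for parallelism": 50,  # use your own standard
        "backup compression default":     1,
    }

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;Trusted_Connection=yes;"
    )
    rows = conn.cursor().execute(
        "SELECT name, CAST(value_in_use AS int) FROM sys.configurations"
    )
    actual = {name: value for name, value in rows}

    for setting, want in EXPECTED.items():
        got = actual.get(setting)
        if got != want:
            print(f"'{setting}' is {got}, build standard says {want}")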

 

The above list of five items is not scientific by any means; these are simply the problems that I find to be the most common. Think of them as buckets. When you are presented with troubleshooting performance, or even reviewing a design, these buckets help you to rule out the common issues and allow you to then sharpen your focus.

IT infrastructure design is a challenging topic. Experience in the industry is an irreplaceable asset to an architect, but closely following that in terms of importance is a solid framework around which to base a design. In my world, this is made clear by looking at design methodology from organizations like VMware. In the VCAP-DCD and VCDX certification path, VMware takes care to instill a methodology in certification candidates, not just the ability to pass an exam.

 

Three VCDX certification holders (including John Arrasjid who holds the coveted VCDX-001 certificate) recently released a book called IT Architect: Foundation in the Art of Infrastructure Design which serves exactly the same purpose: to give the reader a framework for doing high quality design.

 

In this post, I’m going to recap the design characteristics that the authors present. This model closely aligns with (not surprisingly) the model found in VMware design material. Nonetheless, I believe it’s applicable to a broader segment of the data center design space than just VMware-centric designs. In a follow-up post, I will also discuss Design Considerations, which relate very closely to the characteristics that follow.

 

Design Characteristics

Design characteristics are a set of qualities that can help the architect address the different components of a good design. The design characteristics are directly tied to the design considerations which I’ll discuss in the future. By focusing on solutions that can be mapped directly to one (or more) of these five design characteristics and one (or more) of the four considerations that will follow, an architect can be sure that there’s actually a purpose and a justification for a design decision.

 

It’s dreadfully easy – especially on a large design – to make decisions just because they make sense at first blush. Unfortunately, things happen in practice that cause design decisions to have to be justified after the fact. And if they’re doing things correctly, an organization will require all design decisions to be justified before doing any work, so this bit is critical.

 

Here are the 5 design characteristics proposed by the authors of the book:

 

Availability – Every business has a certain set of uptime requirements. One of the challenges an architect faces is accurately teasing these out. Once availability requirements are defined, design decisions can be directly mapped to this characteristic.

 

For example, “We chose such and such storage configuration because in the event of a loss of power to a single rack, the infrastructure will remain online thus meeting the availability requirements.”

 

Manageability – This characteristic weighs the operational impacts that a design decision will have. A fancy architecture is one thing, but being able to manage it from a day-to-day perspective is another entirely. By mapping design decisions to Manageability, the architect ensures that the system(s) can be sustainably managed with the resources and expertise available to the organization post-implementation.

 

For example, “We chose X Monitoring Tool over another option Y because we’ll be able to monitor and correlate data from a larger number of systems using Tool X. This creates an operational efficiency as opposed to using Y + Z to accomplish the same thing.”

 

Performance – As with availability, all systems have performance requirements, whether they’re explicit or implicit. Once the architect has teased out the performance requirements, design decisions can be mapped to supporting these requirements. Here’s a useful quote from the book regarding performance: “Performance measures the amount of useful work accomplished within a specified time with the available resources.”

 

For example, “We chose an all-flash configuration as opposed to a hybrid configuration because the performance requirements mandate that response time must be less than X milliseconds. Based on our testing and research, we believe an all-flash configuration will be required to achieve this.”

 

Recoverability – Failure is a given in the data center. Therefore, all good designs take into account the ease and promptness with which the status quo will be restored. How much data loss can be tolerated is also a part of the equation.

 

For example, “Although a 50 Mbps circuit is sufficient for our replication needs, we’ve chosen to turn up a 100 Mbps circuit so that the additional bandwidth will be available in the event of a failover or restore. This will allow the operation to complete within the timeframe set forth by the Recoverability requirements.”

 

Security – Lastly - but certainly one of the most relevant today - is Security. Design decisions must be weighed against the impact they’ll have on security requirements. This can often be a tricky balance; while a decision might help improve results with respect to Manageability, it could negatively impact Security.

 

For example, “We have decided that all users will be required to use two-factor authentication to access their desktops. Although Manageability is impacted by adding this authentication infrastructure, the Security requirements can’t be satisfied without 2FA.”

 

Conclusion

I believe that although infrastructure design is as much an art as it is a science – as the name of the book I’m referencing suggests – leveraging a solid framework or lens through which to evaluate your design can help make sure there aren’t any gaps. What infrastructure design tools have you leveraged to ensure a high quality product?

Think about your network architecture. Maybe it's something older that needs more attention. Or perhaps you're lucky enough to have something shiny and new. In either case, the odds are very good that you have a few devices in your environment that you just can't live without. Maybe it's some kind of load balancer or application delivery controller. Maybe it's an IP Address Management (IPAM) device that was built ages ago but hasn't been updated in forever.

The truth of modern networks is that many of them rely on devices like this as a lynchpin to keep important services running. If those devices go down, so too do the services that you provide to your users. Gone are the days when a system could just be powered down until the screaming started. Users have come to rely on the complicated mix of products in their environment to perfect their workflows to do the most work possible in the shortest amount of time. So how can these problem devices be dealt with?

Know Your Enemy

First and foremost, you must know about these critical systems. You need to have some kind of monitoring system in place that can see when these devices are performing at peak efficiency or when they aren't doing so well. You need to have a solution that can look outside simple SNMP strings and give you a bigger picture. What if the hard drive in your IPAM system is about to die? What about when the network interface on your VPN concentrator suddenly stops accepting 50% of the traffic headed toward it? These are things you need to know about ASAP so they can be fixed with a minimum of effort.

Your monitoring solution should help you keep track of these devices while giving you plenty of options for alerts. A VPN concentrator isn't going to be a problem if it's offline during the workday. But if it goes down the night before the quarterly reports are due, the CFO is going to be calling and will need answers. Make sure you can configure your device profiles with alert patterns that give you a chance to fix things before they become problems. Also make sure that the alerts help you keep track of the individual pieces of the solution, not just the up or down status of the whole unit.
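One way to think about those device profiles is as per-component rules with time-aware severity. This is only a toy illustration: the device names, components, and after-hours window are made up, and a real monitoring platform would express the same idea in its own alert configuration.

    # Toy alert-severity lookup: the same component on the same device can
    # matter more at certain times of day.

    from datetime import datetime

    ALERT_RULES = {
        "vpn-concentrator-01": {
            "interface": {"severity": "critical"},
            "cpu":       {"severity": "warning"},
            # Remote users (and the CFO's quarter-end reports) depend on this
            # box outside business hours, so warnings escalate at night.
            "escalate_after_hours": True,
        },
        "ipam-01": {
            "disk": {"severity": "critical"},   # a dying drive, not just up/down
            "escalate_after_hours": False,
        },
    }

    def severity(device, component, now):
        rule = ALERT_RULES[device]
        base = rule.get(component, {}).get("severity", "info")
        after_hours = now.hour < 7 or now.hour >= 19
        if after_hours and rule["escalate_after_hours"] and base == "warning":
            return "critical"
        return base

    print(severity("vpn-concentrator-01", "cpu", datetime(2016, 6, 30, 23, 0)))
    # -> 'critical': a mere warning on this device escalates the night before close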

Be Ready To Replace

The irony of being stuck with these "problem children" types of devices is that they are the ones that you want to replace more than anything but can't seem to find a way to remove. So how can you advocate for the removal of something so critical?

The problem with these devices is not that the hardware itself is indispensable. It's that the service the hardware (or software) provides is critical. Services can be provided in many different ways. So long as you know what service is being provided, you can create an upgrade path to remove hardware before it gets to the "problem child" level of annoyance.

Most indispensable services and devices get that way because no one is keeping track of who is using them or how they are being used. Workflows created to accomplish a temporary goal often end up becoming a permanent fixture. It's important to keep a record of all the devices in your network and know how often they are being used. Regularly update that list to know what has been recently accessed and for how long. If the device is something that is scheduled to be replaced soon, a preemptive email about the service change will often find a few laggard users that didn't realize they were even using the device. That will help head off any calls after it has been decommissioned and retired to a junk pile.

Every network has problem devices that are critical. The trick to keeping them from becoming real problems lies less in trying to do without them and more in knowing how they are performing and who is using them. With the right solutions in place to keep a wary eye on them, and a plan in place to replicate and eventually replace the services they provide, you can sleep a bit better at night knowing that your problem children will be a little less problematic.

sqlrockstar

The Actuator - June 1st

Posted by sqlrockstar Employee Jun 1, 2016

Welcome to June! 2016 is almost half over so now might be a good time to go back to those resolutions from New Year's Day and make some edits so you won't feel as awful when December comes around.

 

Anyway, here is this week's list of things I find amusing from around the internet...

 

LinkedIn's Password Fail

Let's turn lemons into lemonade here and look at it this way: LinkedIn found a way to get everyone to remember LinkedIn still exists!

 

Massive Infographic of Star Wars A New Hope

WARNING: DO NOT CLICK UNLESS YOU HAVE TIME TO KILL AND TIME TO SCROLL!

 

Microsoft Cuts More Jobs In Troubled Mobile Unit

After reading this I had but one question: Microsoft still has a mobile unit? I've never met a non-Microsoft employee with a Windows Phone. Never.

 

Microsoft bans common passwords that appear in breach lists

This is wonderful to see, and a great use of taking data they are collecting and using it in a positive way for the benefit of their customers.

 

The Mind-Boggling Pace of Computing

The PS4 will have 150x the computing power of IBM's Deep Blue. Eventually there will come a time when technology will become obsolete the second it appears.

 

China's 'Air Bus' Rides Above The Traffic

I don't see this as a practical upgrade to public transportation, but I do give it high marks for creativity.

 

Ad blocking: reaching a tipping point for advertisers, publishers and consumers

I do not understand why anyone is surprised that consumers don't want to see annoying ads. The content creators should be finding ways to engage me in a way that I want to see their content.

 


Leon Adato

Submit Your Slogans!

Posted by Leon Adato Expert Jun 1, 2016


 

Some of the best conversations I've had at trade shows started with a button.

 

People swing by our booth at VMWorld, MS:Ignite, or CiscoLive (psssst! We'll be there again this coming July 10-14), and their eye is caught by a button proclaiming:

[button image: I_am_root.jpg]

 

or a sticker that says

[sticker image: Your_F1.jpg]

 

...and they have to have it. And then they have to talk about why they have to have it. And before you know it, we're having a conversation about monitoring, or configuration management, or alert trigger automation.

 

And that's what conventions are all about, right? Finding common ground, connecting with similar experiences, and sharing knowledge and ideas.

 

It has gotten to the point where (and I swear I'm not making this up) people rush up to the booth to dig through our buckets of buttons and piles of stickers, looking for the ones they don’t have, yet.

 

Here’s a secret: One of my favorite things about working at SolarWinds is that we have whole meetings dedicated to brainstorming new phrases. I get a rush of excitement when something I suggested is going to be on the next round of convention swag.

 

And for those of us who have the privilege of going to shows and seeing people eagerly snatch up their new prize? The pride we feel when OUR submission is the thing that is giving attendees that burst of joy is akin to watching your baby take her first steps.

 

We want to share that experience with you, our beloved THWACK community.

 

Now this is nothing new. We’ve made requests for slogans before, most recently here and here. The difference is that we’re turning this into an ongoing “campaign” on Thwack.

 

Submit your slogan suggestions for buttons, stickers, or even t-shirts here: https://thwack.solarwinds.com/community/solarwinds-community/fun-and-geeky. Once a quarter a top-secret panel of judges will convene and review the entries, picking the top 5 entries.

 

If one of your submissions is chosen to be immortalized, you will receive - along with immeasurable pride and boundless joy – 1,000 THWACK points. Not only that, but you will earn your slogan as a badge in your THWACK profile. Best of all, you can submit and win as many times as your creativity and free time at work will allow.

 

So let those creative juices flow and unleash your enthusiasm for all things IT and geeky. We can't wait to see what you come up with.

[comic: Corporate Dilemma.jpg]

 

 

Have you ever faced this problem before?  Whether you are brand new to the industry, a technology, or a position, or you have been fighting issues and troubleshooting in the trenches for decades upon decades, this problem rears its ugly head, often in the form of “We don’t have the budget to train our employees,” or, as the comic captioned above so eloquently puts it, “If we train our employees they might leave.”

 

This is a quandary I have personally experienced in my lifetime, and one I know others have equally faced, from technician to engineer to architect, across all roles within an organization. That is not to say that all organizations suffer this same fate: technology vendors, often the ones who have their own certification programs, tend to support education, training, and certification, and partners or resellers often tend to support and embrace education within the workplace.  There are even organizations that dictate requirements like “your position must have ‘x’ education/certification in order to advance within the organization.”  It would be awkward to have such requirements yet no organizational support to achieve them.

 

Yet even given those few scenarios where I have seen organizations be supportive of training, I could pull a dozen administrators into a room – system, network, SysOps, and DevOps – and probably only 2 to 3 out of each of those groups would say their organization supports education, whether financially, by providing time, or with training and resources to pursue it.

 

I can think of more admins than not who spend countless hours educating themselves, searching out and researching problems, constantly staying up on tomorrow’s technology while supporting the systems of yesterday, even those who regularly read up on and comment on forums like this within the Thwack community.  You are the heroes, the rock stars, those who, with or without your organization’s support, continue to pursue your own evolution.

 

I’ve published countless resources in the past for others to educate themselves with: free or discounted certification programs, including one I’ll mention here, where SolarWinds has a free training & certification program to become a SolarWinds Certified Professional (SCP): Interested in becoming a SolarWinds Certified Professional FOR FREE?!?!?

(I just checked and it seems to still work [someone let me know if it's not still free!], two years on from when I wrote that blog post, so it is worth adding to your arsenal if you’re cert-limited and want to know what it feels like to have resources to learn something and then get a cert around it.)  But enough about me!

 

What are some of the ways you, the people, are able to keep up to date on things, to continually grow and educate yourselves?  Share your experiences of how you cope with organizations who do not support your advancement for fear you may ‘jump ship’ with your new knowledge.  Or, if you are one of the few with a supportive organization – whether company, partner, or vendor – that gives you the fuel to light the fire of your mind, or simply the training to continue supporting and maintaining your existing environments, share that too.  And anywhere in between.

In my last post, we talked about the Business going outside of your I.T. controls and self-provisioning Software as a Service solutions. Most of you were horrified and could identify a number of internal policies that a SaaS solution wouldn’t comply with. You understood that these policies are there to protect the organization’s confidential information or intellectual property, so why is it so hard for the Business to grasp those implications?

 

I’ve just finished reading “The Phoenix Project” by Gene Kim, Kevin Behr, and George Spafford. Honestly, I couldn’t put it down.

 

One of the characters is an over-zealous Chief Security Officer who wants to tie I.T. down tightly enough to meet every point in a third-party security audit. In reality, the business already has processes and procedures in place in the finance department that make those controls in the I.T. systems unnecessary.

 

In fact, it made better business sense for a human responsible for the money in the organization to watch out for these red flags instead of coding the computer systems to do it. I’m not saying that’s going to be the case in every situation, and I.T. controls certainly have their place in mitigating organizational risk, but do they have to prevent every possible risk?

 

The UK government’s CESG (the information security arm of GCHQ) is now advising that passwords should only be changed ‘on indication or suspicion of compromise’, throwing the old 30- or 90-day expiry out of the window. While that may sound insane at first, they note that regular password changes force people to store passwords somewhere just to remember them, or to re-use the same base with a minor variation.

 

Tough I.T. controls or policies can often lead to people inventing workarounds. I’ll bet that someone in your organization has given their password out to a co-worker because they were away sick and it was quicker than asking I.T. to sort out delegated access to their mailbox. Or perhaps they ran late at a meeting and someone needed to unlock their computer.

 

If you could wave a magic wand, what I.T. policies or controls would you relax to make life easier for you and the end users? Could you do this and retain (or even improve) the security and stability of your systems? Or is this all just crazy talk and you should be locking down your systems even more?

 

Let me know your thoughts!

 

-SCuffy

Tomorrow is the first day of the Caribbean hurricane season, and that means named storms, power outages, and the need for IT emergency preparedness. Now is a great time to make sure your disaster toolbox is well stocked, before a major calamity strikes. And as a federal IT manager, you always have to be prepared for the unnatural disaster as well, such as a cyber-attack.

 

The scary thing is that even the idea of creating a disaster recovery plan has been put on the backburner at many government agencies. In fact, according to a federal IT survey we conducted last year, over 20 percent of respondents said they did not have a disaster preparedness and response plan in place.

 

We suggest that you make sure you have a plan in place, and follow these best practices:

 

Continuously monitor the network. Here’s a phrase to remember: “collect once, report to many.” This means installing software that automatically and continuously monitors IT operations and security domains, making it easier for federal IT managers to pinpoint – or even proactively prevent – problems related to network outages and system downtime.

 

Continuous monitoring can give IT professionals the information needed to detect abnormal behavior much faster than manual processes. This can help federal managers react to these challenges quickly and reduce the potential for extended downtime.
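
To make that concrete, here is a minimal sketch in Python of what an automated, continuous check might look like. The hostnames, ports, polling interval, and print-based alert are purely illustrative assumptions, not a reference to any particular monitoring product – a real deployment would rely on a full-featured monitoring platform rather than a hand-rolled script.

# continuous_check.py - a minimal, illustrative polling loop (not production code)
import socket
import time
from datetime import datetime

# Hypothetical list of critical hosts and the TCP ports they should answer on
CRITICAL_SERVICES = {
    "dns-server.example.gov": 53,
    "mail-gateway.example.gov": 25,
    "intranet.example.gov": 443,
}

POLL_INTERVAL_SECONDS = 60  # how often to re-check every service


def is_reachable(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def alert(message):
    """Placeholder alert hook; a real deployment would page, email, or open a ticket."""
    print(f"[ALERT {datetime.now().isoformat()}] {message}")


if __name__ == "__main__":
    while True:
        for host, port in CRITICAL_SERVICES.items():
            if not is_reachable(host, port):
                alert(f"{host}:{port} is not responding")
        time.sleep(POLL_INTERVAL_SECONDS)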

 

Monitor devices, not just the infrastructure. You need to keep track of all of the devices that impact your network, including desktops, laptops, smartphones and tablets.

 

For this, consider implementing tools that can track individual devices. First, devise a whitelist of devices acceptable for network access. Then, set up automated alerts that notify you of non-whitelisted devices tapping into the network or any unusual activity. Most of the time, these alerts can be tied directly to specific users. This tactic can be especially helpful in preventing those non-weather-related threats I referred to earlier.
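
As a rough sketch of that whitelist-and-alert idea, here is an illustrative Python example. The MAC addresses are made up, the print-based alert stands in for whatever notification mechanism your environment actually uses, and in practice the list of observed devices would come from DHCP leases, ARP or switch tables, or your monitoring tools.

# device_whitelist_check.py - illustrative sketch of alerting on non-whitelisted devices
from datetime import datetime

# Hypothetical whitelist of approved device MAC addresses
APPROVED_DEVICES = {
    "00:1a:2b:3c:4d:5e",  # admin laptop
    "00:1a:2b:3c:4d:5f",  # monitoring server
}


def alert(message):
    """Placeholder alert hook; swap in email, SMS, or ticketing integration as needed."""
    print(f"[ALERT {datetime.now().isoformat()}] {message}")


def check_devices(seen_devices):
    """Compare devices observed on the network against the whitelist and alert on strangers."""
    for mac, description in seen_devices.items():
        if mac.lower() not in APPROVED_DEVICES:
            alert(f"Non-whitelisted device on the network: {mac} ({description})")


if __name__ == "__main__":
    # In practice this mapping would come from DHCP leases, ARP tables, or your NMS.
    observed = {
        "00:1a:2b:3c:4d:5e": "known laptop",
        "aa:bb:cc:dd:ee:ff": "unknown tablet",
    }
    check_devices(observed)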

 

Plan for remote network management. There’s never an opportune time for a disaster, but some occasions are just, well, disastrous. For example, when a hurricane knocks out electricity in your data center and you’re stuck at home thinking, “Yeah, right.” In such cases, you’ll want to make sure you have software that allows you to remotely manage and fix anything that might adversely impact your network.

 

Remote management technology typically falls into two categories: in-band and out-of-band. Both get the job done in their particular circumstances, but there are some instances where remote management is insufficient. It’s perfectly adequate when your site loses power or your network goes offline, but in the face of a major catastrophe – massive floods, for example – you’ll need onsite management. In many cases, however, remote management tools will be more than enough to get you through a rough spot without you having to get to the office.

 

Each of these best practices, and the technologies associated with them, are like backup generators. You may never need to use them, but when and if you do, you’ll be glad you have them at your disposal.

 

Find the full article on Government Computer News.


The DevOps Days - Washington, DC event is right around the corner and it turns out I happen to be sitting on 3 tickets. You know what that means!

 

As I did for the Columbus DevOps Days, I'm going to be giving these babies away 100% free to the first people who post a selfie.

 

A selfie that is worthy of its own $55,000 Kickstarter. A selfie that shows your creativity, humor, patriotism, and flair.

 

That's right, I want a

SUPER MEGA GEEKY WARHEAD MEMORIAL DAY SELFIE

 

Now, I'm not here to tell you what that means (heck, I just made up that sentence 10 seconds ago). Even if you don't celebrate Memorial Day (looking at you, jbiggley), if you can get your geeky cheeks down to DC next week, those tickets could be yours.

 

All I'm saying is that if you are able to attend the DevOps Days event in DC on June 8 & 9, and you post a selfie in the comments below, then one of those golden tickets is yours yours YOURS!!

 

So, have a safe, responsible, geeky, fun, relaxing, non-stressful Memorial Day (or as you call it in other countries, "Monday") and start selfie-ing!
