
In my previous post, I reviewed the 5 Infrastructure Characteristics that will be included as part of a good design. The framework is laid out in the great work IT Architect: Foundations in the Art of Infrastructure Design. In this post, I’m going to continue that theme by outlining the 4 Considerations that will also be a part of that design.

 

While the Characteristics could also be called “qualities” and can be understood as a list of ways by which the design can be measured or described, Considerations could be viewed as the box that defines the boundaries of the design. Considerations set things like the limits and scope of the design, as well as explain what the architect or design team will need to be true of the environment in order to complete the design.

 

Design Considerations

I like to think of the four considerations as the four walls that create the box that the design lives in. When I accurately define the four different walls, the design that goes inside is much easier to construct. There are fewer “unknowns,” and I leave myself less exposed to faults or holes in the design.

 

Requirements – Although they’re all very important, I would venture to say that Requirements is the most important consideration. “Requirements” is a list, either identified directly by the customer/business or teased out by the architect, of things that must be true about the delivered infrastructure. Some examples listed in the book are a particular Service Level Agreement metric that must be met (like uptime or performance) or governance or regulatory compliance requirements. Other examples I’ve seen could be usability/manageability requirements dictating how the system(s) will be interfaced with, or a requirement that a certain level of redundancy must be maintained. For example, the configuration must allow for N+1, even during maintenance.

 

Constraints – Constraints are the considerations that determine how much liberty the architect has during the design process. Some projects have very little in the way of constraints, while others are extremely narrow in scope once all of the constraints have been accounted for. Examples of constraints from the book include budgetary constraints or the political/strategic choice to use a certain vendor regardless of other technically possible options. More examples that I’ve seen in the field include environmental considerations like “the environment is frequently dusty and the hardware must be able to tolerate poor environmentals” and human resource constraints like “it must be able to be managed by a staff of two.”

 

Risks – Risks are the architect’s tool for vetting a design ahead of time and showing the customer/business the potential technical shortcomings of the design imposed by the constraints. It also allows the architect to show the impact of certain possibilities outside the control of either the architect or the business. A technical risk could be that N+1 redundancy actually cannot be maintained during maintenance due to budgetary constraints. In this case, the risk is that a node fails during maintenance and puts the system into a degraded (and vulnerable) state. A less technical risk might be that the business is located within a few hundred yards of a river and flooding could cause a complete loss of the primary data center. When risks are purposely not mitigated in the design, listing them shows that the architect thought through the scenario, but due to cost, complexity, or some other business justification, the choice has been made to accept the risk.

 

Assumptions – For lack of a better term, an assumption is a C.Y.A. statement. Listing assumptions in a design shows the customer/business that the architect has identified a certain component of the big picture that will come into play but is not specifically addressed in the design (or is not technical in nature). A fantastic example listed in the book is an assumption that DNS infrastructure is available and functioning. I’m not sure if you’ve tried to do a VMware deployment recently, but pretty much everything beyond ESXi will fail miserably if DNS isn’t properly functioning. Although a design may not include specifications for building a functioning DNS infrastructure, it will certainly be necessary for many deployments. Calling it out here ensures that it is taken care of in advance (or in the worst case, the architect doesn’t look like a goofball when it isn’t available during the install!).

 

If you work these four Considerations (and the 5 Characteristics I detailed in my previous post) into any design documentation you’re putting together, you’re sure to have a much more impressive design. Also, if you’re interested in working toward design-focused certifications, many of these topics will come into play. Specifically, if VMware certification is of interest to you, VCIX/VCDX work will absolutely involve learning these factors well. Good luck on your future designs!

Well, Britain has voted to leave the EU. I have no idea why, or what that means other than my family vacation to London next month just got a whole lot cheaper.

 

Anyway, here is this week's list of things I find amusing from around the Internet. Enjoy!

 

EU Proposal Seeks To Adjust To Robot Workforce

Maybe this is why the UK wanted to leave, because they don't want their robots to be seen as "electronic persons with specific rights and obligations."

 

Real-time dashboards considered harmful

This is what adatole and I were preaching about recently. To me, a dashboard should compel me to take action. Otherwise it is just noise.

 

Many UK voters didn’t understand Brexit, Google searches suggest

I won't pretend to know much about what it means, either. I'm hoping there will be a "#Brexit for Dummies" book available soon.

 

UK Must Comply With EU Privacy Law, Watchdog Argues

A nice example of how the world economy, and corporate business, is more global than people realize. Just because Britain wants to leave the EU doesn't mean they won't still be bound by EU rules should they wish to remain an economic partner.

 

Hacking Uber – Experts found dozen flaws in its services and app

Not sure anyone needed more reasons to distrust Uber, but here you go.

 

History and Statistics of Ransomware

Every time I read an article about ransomware I take a backup of all my files to an external drive because as a DBA I know my top priority is the ability to recover.

 

Blade Runner Futurism

If you are a fan of the movie, or sci-fi movies in general, set aside the time to read through this post. I like how the film producers tried to predict things like the cost of a phone call in the future.

 

Here's a nice reminder of the first step in fixing any issue:

 

IMG_3115.JPG

The Pareto Principle

 

The Pareto principle, also known as the 80-20 principle, says that 20% of the issues will cause you 80% of the headaches. This principle is also known as The Law of the Vital Few. In this post, I'll describe how the Pareto principle can guide your work to provide maximum benefit. I'll also describe a way to question the information at hand using a technique known as 5 Whys.

 

The 80-20 rule states that when you address the top 20% of your issues, you'll remove 80% of the pain. That is a bold statement. You need to judge its accuracy yourself, but I've found it to be uncannily accurate.

 

The implications of this principle can take a while to sink in. On the positive side, it means you can make a significant impact if you address the right problems. On the down side, if you randomly choose what issues to work on, it's quite likely you're working on a low-value problem.

 

Not quite enough time

 

When I first heard of the 80-20 rule I was bothered by another concern: What about the remaining problems? You should hold high standards and strive for a high-quality network, but maintaining the illusion of a perfect network is damaging. If you feel that you can address 100% of the issues, there's no real incentive to prioritize. I heard a great quote a few months back:

 

     "To achieve great things, two things are needed; a plan, and not quite enough time." - Leonard Bernstein

 

We all have too much to do, so why not focus our efforts on the issues that will produce the most value? This is where having Top-N reports from your management system is really helpful. Sometimes you need to see the full list of issues, but only occasionally. More often, this restricted view of the top issues is a great way to get started on your Pareto analysis.
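
To make that concrete, here's a minimal sketch in Python (the issue names and counts are made up for illustration, not pulled from any real report) showing the basic Pareto cut: sort by impact, then walk down the list until the running total crosses 80%.

# Toy Pareto/top-N analysis: sort issues by ticket count and report the
# cumulative percentage until we cover roughly 80% of the total pain.
issues = {
    "printer traffic saturating WAN": 412,
    "interface flaps on edge switch": 187,
    "DNS timeouts": 95,
    "misc one-off tickets": 48,
    "NTP drift alerts": 31,
}

total = sum(issues.values())
running = 0
for name, count in sorted(issues.items(), key=lambda kv: kv[1], reverse=True):
    running += count
    print(f"{name:35} {count:5d}  {100 * running / total:5.1f}% cumulative")
    if running / total >= 0.8:
        break  # the 'vital few' -- everything below this is the trivial many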

 

3G WAN and the 80-20 rule

 

A few years back, I was asked to design a solution for rapid deployment warehouses in remote locations. After an analysis of the options I ran a trial using a 3G-based WAN. We ran some controlled tests, cutting over traffic for 15 minutes, using some restrictive QoS policies. The first tests failed with a saturated downlink.

 

When I analyzed the top-talkers report for the site I saw something odd. It seemed that 80% of the traffic to the site was print traffic. It didn't make any sense to me, but the systems team verified that the shipping label printers use an 'inefficient' print driver.

 

At this point I could have ordered WAN optimizers to compress the files, but we did a 5 Whys analysis instead. Briefly, '5 Whys' is a problem solving technique that helps you identify the true root cause of issues.

 

  • Why is the bandwidth so high? - Printer traffic taking 80% of bandwidth
  • Why is printer traffic such a high percentage? - High volume of large transactions
  • Why is the file size so large? - Don't know - oh yeah we use PostScript (or something)
  • Why can't we use an alternative print format? - We can, let's do it, yay, it worked!
  • Why do we need to ask 5 whys? - We don't, you can stop when you solve the problem

 

The best form of WAN optimization is to suppress or redirect the demand.  We don't all have the luxury of a software engineer to modify their code and reduce bandwidth, but in this case it was the most elegant solution.  We were able to combine a trial, reporting, top-N, and deep analysis with a flexible team.  The result was a valuable trial and a great outcome.

 

Summary

 

Here's a quick summary of what I covered in this post:

 

  • The 80/20 principle can help you get real value from your efforts.
  • Top-N reports are a great starting point to help you find that top 20%.
  • The 5 Whys principle can help you dig deeper into your data and choose the most effective actions.

 

Of course a single example doesn't prove the rule.  Does this principle ring true for you, or perhaps you think it is nonsense? Let me know in the comments.

Let’s face it!  We live in a world now where we are seeing a heavy reliance on software instead of hardware.  With Software Defined Everything popping up all over the place, we are seeing traditional hardware-oriented tasks being built into software – this provides an extreme amount of flexibility and portability in how we choose to deploy and configure various pieces of our environments.

 

With this software management layer taking hold of our virtualized datacenters we are going through a phase where technologies such as private and hybrid cloud are now within our grasp.  As the cloud descends upon us there is one key player that we need to focus on – the automation and orchestration that quietly executes in the background, the key component to providing the flexibility, efficiency, and simplicity that we as sysadmins are expected to provide to our end users.

 

To help drive home the importance of, and our reliance on, automation, let’s take a look at a simple task – deploying a VM.  When we do this in the cloud, mainly public, it’s just a matter of swiping a credit card, providing some information regarding a name and network configuration, waiting a few minutes (or seconds), and away we go. Our end users can have a VM set up almost instantaneously!

 

The ease of use and efficiency of the public cloud, as in the above scenario, is putting extended pressure on IT within their respective organizations – we are now expected to create, deliver, and maintain these flexible, cloud-like services within our businesses, and do so with the same efficiency and simplicity that cloud brings to the table.  Virtualization certainly provides a decent starting point for this, but it is automation and orchestration that will take us to the finish line.

 

So how do we do it?

 

Within our enterprise I think we can all agree that we don’t simply just create a VM and call it “done”!  There are many other steps that come after we power up that new VM.  We have server naming to contend with, and networking configuration (IP, DNS, firewall, etc.).  We have monitoring solutions that need to be configured in order to properly monitor and respond to outages and issues that may pop up, and I’m pretty certain we will want to include our newly created VM within some sort of backup or replication job in order to protect it.  With more and more software vendors exposing public APIs, we are now living in a world where it’s possible to tie all of these different pieces of our datacenter together.
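
As a rough sketch of what that orchestration glue can look like (the step functions below are stand-ins for calls into your own hypervisor, IPAM, monitoring, and backup tooling, not any real vendor API), a provisioning workflow is really just an ordered list of post-build steps:

# A toy post-provisioning workflow. Each step is a stub standing in for a call to
# your own tooling's API; replace the print statements with real integrations.
def create_vm(req):         print(f"clone template {req['template']} -> {req['name']}")
def configure_network(req): print(f"reserve IP on VLAN {req['vlan']}, register {req['name']} in DNS")
def apply_firewall(req):    print(f"apply baseline firewall rules for role '{req['role']}'")
def add_monitoring(req):    print(f"assign monitoring templates for role '{req['role']}'")
def add_backup(req):        print(f"add {req['name']} to backup job '{req.get('backup', 'standard')}'")

WORKFLOW = [create_vm, configure_network, apply_firewall, add_monitoring, add_backup]

def provision(req):
    for step in WORKFLOW:
        step(req)  # in real life: call the API, check the result, handle failures/rollback

provision({"name": "app-web-01", "template": "win2016", "vlan": 20, "role": "web"})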

 

Automation and orchestration doesn’t stop at just creating VMs either – there’s room for it throughout the whole VM life cycle.  The concept of the self-healing datacenter comes to mind – having scripts and actions performed automatically by monitoring software in an effort to fix issues within your environment as they occur – this is all made possible by automation.

 

So with this I think we can all conclude that automation is a key player within our environments, but the question always remains – should I automate task X?  Meaning, will the time savings and benefits of creating the automation outweigh the effort and resources it will take to create the process?  So with all this in mind, I have a few questions: Do you use automation and orchestration within your environment?  If so, what tasks have you automated thus far?  Do you have a rule of thumb that dictates when you will automate a certain task?  Believe it or not, there are people in this world who are somewhat against automation, whether out of fear for their jobs or simply from not adapting – how do you help “push” these people down the path of automation?

Government information technology administrators long have been trained to keep an eye out for the threats that come from outside their firewalls. But what if the greatest threats actually come from within?

 

According to a federal cybersecurity survey we conducted last year, that is a question that many government IT managers struggle to answer. In fact, a majority of the 200 respondents said they believe malicious insider threats are just as damaging as malicious external threats.

 

The threat of a careless user storing sensitive data on a USB drive left on a desk can raise just as much of a red flag as an anonymous hacker. Technology, training and policies must be consistently deployed, and work together, to ensure locked-down security.

 

Technology

 

Manual network monitoring is no longer feasible, and respondents identified tools pertaining to identity and access management, intrusion prevention and detection, and security incident and event management or log management as “top tier” tools to prevent internal and external threats.

 

Each solution offers continuous and automatic network monitoring, and alerts. Problems can be traced to individual users and devices, helping identify the root cause of potential insider threats. Most importantly, administrators can address potential issues far more quickly.

 

However, tools are just that—tools. They need to be supported with proper procedures and trained professionals who understand the importance of security and maintaining constant vigilance. 

 

Training

 

According to the survey, 53 percent of respondents claim careless and untrained insiders are the largest threat at federal agencies, while 35 percent stated “lack of IT training” is a key barrier to insider threat detection. IT personnel should be trained on technology protocols and the latest government security initiatives and policies and receive frequent and in-depth information on agency-specific initiatives that could impact or change the way security is handled throughout the organization.

 

All employees should be aware of the dangers and costs of accidental misuse of agency information or rogue devices. Forty-seven percent of survey respondents stated employee or contractor computers were the most at-risk sources for data loss. Human error often can prove far more dangerous than explicit intent.

 

Policies

 

When it comes to accidental or careless insider threats, 56 percent of survey respondents were somewhat confident in their security policies, while only 31 percent were “very confident.” 

 

Agency security policies, combined with federal policies, serve as a security blueprint and are therefore extremely important. They should plainly outline the agency’s overall security approach and include specific details such as authorized users and use of acceptable devices.

 

As one of the survey respondents said: “Security is a challenge, and the enemy is increasingly sophisticated.” More and more, the enemy attacks from all fronts—externally and internally. Federal IT managers clearly need to be prepared to combat the threat using their own three-pronged attack of technology, training and policies.

 

Find the full article on Signal.

The short answer to the question in the title is NO, backup speed and restore speed are no longer related.  There are a number of reasons why this is the case.

 

Let's go back in time to understand the historical reasons behind this question.  Historically, backup was a batch process that was sent to a serial device.  Various factors led to the commonly used rule of thumb that restores took 50% to 100% longer than the full backup that created them.  This started with the fact that a restore began with reading the entire full backup, which at a minimum would take the same amount of time as creating the full backup.  Then, once that happened, multiple incremental backups had to be read, each of which added time to the restore due to the time involved in loading multiple tapes.  (It wasn't that long ago that all backups were to tape.)   Also, because backups were sent to tape, it was not possible to do the kind of parallel processing that today's restores are capable of.

 

The first reason why backup and restore speed are no longer related is actually negative.  Today's backups are typically sent to a device that uses deduplication.  While deduplication comes with a lot of benefits, it also can come with one challenge.  The "dedupe tax," as it's referred to, is the difference between a device's I/O speed with and without deduplication.  Depending on how dedupe is done, backup can be much faster than restore and vice versa.

 

The second -- and perhaps more important -- reason why backup and restore speed are unrelated is that backups and restores don't always use the same technology anymore.  Where historically both backups and restores were a batch process that simply copied everything from A to B, today's backups and restores can actually be very different from each other.  A restore may not even happen, for example.  If someone uses a CDP or near-CDP product, a "restore" may consist of pointing the production app to the backup version of that app until the production version of that app can be repaired.  Some backup software products also have the ability to do a "reverse restore" that identifies the blocks or files that have been corrupted and only transfers and overwrites those blocks or files.  That would also be significantly faster than a traditional restore.

 

One thing hasn't changed: the only way you can know the speed at which a restore will run is to test it.  Sometimes the more things change the more they stay the same.

It is a general rule to have one backup methodology or product if that is possible.  But it is also true that this is not always possible, or even advisable, in any given situation.

 

The proliferation of virtualization is a perfect example.  For too many years, the virtualization backup capabilities of the leading backup products could be described as anything but "leading."  This led to an entire sub-category of products designed specifically for backing up virtual systems.  Unfortunately, these same products have generally eschewed support for physical systems.  Since most customers have both virtual and physical servers, this ends up requiring them to purchase and manage multiple backup products.

 

Another reason for a customer purchasing multiple backup products may be they have an application that requires the capabilities of a CDP or near-CDP product.  These products can provide very low-impact backups and extremely fast "instant" restores, so many companies have started using them for their applications demanding tight RTOs and RPOs.  But they're not necessarily ready to replace all of the backup software they've already purchased in favor of this new way of doing backup.  This again leaves them with the requirement to manage multiple products.

 

There are many challenges with using multiple backup products, the first of which is that they all behave differently and should be configured differently.  One of the most common ways to make a backup system work poorly is to treat it like a different product.  TSM and NetBackup couldn't be more different from one another, but many people move from one of these to the other -- and still try to configure the new product like it is the old product.  The solution to this is simple: get training on the new product and consider hiring -- at least temporarily -- the services of a specialist in that product to make sure you are configuring it the way it likes to be configured.

 

Another challenge is that each product reports on how it is performing in different ways.  They may use different metrics, values, and terms.  They also use different delivery mechanisms.  One may use email, where another may use XML or HTML to report backup success or failure.  The key here is to use a third-party reporting system that can collect and analyze data from the various products and normalize it into a single report.
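
As a simple illustration of that normalization idea (the field names and sample records are invented, not taken from any particular backup product), the reporting layer's job is mostly translating each product's output into one common record format:

# Toy normalization of job results from two hypothetical backup products into one
# schema, so a single report can answer "what failed last night?" across both.
from datetime import datetime

def from_product_a(row):
    # "Product A" reports OK/FAILED strings and sizes in MB.
    return {"client": row["host"], "ok": row["result"] == "OK",
            "gb": row["size_mb"] / 1024, "finished": datetime.fromisoformat(row["end"])}

def from_product_b(row):
    # "Product B" reports numeric exit codes and sizes in bytes.
    return {"client": row["client_name"], "ok": row["exit_code"] == 0,
            "gb": row["bytes"] / 1024 ** 3, "finished": datetime.fromtimestamp(row["end_epoch"])}

jobs = [from_product_a({"host": "db01", "result": "OK", "size_mb": 51200, "end": "2016-06-24T02:10:00"}),
        from_product_b({"client_name": "web01", "exit_code": 1, "bytes": 7 * 1024 ** 3, "end_epoch": 1466733000})]

failed = [j["client"] for j in jobs if not j["ok"]]
print(f"{len(jobs)} jobs, {len(failed)} failed: {failed}")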

 

Avoid having multiple backup products when you can.  When you can't, employ significant amounts of training and look into a third-party reporting tool that can handle all of the products you are using.

This is one of the common questions IT pros face in their job every day. Whether yours is a large organization with different teams for network and application management, or a small IT department with one IT guy trying to figure out where the problem is, the challenge of identifying the root cause is the same. Though the network and servers are part of the same IT infrastructure, they are two very different entities when it comes to monitoring and management.

 

Take the example of end-users complaining of slow Web connectivity and performance. Where do you think the problem is?

  • Is there a problem on the in-house network?
  • Is it a fault on the ISP’s side?
  • What about the IIS server?
  • Is it the virtual environment hosting IIS?

 

There could be many more possibilities to investigate. Making a list of questions and manually checking each part of the infrastructure is definitely what you don’t want to do. Visibility into the network and the application side of the house is key. Integrated visibility of network and application performance is the best thing an IT admin could ask for in this scenario.

 

As you probably know, Network Performance Monitor (NPM) v12 is out with NetPath™ technology. The power of NPM can be enhanced when integrated with Server & Application Monitor (SAM). In the image below, I’m making a modest attempt to explain the value of end-to-end visibility that you get across the service delivery path, from inside your network to outside your network, from on-premises LAN across the WAN and cloud networks.

 

PROBLEM/USE CASE

Users in both the main and branch offices are reporting slow performance of intranet sites. What is causing the problem and why? Is it the network, or the systems infrastructure that needs troubleshooting, or both?

NPM+SAM Image.PNG

HOW THE ROOT CAUSE WAS FOUND USING NPM and SAM

  • Using hop-by-hop critical path analysis available with NetPath in NPM to isolate the problem in the ISP network.
  • Using application and server monitoring in SAM to pinpoint performance issue in the Web server environment.
  • The Quality of Experience (QoE) dashboard available in and common to both NPM and SAM analyzes network packets to provide insight into network and application response times for specific application traffic. This helps confirm the symptoms of the problem at both sides—network and the application.

 

BENEFITS USING NPM and SAM TOGETHER

  • One unified Web interface and NOC for monitoring and troubleshooting networks and applications infrastructure.
  • Eliminate finger-pointing between network and system teams. Find out exactly where the root cause is for faster troubleshooting.
  • NPM and SAM are built on the Orion® Platform, sharing common services such as centralized alerting, reporting, and management.
  • Easy to install, set up, and use. Automatically discover your environment and start monitoring.

 

If you already are a user of NPM and SAM, please comment below on how you have benefited from using them together.

Kansas City VMUG USERCON: KC VMUG USERCON.jpg
Carolina VMUG USERCON: Carolina VMUG USERCON.jpg

 

As I promised, I am posting the presentation from my speaking session at the two VMUG USERCONs - Carolina and Kansas City - in June. I am thankful to all of the VMUG community members who stopped by to converse with chrispaap and jhensle about SolarWinds and monitoring with discipline. I am so appreciative of the packed rooms of attendees at both events who decided to join me in my speaking session. I hope that I gave you something meaningful and valuable, especially since the community has given me so much.

 

Attached is my slide deck. Let me know what you think in the comment section below. Thanks.

Every day, help desk pros stay busy by tracking down tickets, organizing them, assigning resources, and updating statuses. Have you ever wondered if being so busy is a good thing? Are you doing the right things at the right time?

 

Today, the increasing adoption of evolving technologies, such as BYOD and enterprise mobility, and the growing mobile workforce, require help desk pros to be super-productive in delivering IT support anywhere, anytime. To meet today’s rapidly growing end-user needs, help desk pros need to save time by cutting down trivial tasks, such as organizing tickets, and spend more time resolving issues and delivering real value to customers.

 

A typical IT service request lifecycle looks like this:

 

Picture1.png

 

In general, help desk technicians spend more time in the first half of the lifecycle, when they should be focusing more on the latter half, which drives results and delivers value to customers. Here are a few simple tips you can follow to help save time in your daily help desk operations:

 

  1. Ticket funneling: Create a centralized help desk system that can let your users submit tickets easily via email or online, auto-populate information provided by users to help technicians determine the severity of the issue, and automatically alert users about the nature of the issue and estimated completion time.
  2. Ticket prioritization: Configure your help desk system to automatically categorize tickets based on their criticality, end-user priority, technical expertise, and more. This will help you instantly identify the nature of the issue, understand the business impact, set Service Level Agreements (SLAs), and prioritize tickets that need your time today.
  3. Ticket routing: End-users often blame help desk pros when their issues aren’t quickly resolved. But the fact is, one can’t expect a help desk admin to simultaneously fix a network issue, replace a faulty projector, and help with a password reset. Based on issue type and criticality, you need to assign tickets to technicians who have expertise in handling those specific issues. This can be achieved by setting up automated workflows in your help desk system that can help route trouble tickets and assign them to the right technician at the right time (see the routing-rule sketch after this list).
  4. Reduce time-to-resolution: Clearly, end-users want their issues resolved as soon as possible. To do this, the IT pro may need to access the end-user’s PC remotely, get more information from users, restart servers, etc. Ideally, your help desk should seamlessly integrate with remote support and communication and troubleshooting tools to help you get all the information you need quickly to resolve issues faster.
  5. Asset mapping: Gathering asset details, licensing information, data about the hardware and software installed on end-user computers, etc. is the most time-consuming task in help desk support. It is much easier to use a help desk system to automatically scan and discover installed IT assets, procure asset details, manage IT inventory, map assets with associated tickets, etc.
  6. Encourage self-service: The most effective way to resolve trivial issues is to help end-users learn how to resolve such things on their own. Minor issues, such as password resets, software updates, etc. can be fixed by end-users if proper guidance is provided. Shape your help desk as a self-service platform where users can find easy-to-fix solutions for common issues and resolve them without submitting help desk tickets.
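
Here's the routing-rule sketch promised in tip 3 (the categories, keywords, queue names, and priorities are made up for illustration, not tied to any specific help desk product); most automated routing boils down to a small table like this:

# Toy ticket-routing rules: match keywords in the ticket text, pick a queue and priority.
RULES = [
    ({"vpn", "wifi", "network", "switch"}, "network-team", "high"),
    ({"password", "reset", "locked"},      "service-desk", "low"),
    ({"projector", "monitor", "printer"},  "desktop-team", "medium"),
]

def route(ticket_text):
    words = set(ticket_text.lower().split())
    for keywords, queue, priority in RULES:
        if words & keywords:
            return queue, priority
    return "triage", "medium"  # nothing matched -- a human takes a look

print(route("Office wifi keeps dropping on the 3rd floor"))  # ('network-team', 'high')
print(route("I am locked out and need a password reset"))    # ('service-desk', 'low')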

 

By following these simple tips, you can save time and deliver more value to your end-users. If you want more information, check out this white paper that reviews major tasks performed by help desk analysts and IT support staff, and discusses how to simplify and automate those tasks.

 

How have you simplified your IT support tasks?

 

Share your help desk and remote support best practices so we can all benefit.

In the world of information technology, those of us who are tasked with providing and maintaining the networks, applications, storage, and services for the rest of the organization are increasingly under pressure to provide more accurate, or at least more granular, service level guarantees. The standard quality of service (QoS) mechanisms we have used in the past are becoming more and more inadequate to properly handle the disparate types of traffic we are seeing on the wire today. Continuing to successfully provide services in a guaranteed, deliberate, measurable, and ultimately very accurate manner is going to require different tools and additional focus on increasingly all-encompassing ecosystems. Simply put: our insular fiefdoms are not going to cut it in the future. So, what are we going to do about the problem? What can we do to increase our end-to-end visibility, tracking, and service level guarantees?

 

One of the first things we ought to do is make certain that we have, at the very least, implemented some baseline quality of service policies. Things like separating video, voice, regular data, high priority data, control plane data, etc., seem like the kind of thing that should be a given, but every day I am surprised by another network that has very poorly deployed what QoS they do have. Often I see video and voice in the same class, and I see no class for control plane traffic; my guess is no policing either, but that is another topic for another day. If we cannot succeed at the basics, we most certainly should not be attempting anything more grandiose until we can fix the problems of today.

 

I have written repeatedly on the need to break down silos in IT, to get away from the artificial construct that says one group of people only controls one area of the network and has only limited interaction with other teams. Many times, as a matter of fact, I see such deep and ingrained silos that the different departments do not actually converge, from a leadership perspective, until the CIO. This unnecessarily obfuscates the full network picture from pretty much everyone. Server teams know what they have and control, storage teams are the same, and on down the line it goes, with nobody really having an overall picture of things until you get far enough into the upper management layer that the fixes become political, and die by the proverbial committee.

 

In order to truly succeed at providing visibility to the network, we need to merge the traditional tools, services, and methodologies we have always used, with the knowledge and tools from other teams. Things like application visibility, hooks into virtualized servers, storage monitoring, wireless and security and everything in between need to be viewed as one cohesive structure on which service guarantees may be given. We need to stop looking at individual pieces, applying policy in a vacuum, and calling it good. When we do this it is most certainly not good or good enough.

 

We really don’t need QoS, we need full application visibility from start to finish. Do we care about the plumbing systems we use day to day? Not really, we assume they work effectively and we do not spend a lot of time contemplating the mechanisms and methodologies of said plumbing. In the same way, nobody for whom the network is merely a transport service cares about how things happen in the inner workings of that system, they just want it to work. The core function of the network is to provide a service to a user. That service needs to work all of the time, and it needs to work as quickly as it is designed to work. It does not matter to a user who is to blame when their particular application quits working, slows down, or otherwise exhibits unpleasant and undesired tendencies, they just know that somewhere in IT, someone has fallen down on the job and abdicated one of their core responsibilities: making things work.

 

I would suggest that one of the things we should certainly be implementing, looking at, etc., is a monitoring solution that can not only tell us what the heck the network routers, switches, firewalls, etc., are doing at any given time, but one in which applications, their use of storage, their underlying hardware—virtual, bare metal, containers—and their performance are measured as well. Yes, I want to know what the core movers and shakers of the underlying transport infrastructure are doing, but I also want visibility into how my applications are moving over that structure, and how that data becomes quantifiable as relates to the end user experience.

 

If we can get to a place where this is the normal state of affairs rather than the exception, using an application framework bringing everything together, we’ll be one step closer to knowing what the heck else to fix in order to support our user base. You can’t fix what you don’t know is a problem, and if all groups are in silos, monitoring nothing but their fiefdoms, there really is not an effective way to design a holistic, network-wide solution to the quality of service challenges we face day to day. We will simply do what we have always done and deploy small solutions, in a small way, to larger problems, then spend most of our time tossing crap over the fence to another group with an “it’s not the network” thrown in as well. It’s not my fault, it must be yours. And at the end of the day, the users just want to know why the plumbing isn’t working and the toilets are all backed up.

DODDC_logo.png

The SolarWinds booth at DevOpsDays DC represents SolarWinds' third appearance at the event (after Columbus and Austin (https://thwack.solarwinds.com/community/solarwinds-community/geek-speak_tht/blog/2016/05/23/devopsdays-daze)). I could play up the cliche and say that the third time was the charm, but the reality is that those of us who attended - myself, Connie (https://thwack.solarwinds.com/people/ding), Patrick (https://thwack.solarwinds.com/people/patrick.hubbard), and Andy Wong - were charmed from the moment we set foot in the respective venues.

 

While Kong (https://thwack.solarwinds.com/people/kong.yang) and Tom (https://thwack.solarwinds.com/people/sqlrockstar) - my Head-Geeks-In-Arms - are used to more intimate gatherings, like VMUGs and SQL Saturdays, I'm used to the big shows: CiscoLive, InterOp, Ignite, VMWorld, and the like. DevOps Days is a completely different animal, and here's what I learned:

 

 

Focus

The people coming to DevOpsDays are focused. As much as I love to wax philosophical about all things IT, and especially about all things monitoring, the people who I spoke with wanted to stay on topic. That meant cloud, continuous delivery, containers, and the like. While it might have been a challenge for an attention-deficit Chatty Kathy like me, it was also refreshing.

 

 

There was also focus of purpose. DevOpsDays is a place where attendees come to learn, not to be marketed to (or worse, AT). So there are no scanners, no QR codes on the badge, nothing. People who come to DevOpsDays can't be guilted or enticed into giving vendors their info unless they REALLY mean it, and then it's only the info THEY want to give. Again, challenging, but also refreshing.

 

 

Conversations

That focus reaps rewards in the form of real conversations. We had very few drive-by visitors. People who approached the table were genuinely interested in hearing what SolarWinds was all about. They might not personally be using our software (although many were), but they were part of teams and organizations that had use for monitoring. More than once, someone backed away from the booth, saying, "Hang on. I gotta see if my coworkers know about this."

 

 

The conversations were very much a dialogue, as opposed to a monologue. Gone was the typical trade show 10-second elevator pitch. We got to ask questions and hear real details about people's environments, situations, and challenges. That gave us the opportunity to make suggestions, recommendations, or just commiserate.

 

Which meant I had a chance to really think about...

 

 

The SolarWinds (DevOps) Story

"So how exactly does SolarWinds fit into DevOps?" This was a common question, not to mention a perfectly valid one given the context. My first reaction was to talk about the Orion SDK  and how SolarWinds can be leveraged to do all the things developers don't really want to recreate when it comes to monitoring-type activities. Things like:

 

  • A job scheduler to perform actions based on date or periodicity.
  • A built-in account db that hands out username/password combinations without exposing them to the user.
  • The ability to push code to remote systems, execute it, and pull back the result or return code.
  • Respond with an automatic action when that result or return code is not what was expected.

 

But as we spoke to people and understood their needs, some other stories emerged:

 

  • Using the Orion SDK to automatically add a system that was provisioned by Chef, Jenkins, or similar tools into monitoring (see the sketch after this list).
  • Perform a hardware scan of that system to collect relevant asset and hardware inventory information.
  • Feed that information into a CMDB for ongoing tracking.
  • Scan that system for known software.
  • Automatically apply monitoring templates based on the software scan.
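
Here's a rough sketch of that first story using the open source orionsdk Python client (the server name, credentials, and node properties are placeholders, and the property list is trimmed for readability; check the SDK samples for the full set your environment needs):

# Minimal sketch: a provisioning pipeline step that registers a freshly built
# server in Orion as a simple ICMP-monitored node via the SolarWinds SDK.
import requests
from orionsdk import SwisClient

requests.packages.urllib3.disable_warnings()  # lab convenience for self-signed certs

swis = SwisClient("orion.example.com", "svc_provisioning", "changeme")

node_uri = swis.create(
    "Orion.Nodes",
    IPAddress="10.0.0.42",   # these values would come from the Chef/Jenkins job
    Caption="app-web-01",
    EngineID=1,
    ObjectSubType="ICMP",
)
print("Added to monitoring:", node_uri)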

 

 

This is part of a continuous delivery model that I hadn't considered until digging into the DevOpsDays scene, and I'm really glad I did.

 

 

Attending the conferences and hearing the talks, I also believe strongly that traditional monitoring - fault, capacity, and performance - along with alerting and automation, are still parts of the culture that DevOps advocates and practitioners don't hear about often enough. And I'm submitting CFP after CFP until I have a chance to tell that story.

 

 

Is SolarWinds a hardcore DevOps tool? Of course not. If anything, it's a hardcore supporter in the "ops" side of the DevOps arena. Even so, SolarWinds tools have a valid, rightful place in the equation, and we're committed to being there for our customers. "There" in terms of our features, and "there" in terms of our presence at these conferences.

 

 

So come find us. Tell us your stories. We can't wait to see you there!

Happy Father's Day! Okay, that was last Sunday, but I wanted to mention it, anyway. I hope everyone enjoyed the day. And it gives me a chance to remind you about this video our team made.

 

Anyway, here is this week's list of things I find amusing from around the Internet. Enjoy!

 

Courageous robot escapes oppressors, runs out of battery in the middle of the road

Yep. Not long before they take up arms against humanity. The singularity is fast approaching.

 

Student hacks Pentagon websites and gets thanked

I saw a movie about this years ago, the boy was rewarded with a trip to NORAD and got a chance to meet Dabney Coleman.

 

Bots -- Harmful Or Helpful?

I'm going to go out on a limb and guess that the article will suggest the answer is both.

 

Most Cyberattacks Are An Inside Job

Yes, and mostly through social engineering, an issue that doesn't get nearly enough focus.

 

Google Co-founder Larry Page Behind Two Flying Car Startups

It's 2016 and I still don't have the flying car I was promised, but it looks like that day is getting closer.

 

NFL Running Back, Darren McFadden, Sues Former Business Manager After Bitcoin Investment

Wow, this guy is going to ruin the reputation of the real Michael Vick.

 

The tongue-in-cheek way the women of Google are responding to a shareholder’s sexist comment

Brilliant.

 

I didn't get a tie for Father's Day, instead my family got me a brand new DBA shirt.

 

IMG_3097.JPG

After sharing war stories about passwords, we’re going to take a look at another important part of your internal IT security: identity management. This is the process of maintaining your user database, including who is added to your corporate directory, what happens if they change their name or change their role, how you handle their account if they go on extended leave, and what you do when they leave your organization. There’s a good chance that your organization has some pretty good policies and procedures around all of this already (or should have), and will rely on some input from your HR department.

 

Access to SaaS applications in the cloud also requires the establishment and management of corporate identities, so this is a really good thing to have sorted out BEFORE you create your first lot of cloud accounts. Note: my options are mostly Microsoft-centric because that’s what I know. Feel free to leave a comment and tell me what else works.

 

The worst case is that your SaaS application has a user directory that is completely isolated from your organization’s. And while there are some cases where that separation might be beneficial, it means that you are going to have to run TWO processes and change things in two systems (in your corporate directory and in the cloud) when identities change. It’s possible, but also annoying. It means your users will have two logins and two passwords to maintain. Last time in the comments, we touched on the pain of different password lengths/qualities and expirations across different systems. The other problem is the risk of things getting out of sync. If your new process is not followed to the letter, you could end up with an ex-employee whose corporate account is disabled but who still has access to your corporate information in the SaaS application. I hope they left on good terms.

 

At the other end of the scale, we have directory integration. In the Microsoft world, that’s either Federation or Directory Synchronization. The concept of Federation is pretty cool. My favorite analogy is a theme park pass. With Federation, the San Diego Zoo AND Knott’s Berry Farm will both let you in with a SoCal Theme Park Pass ticket, even though that ticket wasn’t issued by them. You can continue to do your own identity management internally, and when you suspend an account, it’s not getting access to your SaaS application. Your users enjoy single sign on, passwords never leave your organization, and multi-factor authentication is supported. Azure Active Directory even talks to 3rd-party identity providers like PingFederate and Okta.

 

The gotcha with Federation is that it requires some resilient infrastructure. If your ADFS server is unavailable, people can’t authenticate. For this reason, it’s generally discouraged for smaller businesses.

 

Directory Synchronization is another option. This connector manages updates between your on-premises Active Directory and Azure Active Directory, and also lets you filter which internal accounts sync up to the cloud.
You can then use Azure Active Directory Premium to provide single sign-on to many compatible SaaS applications. You can also hide the password to those systems, so, for example, your marketing team can access your corporate social media account and never know the password. In that case, if they leave, they can’t log in to that generic account, even if you haven’t changed its password yet, because they never knew it.

 

See Simon May’s extensive list of resources for Active Directory Federation Services (ADFS) and Azure Active Directory Sync (DirSync).

 

Outside of the Microsoft world, maybe you’ll take a look at one of the many Identity-as-a-Service players. If you’re interested, Gartner even has a magic quadrant for it. My favorite has to be OneLogin for its ease of use and powerful features.

 

Of course, all of this is useless if the SaaS application you are considering doesn’t support any kind of directory integration. Then you’re back to that manual process. But better to find out during your discovery and pilot process as opposed to after you’ve been asked to provision 300 users.

 

Share your thoughts on the following: Is identity management a show stopper for SaaS adoption? Is it easy with your current infrastructure? Or do you shudder just thinking about it?

 

-SCuffy

Within the government, particularly the U.S. Defense Department, video traffic—more specifically videoconference calling—often is considered mission critical.

 

The Defense Department uses video for a broad range of communications. One of the most critical uses is video teleconference (VTC) inside and outside of the United States and across multiple areas of responsibility. Daily briefings—via VTC over an Internet protocol (IP) connection—help keep everyone working in sync to accomplish the mission. So, you can see why it is so important for the network to be configured and monitored to ensure that VTCs operate effectively.

 

VTC and network administration tasks boil down to a few key points:

 

  • Ensuring the VTC system is up and operational (monitoring).
  • Setting up the connections to other endpoints (monitoring).
  • Ensuring that the VTC connection operates consistently during the call (quality of service).
  • Troubleshooting at the VTC system level (VTC administration); after the connection to the network is made, the network administrator takes over to ensure that the connection stays alive (monitoring/configuration).

 

Ensuring Quality of Service for Video over IP

 

The DOD has developed ways to ensure successful live-traffic streaming over an IP connection. These requirements focus on ensuring that video streaming has the low latency and high throughput needed among all endpoints of a VTC. Configuring the network to support effective VTCs is challenging, but it is done through implementing quality of service (QoS).

 

You can follow these four steps:

 

Step 1: Establish priorities. VTC traffic will need high priority. Email would likely have the lowest priority, while streaming video (vs. VTC) will likely have a low priority as well.

 

Step 2: Test your settings. Have you set up your QoS settings so that VTC traffic has the highest priority?

 

Step 3: Implement your settings. Consider an automated configuration management tool to speed the process and eliminate errors.

 

Step 4: Monitor your network. Once everything is in place, monitor to make sure policies are being enforced as planned and learn about network traffic.

 

Configuring and Monitoring the Network

 

Network configuration is no small task. Initial configuration and subsequent configuration management ensure routers are configured properly, traffic is prioritized as planned, and video traffic is flowing smoothly.

 

Network configuration management software that automates the configuration tasks of implementing complex QoS settings can be useful, and should support the automation of:

 

  1. Pushing out QoS settings to the routers – QoS settings are fairly complex to implement. It is important that implementation of settings is not done manually, due to the risk of errors.
  2. Validating that the changes have been made correctly – After the settings are implemented on a router, it is important to back up and verify the configuration settings.
  3. Configuration change notification.

 

Network monitoring tools help validate critical network information, and should provide you with the following information:

 

  1. When and where is my network infrastructure busy?
  2. Who is using the network at those hot spots and for what purpose?
  3. When is the router dropping traffic, and what types of packets are being dropped?
  4. Are the VTC systems on your side and the far side of the call up and operational?
  5. Do node and interface baselines identify abnormal spikes during the day?

 

What are your best practices for ensuring video traffic gets through? Do you have any advice you can share?

 

Find the full article on Signal.

In the first post of this series we took a look at the problems that current generation WANs don’t have great answers for.  In the second post of the series we looked at how SD-WAN is looking to solve some of the problems and add efficiencies to your WAN.

 

If you haven’t had a chance to do so already, I would recommend starting with the linked posts above before moving on to the content below.

 

In this third and final post of the series we are going to take a look at what pitfalls an SD-WAN implementation might introduce and what are some items you should be considering if you’re looking to implement SD-WAN in your networks.

 

Proprietary Technology

 

We've grown accustomed to having the ability to deploy openly developed protocols in our networks, and SD-WAN takes a step backwards when it comes to openness.  Every vendor currently in the market has a significant level of lock-in when it comes to their technology.  There is no interoperability between SD-WAN vendors and nothing on the horizon that suggests this fact will change.  If you commit to Company X's solution, you will need to implement the Company X product in every one of your offices if you want it to have SD-WAN level features available.  Essentially we are trading one type of lock-in (service-provider-run MPLS networks or private links) for another (SD-WAN overlay provider). You will need to make a decision about which lock-in is more limiting to your business and your budget.  Which lock-in is more difficult to replace, the MPLS underlay or the proprietary overlay?

 

Cost Savings

 

The cost savings argument is predicated on the idea that you will be willing to drop your expensive SLA-backed circuits and replace them with generic Internet bandwidth.  What happens if you are unwilling to drop the SLA? Well, the product isn't likely to come out as a cost savings at all.  There is no doubt that you will have access to features that you don't have now, but your organization will need to evaluate whether those features are worth the cost and lock-in that implementing SD-WAN incurs.

 

Vendor Survivability

 

We are approaching (and might be over at this point) 20 vendors claiming to provide SD-WAN solutions. There is no question that it is one of the hottest networking trends at the moment and many vendors are looking to monopolize.  Where will they be in a year? Five years? Will this fancy new solution that you implemented be bought out by a competitor, only to be discarded a year or two down the line?  How do you pick winners and losers in a highly contested market like the SD-WAN market currently is?  I can't guarantee an answer here, but there are some clear leaders in the space and a handful of companies that haven't fully committed to the vision.  If you are going to move forward with an SD-WAN deployment, you will need to factor in the organizational viability of the options you are considering.  Unfortunately, not every technical decision gets to be made on the merit of the technical solution alone.

 

Scare Factor

 

SD-WAN is a brave new world with a lot of concepts that network engineering tradition tells us to be cautious of.  Full automation and traffic re-routing have not been seamlessly implemented in previous iterations.  Controller based networks are a brand new concept on the wired side of the network. It's prudent for network engineers to take a hard look at the claims and verify the questionable ones before going all in.  SD-WAN vendors by and large seem willing to provide proof of concept and technical labs to convince you of their claims.  Take advantage of these programs and put the tech through its paces before committing to an SD-WAN strategy.

 

It's New

 

Ultimately, it's a new approach and nobody likes to play the role of guinea pig.  The feature set is constantly evolving and improving.  What you rely on today as a technical solution may not be available in future iterations of the product.  The tools you have to solve a problem a couple of months from now, may be wildly different than the tools you currently use.  These deployments also aren't as well tested as our traditional routing protocols.  There is a lot about SD-WAN that is new and needs to be proven.  Your tolerance for the risks of running new technology has to be taken into account when considering an SD-WAN deployment.

 

Final Thoughts

 

It’s undeniable that there are problems in our current generation of networks that traditional routing protocols haven’t effectively solved for us.  The shift from a localized perspective on decision making to a controller based network design is significant enough to be able to solve some of these long standing and nagging issues.  While the market is new, and a bit unpredictable, there is little doubt that controller based networking is the direction things are moving both in the data center and the WAN.  Also, if you look closely enough, you’ll find that these technologies don’t differ wildly from the controller based wireless networks many organizations have been running for years.  Because of this I think it makes a lot of sense to pay close attention to what is happening in the SD-WAN space and consider what positive or negative impacts an implementation could bring to your organization.

deadlock.png

 

First, a quick definition and example for those that don’t know what deadlocks are inside of a database.

 

A deadlock happens when two (or more) transactions block each other by holding locks on resources that each of the transactions also need.

 

For example:

 

Transaction 1 holds a lock on Table A.

Transaction 2 holds a lock on Table B.

Transaction 1 now requests a lock on Table B, and is blocked by Transaction 2.

Transaction 2 now requests a lock on Table A, and is blocked by Transaction 1.

 

Transaction 1 cannot complete until Transaction 2 is complete, and Transaction 2 cannot complete until Transaction 1 is complete. This is a cyclical dependency and results in what is called a deadlock. Deadlocks can involve more than two transactions, but two is the most common scenario.
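
To make that cycle concrete, here's a minimal sketch that reproduces it from Python (assumptions: the pyodbc package, a reachable SQL Server instance, and two pre-existing tables TableA and TableB that each have Id and Val columns; none of that is part of the original example):

# Two sessions each lock one table, pause, then ask for the other table's lock.
# SQL Server detects the cycle and kills one session as the deadlock victim (error 1205).
import threading, time
import pyodbc

CONN = ("DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=Sandbox;Trusted_Connection=yes")

def session(first, second):
    cn = pyodbc.connect(CONN, autocommit=False)
    cur = cn.cursor()
    try:
        cur.execute(f"UPDATE {first} SET Val = Val + 1 WHERE Id = 1")   # lock the first table
        time.sleep(2)                                                   # hold it while the other session locks its table
        cur.execute(f"UPDATE {second} SET Val = Val + 1 WHERE Id = 1")  # now wait on the other session's lock
        cn.commit()
    except pyodbc.Error as err:
        print("chosen as the deadlock victim:", err)
        cn.rollback()

t1 = threading.Thread(target=session, args=("TableA", "TableB"))
t2 = threading.Thread(target=session, args=("TableB", "TableA"))
t1.start(); t2.start(); t1.join(); t2.join()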

 

If you scrub the intertubz for deadlock information you will find a common theme. Most people will write that deadlocks cannot be avoided in a multi-user database. They will also write about how you need to keep your transactions short, and to some that means having your stats and indexes up to date, rather than a good discussion over what a normalized database would look like.

 

(And before I go any further, let me offer you some advice. If you believe that constantly updating your stats is a way to prevent deadlocks in SQL Server, then you should find a new line of work. Actually, stay right where you are. That way people like me will continue to have jobs, cleaning up behind people such as yourself. Thanks.)

 

What causes deadlocks?

The database engine does not seize up and start deadlocking transactions because it happens to be tired that day. Certain conditions must exist in order for a deadlock to happen, and all of those conditions require someone, somewhere, to be using the database.

 

Deadlocks are the result of application code combined with a database schema that results in an access pattern that leads to a cyclical dependency.

 

That’s right. I said it. Application code causes deadlocks.

Therefore it is up to the database administrator to work together with the application developer to resolve deadlocks.

 

Another thing worth noting here is that deadlocking is not the same as blocking. I think that point is often overlooked. I have had several people explain that their database is suffering blocking all the time. When I try to explain that a certain amount of blocking is to be expected, I am usually met with, "Yeah, yeah, whatever. Can you just update the stats and rebuild my indexes so it all goes away?”

 

A better response would be, "Yeah, I know I need to look at my design, but can you rebuild the indexes for me right now, and see if that will help for the time being?" My answer would be, “Yes. Can I help you with your design?"

 

Oh, and you do not need large tables with indexes to facilitate a deadlock. Blocking and deadlocks can happen on small tables, as well. It really is a matter of application code, design, access patterns, and transaction isolation levels.

 

Look, no one likes to be told they built something horrible. And chances are when it was built it worked just fine, but as the data changes, so could the need for an updated design. So if you are a database developer, do not be offended if someone says, "We need to examine the design, the data, and the code.” It is just a simple fact that things change over time.

 

Finding deadlocks

Here is a link to Bart Duncan's blog series that helps to explain deadlocking as well as the use of trace flag T1222. If you are experiencing deadlocks and want to turn this on now, simply issue the following statement:

 

DBCC TRACEON (1222, -1)

 

The flag will be enabled and will start logging detailed deadlock information to the error log. The details from this trace flag are much easier to understand than the original Klingon returned by T1204. Unfortunately, a trace flag enabled with DBCC TRACEON will be lost after the next service restart. If you want the trace flag to be enabled and always running against your instance, you need to add -T1222 as a startup parameter to your instance.
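If you go the DBCC route, it is worth verifying that the flag actually took effect. A quick check (just a habit of mine, not a requirement) looks like this:

-- Confirm the trace flag is enabled globally (look for Status = 1 and Global = 1)
DBCC TRACESTATUS (1222, -1);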

 

Another method for seeing detailed deadlock information is to query the default Extended Event system health session. You can use the following code to examine deadlock details:

 

SELECT XEvent.query('(event/data/value/deadlock)[1]') AS DeadlockGraph
FROM ( SELECT XEvent.query('.') AS XEvent
       FROM ( SELECT CAST(target_data AS XML) AS TargetData
              FROM sys.dm_xe_session_targets st
              JOIN sys.dm_xe_sessions s
                ON s.address = st.event_session_address
              WHERE s.name = 'system_health'
                AND st.target_name = 'ring_buffer'
            ) AS Data
       CROSS APPLY TargetData.nodes
            ('RingBufferTarget/event[@name="xml_deadlock_report"]')
            AS XEventData ( XEvent )
     ) AS src;

 

There are additional ways to discover if deadlocks are happening, such as using SQL Server Profiler (or a server trace), as well as Performance Monitor (i.e., Perfmon) counters. Each of the methods above will be reset upon a server restart, so you will need to manually capture the deadlock details for historical purposes, if desired.
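If you do want a history that survives restarts, one simple approach is to dump the output of the system_health query above into a table on a schedule. This is just a sketch; the table name dbo.DeadlockHistory is hypothetical, and you would want an Agent job or similar to run the insert periodically.

-- Hypothetical table to hold captured deadlock graphs
CREATE TABLE dbo.DeadlockHistory
(
    CaptureDate   DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
    DeadlockGraph XML NOT NULL
);

-- Insert the deadlock graphs currently held in the system_health ring buffer
INSERT INTO dbo.DeadlockHistory (DeadlockGraph)
SELECT XEvent.query('(event/data/value/deadlock)[1]')
FROM ( SELECT XEvent.query('.') AS XEvent
       FROM ( SELECT CAST(target_data AS XML) AS TargetData
              FROM sys.dm_xe_session_targets st
              JOIN sys.dm_xe_sessions s
                ON s.address = st.event_session_address
              WHERE s.name = 'system_health'
                AND st.target_name = 'ring_buffer'
            ) AS Data
       CROSS APPLY TargetData.nodes
            ('RingBufferTarget/event[@name="xml_deadlock_report"]')
            AS XEventData ( XEvent )
     ) AS src;

Keep in mind that successive runs will re-insert graphs that are still in the ring buffer, so de-duplicate or filter by date if that matters for your reporting.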

 

Resolving deadlocks

Resolving a deadlock requires an understanding of why the deadlocks are happening in the first place. Even if you know a deadlock has happened, and you are looking at the deadlock details, you need to have an idea about what steps are possible.

 

I’ve collected a handful of tips and tricks over the years to minimize the chances that deadlocks happen. Always consult with the application team before making any of these changes.

 

  1. Using a covering index can reduce the chance of a deadlock caused by bookmark lookups.
  2. Creating indexes that match your foreign key columns can reduce your chances of having deadlocks caused by cascading referential integrity.
  3. When writing code, it is useful to keep transactions as short as possible and access objects in the same logical order when it makes sense to do so.
  4. Consider using one of the row-versioning based isolation levels, READ COMMITTED SNAPSHOT or SNAPSHOT.
  5. The DEADLOCK_PRIORITY session setting specifies the relative importance of the current session if it becomes deadlocked with another session; the engine picks the lower-priority session as the victim.
  6. You can trap for the deadlock error number (1205) using TRY…CATCH logic and then retry the transaction (see the sketch after this list).
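To make tips 5 and 6 concrete, here is a rough sketch of what that session-level code might look like. The table dbo.Orders, the retry count, and the delay are all hypothetical; tune them for your own workload.

-- Prefer to be chosen as the victim if this session deadlocks with a more important one
SET DEADLOCK_PRIORITY LOW;

DECLARE @retries INT = 0;

WHILE @retries < 3
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderId = 42;  -- hypothetical work
        COMMIT TRANSACTION;
        BREAK;  -- success, leave the retry loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

        IF ERROR_NUMBER() <> 1205
            THROW;  -- not a deadlock: re-raise to the caller

        -- We were the deadlock victim (error 1205): back off briefly and retry
        SET @retries += 1;
        WAITFOR DELAY '00:00:01';
    END CATCH
END;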

 

Summary

The impact of a deadlock on end-users is a mixture of confusion and frustration. Retry logic is helpful, but having to retry a transaction simply results in longer end-user response times. This leads to the database being seen as a performance bottleneck, and pressures the DBA and application teams to track down the root cause and fix the issue.

 

As always, I hope this information helps you when you encounter deadlocks in your shop.

SolarWinds Network Performance Monitor 12 (NPM12) launched last week to great fanfare. Over the past few weeks I've written quite a bit about the problems that network professionals face, and NPM12 solves quite a few of these issues nicely. But what I wanted to talk about today is how NPM12 solves an even more common problem: arguments.

Not my fault

How many times have you called a support line for your service provider to report an issue? Odds are good it's more than once. As proficient professionals, we usually do a fair amount of troubleshooting ahead of time. We try to figure out what went wrong and where it is so it can be fixed. After troubleshooting our way to a roadblock, we often find out the solution lies somewhere outside of our control. That's when we have to interact with support personnel. Yet, when we call them to get a solution to the problem, the quick answer is far too frequently, "It's not our fault. It has to be on your end."

Service providers don't like being the problem. Their resources aren't dedicated to fixing issues, so it's much easier to start the conversation with a defensive statement that forces the professional on the other end to show their work. This often gives the service provider time to try to track down the problem and fix it. Wouldn't it be great to skip past all this mess?

That's why I think NPM12 and NetPath have the ability to give you the information you need before you even pick up the phone. Imagine being able to start a support call with, "We're seeing issues on our end, and it looks like the AWS-facing interface on R1 has high latency." That kind of statement immediately points the service provider in the right direction instead of putting you on the defensive.

Partly cloudy

This becomes even more important as our compute resources shift to the cloud. Using on-premises tools to fix problems inside your data center is an easy task. But as soon as the majority of our network traffic is outbound to the public cloud we lose the ability to solve problems in the same way.

NetPath collects information to give you the power to diagnose issues outside your direct control. You can start helping every part of your network get better and faster and keep your users happy as they find their workloads moving to Amazon, Microsoft Azure, and others. Being able to ensure that the network will work the same way no matter where the packets are headed puts everyone at ease.

Just as the cloud helps developers and systems administrators understand their environments and gives them the data they need to be more productive and efficient, so too can NPM12 and NetPath give you the tools that you need to ensure that the path between your data center and the cloud is problem free and as open as the clear blue sky.

The end result of NPM12 and NetPath is that the information it can provide to you stops the arguments before they start. You can be a hero for not only your organization but for other organizations as well. Who knows? After installing NPM12, maybe other companies will start calling you for support?

The pool cover came off last week, which means my kids will want to go swimming immediately even though the water is 62F degrees. Ah, to be young and foolish again. Each time they jump in it reminds me of that time I told my boss "Sure, I'd love to be the DBA despite those other guys quitting just now."

 

Anyway, here is this week's list of things I find amusing from around the internet, enjoy!

 

Microsoft to Acquire LinkedIn for $26.2 Billion

Looks like Satya found a way to get LinkedIn to stop sending him spam emails.

 

Man dressed as Apple Genius steals $16,000 in iPhones from Apple Store

I know it isn't fair to compare retail theft to Apple security practices in general...but...yeah.

 

Twitter Hack Reminds Us Even Two-Factor Isn’t Enough

I've seen similar stories in the past but it would seem this type of attack is becoming more common. Maybe the mobile carriers can find a way to avoid social engineering attacks. If not, then this is going to get worse before it gets better.

 

This Year’s Cubs Might Be Better Than The Incredible ’27 Yankees

Or, they might not. Tough to tell right now, but since they are the Cubs I think I have an idea how this will end.

 

Identifying and quantifying architectural debt

There is a lot to digest in this one, so set aside some time and review. I promise it will help you when you sit in on design meetings, as you will be armed with information about the true cost of projects and designs.

 

Cold Storage in the Cloud: Comparing AWS, Google, Microsoft

Nice high-level summary of the three major providers of cold storage options. I think this is an area of importance in coming years, and the provider that starts making advances in functionality and usability will be set to corner a very large market.

 

The Twilight of Traceroute

As a database professional, I've used traceroute a handful of times. It's technology that is 30 years old and in need of a makeover.

 

Warm weather is here and I suddenly find myself thinking about long summer days at the beach with my friends and family.

 


The federal technology landscape has moved from secure desktops and desk phones to the more sprawling environment of smartphones, tablets, personal computers, USB drives and more. The resulting “device creep” can often make it easier for employees to get work done – but it can also increase the potential for security breaches.

 

Almost half of the federal IT professionals who responded to our cyber survey last year indicated that the data that is most at risk resides on employee or contractor personal computers, followed closely by removable storage tools and government-owned mobile devices.

 

Here are three things federal IT managers can do to mitigate risks posed by these myriad devices:

 

1. Develop a suspicious device watch list.

 

As a federal IT manager, you know which devices are authorized on your network – but, more importantly, you also know which devices are not. Consider developing a list of unapproved devices and have your network monitoring software automatically send alerts when one of them attempts to access the network.

 

2. Ban USB drives.

 

The best bet is to ban USB drives completely, but if you’re not willing to go that far, invest in a USB defender tool. A USB defender tool, in combination with a security information and event management (SIEM) tool, will allow you to correlate USB events with other potential system usage and/or access violations to alert on malicious insiders.

 

USB events can also be matched to network logs, which helps connect malicious activities with a specific USB drive and its user. These tools can also completely block USB use and user accounts if necessary. This type of tool is a very important component in protecting against USB-related issues.

 

3. Deploy a secure managed file transfer (MFT) system.

 

Secure managed file transfer systems can meet your remote storage needs with less risk.

 

File Transfer Protocol (FTP) used to get a bad rap as being insecure, but that’s not necessarily the case. Implementing an MFT system can add a high level of security around FTP, while still allowing employees to access files wherever they may be and from any government-approved device.

 

MFT systems also provide IT managers with full access to files and folders so they can actively monitor what data is being accessed, when and by whom. What’s more, they eliminate the need for USBs and other types of remote storage devices.

 

Underlying all of this, of course, is the need to proactively monitor and track all network activity. Security breaches are often accompanied by noticeable changes in network activity – a spike in afterhours traffic here, increased login attempts to access secure information there.

 

Network monitoring software can alert you to these red flags and allow you to address them before they become major issues. Whatever you do, do not idly sit back and hope to protect your data. Instead, remain ever vigilant and on guard against potential threats, because they can come from many places – and devices.

 

Find the full article on Government Computer News.

The other day I was having a conversation with someone new to IT (they had chosen to pursue the programming track of IT, which can be an arduous path for the uninitiated!). The topic of teaching, education, and learning to program came up, and I’d like to share an analogy which not only works for programming but which I’ve also found pretty relevant to all aspects of an IT management ecosystem.

 

The act of programming, like any process-driven methodology, is an iterative series of steps. Some days you will learn tasks which ultimately remain relevant to future tasks. And then there are tasks you’ll need to perform which are not only absolutely nothing like what you did the previous day, they’re unlike anything you’ve ever seen or imagined in your whole career, whether you’re just starting out or have been banging at the keys all your life. The analogy I chose for this was cooking. I know, I know, but the relevance should hopefully resonate with you!

 

When you’re starting out cooking, correct me if I’m wrong, but you’re usually not expected to prepare a perfectly rising, never-falling soufflé. No, not at all.  That would be a poor teaching practice, one where you’re setting someone up for failure.  Instead, let’s start out somewhere simple, like boiling water. You can mess up boiling water, but once you start to understand the basic premise of it you can use it for all kinds of applications: sterilizing water, cooking pasta or potatoes, the sky is the limit!  Chances are, once you learn how to boil water you’re not going to forget how to do it, and perhaps you’ll get even better at it, or find even more applications. The same is true systematically: once you start navigating into PowerShell, Bash, Python, or basic batch scripts, what you did once you’ll continue to do over and over again, because you understand it and, more so, you’ve got it down pat!

 

The day will come, however, when you’re asked to do something you didn’t even think about the day prior. No longer are you asked for a basic PowerShell script to dump users’ last login, something you could whip up in a single line of code (boiling water). Instead you’re asked to parse an XLS or CSV file and make a series of iterative changes throughout your Active Directory. (For a practical use case: query Active Directory for workstations which haven’t logged in or authenticated in the past 60 days, dump that into a CSV file, compare them against a defined whitelist you have in a separate CSV, omit specific OUs, perform a mass disable of those computer accounts while also moving them into a temporary ‘Purge in 30 days’ OU, and generate a report to review.) Oh, and we also want this script to run daily, but it can’t have any errors or impact any production machines which don’t meet these criteria.   Let’s, for the sake of argument… call this our soufflé…

 

Needless to say, that’s a pretty massive undertaking for anyone who was great at scripting the things they’ve done a million times before.   That is what is great about IT and cooking, though.   Everything is possible, so long as you have the right ingredients and a recipe you can work from.   In the scenario above, performing every one of those steps all at once might seem like reaching for the moon, but you may find that if you can break it down into a series of steps (a recipe) and you’re able to perform each of those individually, it becomes much more consumable to solve the bigger problem and tie it all together.

 

What is great about IT as a community is just that: we are a community of people who have either done these things before, or have done portions of them before, and are often willing to share.    Add to that the fact that we’re a “sharing is caring” kind of community that will often share our answers to complex problems, or work actively to help solve a particular problem.  I’m actually really proud to be a part of IT and of how much we want each other to succeed while we continually fight the same problems, irrespective of the size of the organization or where in the world it sits.

 

I’m sure every single one of us who has a kitchen has a cookbook or two on the shelves with recipes and ‘answers’ to that inevitable question of how to go about making certain foods or meals.   What are some of the recipes you’ve discovered or worked out over the course of your careers to help bring about success?  I personally always enjoyed taking complex scripts for managing VMware and converting them into ‘one-liners’ which were easy to understand and manipulate, both as a way for others to learn how to shift and change them, and so I could run reports which were very specific to my needs at the moment while managing hundreds and thousands of datacenters.

 

I’d love it if you’d share some of your own personal stories, recipes, or solutions, and whether this analogy has been helpful, whether in explaining what we do in IT to family who may not understand, or maybe in cracking the code on your next systems management or process challenge!

 

(For a link to my One-Liners post check out; PowerCLI One-Liners to make your VMware environment rock out! )

Apologies for the doomsday reference, but I think it’s important to draw attention to the fact that business-critical application failures are creating apocalyptic scenarios in many organizations today. As businesses become increasingly reliant on the IT infrastructure for hosting applications and Web services, the tolerance for downtime and degraded performance has become almost nil. Everyone wants 100% uptime with superior performance of their websites. Whether applications are hosted on-premises or in the cloud, whether they are managed by your internal IT teams or outsourced to managed service providers, maintaining high availability and application performance is a top priority for all organizations.

 

Amazon Web Services® (AWS) experienced a massive disruption of its services in Australia last week. Across the country, websites and platforms powered by AWS, including some major banks and streaming services, were affected. Even Apple® experienced a widespread outage in the United States last week, causing popular services, including iCloud®, iTunes® and iOS® App Store to go offline for several hours.

 

I’m not making a case against hosting applications in the cloud versus hosting on-premises. Regardless of where applications are running—on private, public, hybrid cloud, or co-location facilities—it is important to understand the impact and aftermath of downtime. Take a look at these statistics that give a glimpse of the dire impact due to downtime and poor application performance:

  • The average hourly cost of downtime is estimated to be $212,000.
  • 51% of customers say slow site performance is the main reason they would abandon a purchase online.
  • 79% of shoppers who are dissatisfied with website performance are less likely to buy from the same site again.
  • 78% of consumers worry about security if site performance is sluggish.
  • A one-second delay in page response can result in a 7% reduction of conversions.

*All statistics above sourced from Kissmetrics, Brand Perfect, Ponemon Institute, Aberdeen Group, Forrester Research, and IDG.

 

Understanding the Cost of Application Downtime

  • Financial losses: As seen in the stats above, customer-facing applications that perform unsatisfactorily affect online business and potential purchases, often resulting in customers taking their business to other competitors.
  • Productivity loss: Overall productivity will be impacted when applications are down and employees are not able to perform their job or provide customer service.
  • Cost of fixing problems and restoring services: IT departments spend hours and days identifying and resolving application issues, which involves labor costs, and time and effort spent on problem resolution.
  • Damage to brand reputation: When there is a significant application failure, customers will start to have a negative perception of your organization and its services, and lose trust in your brand.
  • Penalty for non-compliance: MSPs with penalty clauses included in service level agreements will incur additional financial losses.

 

Identifying and Mitigating Application Problems

Applications are the backbone of most businesses. Having them run at peak performance is vital to the smooth execution of business transactions and service delivery. Every organization has to implement an IT policy and strategy to:

  1. Implement continuous monitoring to proactively identify performance problems and indicators.
  2. Identify the root cause of application problems, apply a fix, and restore services as soon as possible, while minimizing the magnitude of damage.

 

It is important to have visibility, and monitor application health (on-premises and in the cloud) and end-user experience on websites, but it is equally important to monitor the infrastructure that supports applications, such as servers, virtual machines, storage systems, etc. There are often instances where applications perform sluggishly due to server resource congestion or storage IOPS issues.

 

Share your experience of dealing with application failures and how you found and fixed them.

 

Check out this ebook "Making Performance Pay - How To Turn The Application Management Blame Game Into A New Revenue Stream" by NEXTGEN, an Australia-based IT facilitator, in collaboration with SolarWinds.


On Thursday, we had our first Atlanta #SWUG (SolarWinds User Group).

THWACKsters from the greater Atlanta area and as far as New Jersey (shout out to sherndon1!) came together to celebrate THWACK’s 13th birthday & to unite in our love of SolarWinds.

 


 

We did things a little differently this time by starting with a live announcement broadcast to the THWACK homepage and our SolarWinds Facebook page.

In case you missed it, the live announcement & our gift to you is:

1) #THWACKcamp is coming back (September 14th & 15th)!

2) Registration is now open>> THWACKcamp 2016

3) If you register during the month of June you will be entered to win a trip to come to Austin for the live event for you + 1! (USA, CAN, UK & Germany are eligible)

 

Here’s a brief recap of the presentations & resources from the SWUG:

 

MC for this SWUG: patrick.hubbard, SolarWinds Head Geek

 

'NetPath™: How it works'

cobrien, Sr. Product Manager (Networks)

Slides: https://s3-us-west-2.amazonaws.com/swug/Atlanta/How+NetPath+Works.pptx

 

Chris O’Brien gave an overview and live demonstration of NetPath (a feature in the latest release of NPM 12). See it in action:

 

 

'Systems Product Roadmap Discussion + Q&A'

stevenwhunt, Sr. Product Manager (Systems)

 

What we’re working on for Systems products:

 

'Customer Spotlight: Custom Alerts for your Environment'

njoylif, SolarWinds MVP

Slides: https://s3-us-west-2.amazonaws.com/swug/Atlanta/RH_SWUG.pptx

 

Larry did an excellent job of talking about all the ways you can leverage custom properties in your environment.

He also connected the dots and demonstrated how alerting and custom properties go hand in hand.

 


 

'SolarWinds Best Practices & Thinking Outside of the Box'

Dez, Technical Product Manager

KMSigma, Product Manager

 

Reach out to Dez or KMSigma for additional questions on the topics discussed:

  1. SAM as UnDP and TCP port monitoring

     • OIDs, and MIBs, and SNMP – Oh, my.

  2. NCM approval process with RTN for security needs

     • Manager & Multistage Approval, Real-time Notification, and Compliance

  3. Alerting best practices

     • Leveraging Custom Properties to reduce Alert Noise

  4. NetFlow Alerting with WPM

     • Using WPM for Alerts that are not currently available within NTA

  5. Optimizing your SolarWinds Installation

     • Building Orion Servers based on Role

  6. Upgrade Advisor

     • Upgrade Paths Simplified

 

'Customer Spotlight: HTML is Your Friend: Leveraging HTML in Alert Notifications'

bsciencefiction.tv, SolarWinds MVP

mikegale

adamlboyd (Not in attendance, but may be able to help with any follow up questions)

Slides: https://s3-us-west-2.amazonaws.com/swug/Atlanta/SWUG_Presentation_HTML.pptx

 

Kevin & Michael did a great job of showing us how they are using HTML to customize their alerts.

Specifically, they showed us how they were able to customize their email alerts, which solved for:

  • Getting teams to pay attention to their alerts
  • Making alerts easy to digest
  • Making issues easy to recognize
  • HTML 5 being responsive on mobile, so their alerts look good on different devices


 

SolarWinds User Experience "Dear New Orion User..."

meech, User Experience

 

 

Thank you to everyone who attended the event! We enjoyed meeting and talking with each of you.

We hope you’ll keep the conversation going on this post and on THWACK.

 

And last (but not least), thank you to our sponsor and professional services partner, Loop1Systems (BillFitz_Loop1) for hosting the happy hour!

Additional information on Loop1Systems: https://s3-us-west-2.amazonaws.com/swug/Atlanta/Loop1_SWUG.pptx

 

**If you left without filling out a survey, please help us out by telling us how we can make SWUGs even better>> http://thwack.SWUG-Feedback.sgizmo.com/s3/

 

If you are interested in hosting a SWUG in your area, please fill out a host application here>> http://thwack.swug-host.sgizmo.com/s3/

 

If you’re attending Cisco Live this year, RSVP to attend the SWUG we’re hosting during the event!

Virtualization admins have many responsibilities and wear many IT hats. These responsibilities can be aggregated into three primary buckets: planning, optimizing, and maintaining. Planning covers design decisions as well as discovery of virtual resources and assets. Optimizing encompasses tuning your dynamic virtual environment so that it's always running as efficiently and effectively as possible given all the variables of your applications, infrastructure, and delivery requirements. Maintaining is creating the proper alerts and intelligent thresholds to match your constantly changing and evolving virtual data center. To do all these things well at various scales, virtualization admins need to monitor with discipline and a proper tool.

 

And sometimes you need a helping hand. Join me and jhensle on June 15th at 1PM CT as we cover planning, optimization, and maintenance with Virtualization Manager to deliver optimal performance in your virtual data center.

 


Summer is here! Well, for those of us in the Northern hemisphere at least. The unofficial start of summer for many of us here is the last weekend in May. For me, the official start to summer is when I first place my feet into the ocean surf after the last weekend in May. And since that happened this past weekend, I declare summer open for business.

 

Anyway, here is this week's list of things I find amusing from around the Internet...

 

How every #GameOfThrones episode has been discussed on Twitter

Because apparently this is a show worth watching, I guess, but I haven't started yet. It's about a throne or something?

 

Mark Zuckerberg's Twitter and Pinterest password was 'dadada'

Makes sense, since Zuckerberg doesn't care much about our privacy and security, that he probably didn't care too much about his own.

 

It's Time To Rethink The Password. Yes, Again

Yes, it's worth reminding all of you about data security again, because it seems that the message isn't getting through, yet.

 

Edward Snowden: Three years on

Since I'm already on the security topic, here's the obligatory mention of Snowden.

 

NFL Players' Medical Information Stolen

And a nice reminder that this is the league that cannot figure out how to accurately measure the air pressure in footballs, so don't be surprised they don't know how to encrypt laptops.

 

Where people go to and from work

Wonderful visualization of where people commute to and from work around the USA.

 

Solar FREAKIN' Roadways!

In case you haven't seen this yet, and since Summer is here, a nice reminder of how we could be using technology to raise our quality of life.

 

Here's to the start of Summer, a sunset at Watch Hill last weekend:


Every toolbox has a tool that is used on problems even though it's well past retirement age. Perhaps it's an old rusty screwdriver that has a comfortable handle. Or a hammer with a broken claw that has driven hundreds of nails. We all have tools that we fall back on even when better options are available. In the world of building and repairing physical things that means the project will take more time. But in the networking world old tools can cause more problems than they solve.

Traceroute is a venerable tool used for collecting information about network paths. Van Jacobson created it in 1987 to help measure the paths that packets take through the network and the delays along those paths. It sends UDP packets with incrementing Time To Live (TTL) values, starting at 1, to force each system along the path to send a special ICMP message back to the source. These ICMP messages help the originating system build a table of the hops in the path as well as the latency to each hop.

Traceroute has served the networking world well for a number of years. It has helped many of us diagnose issues in service provider networks and figure out routing loops. It's like the trusty screwdriver or hammer in the bottom of the toolbox that's always there waiting for us to use them to solve a problem. But Traceroute is starting to show its age when it comes to troubleshooting.

  • Traceroute requires ICMP messages to be enabled on the return path to work correctly. The ICMP Time Exceeded message is the way Traceroute works its magic. If a firewall in the path blocks that message from returning, everything on the other side of that device is a black hole as far as Traceroute is concerned. Even though networking professionals have been telling security pros for years to keep ICMP enabled on firewalls, there are still a surprising number of folks who turn it off to stay "safe".
  • Traceroute doesn't work well with multiple paths. Traceroute was written in the day when one path was assumed to exist between two hosts. That works well when you have a single high-speed path, but today's systems can take advantage of multiple paths to different devices across carrier networks. The devices in the middle can find the optimal path for traffic and steer it in that direction. Traceroute is oblivious to these path changes.
  • Traceroute can only provide two pieces of information. As long as all you really want to know about a network is a hop-by-hop analysis of a path and the delay on that path, Traceroute is your tool. But if you need to know other information about that path, like charting the latency over time or knowing when the best time would be to pick a specific path through the network, then Traceroute's utility becomes significantly limited.

In a modern network, we need more information than we can get from a simple tool like Traceroute. We need to collect all kinds of data that we can't get from a thirty-year-old program written to solve a specific kind of issue. What we need is a better solution built into something that can collect the data we need from multiple points in the network and give us instant analysis about paths and how our packets should be getting from Point A to Point B as quickly as possible.

That would be something, wouldn't it?

Full disclosure: In my work life, I spend my time at OnX Enterprise Solutions, where we’re a key partner with HPE (Hewlett Packard Enterprise). The following are some thoughts regarding some of the aggressive steps recently taken by Meg Whitman and the board. To me, these seem bold and decisive. They have been pivoting to set themselves up for the future of IT.

 

 

  1. The changing of the focus for Helion. In recent travel, and conversations with customers, I’ve heard a distinct misconception regarding the stance of Hewlett Packard Enterprise and the Helion Platform. The mistake appeared when HPE made the statement that Helion would no longer be a Public Cloud platform. The goal here was not to do away with the OpenStack approach, but rather to focus those efforts on Hybrid and Private solutions.
    1. Helion is not just a fully supported OpenStack solution, but rather an entire suite of products based on the Helion OpenStack platform. Included are products like Eucalyptus, Cloud Service Automation, Content Depot, Helion CloudSystem and quite a few more. For further information on the suite, check out this link.
    2. The key here is that while the press stated that Helion was done for, the fact is that HPE simply narrowed the market for the product. Instead of necessarily competing with the likes of IBM, Amazon, and Microsoft, it now becomes a more secure and functional extension of, or replacement for, your datacenter’s approach.
  2. The splitting of the organization from one umbrella company to two distinct elements: That of HPE and HP Inc. The former entails all enterprise class solutions including storage, servers and enterprise applications, the latter supplies printers, laptops, and most-profitably, supplies like printer ink and toner.
    1. The split offers up the opportunity for salespeople and technical resources to align with the proper silos of internal resources, thereby expediting the dissemination of information about offerings through their channels more efficiently.
    2. There’s little doubt that the split of the organization creates a level of operational and cost efficiency with which the company had some struggles.
    3. And, just within the last couple days, the Professional Services organization has made a fairly significant announcement to join in with CSC (Computer Sciences Corporation) professional services to become if not a single organization, at minimum a fully collaborative one. As many know, the services arm of HPE was bolstered by the addition of EDS in 2009. While I’m not sure that this will have a lot of impact on the quality of the professional services delivered, it leaves very little doubt that the sheer magnitude of resources from which to draw will be quite a bit larger.
  3. The emergence of Converged and HyperConverged as well as Composable Infrastructure platforms as a strategic move forward.
    1. Converged platforms like vBlock and FlexPod have been around for years. In addition, the more pod-like solutions like Simplivity and Nutanix, as well as many others have become quite popular. HPE has been working diligently on these platforms, with a more full-scope approach. Whether your organization has one of the smaller 250’s or the larger 380 series hyperconverged devices, and/or some of the larger systems, they’ve been designed to exist within a full ecosystem of product. All part of one environment. The HPE convergence platforms have come together quite nicely as an overall “Stack” for infrastructure regardless of sizing. Augmented by the utilization of HPE’s StorVirtual (the old LeftHand product, which has been expanded quite a bit in the last couple years), OpenView (For all intents and purposes, the most robust, along with the most mature tool of its ilk in the space), these offerings make the creation, deployment and management of physical and virtual machines a much more agile process.
    2. HPE Synergy, the HPE approach to a fully composable infrastructure, seeks to extend the model of “frictionless” IT to leverage a DevOps approach. Scripting is built in, fully customizable, and there are plans to offer a “GitHub”-like approach to the code involved. For a full overview, refer to this webpage. Synergy will involve a series of both hardware and software components dedicated to the purpose, so that it can serve as a recipe for large to huge enterprises, supporting the entire infrastructure from soup to nuts. Again, I’ve never seen a portfolio of products that rivals this set of solutions for a global approach to the data center. I have seen a number of attempts to move in this direction, and feel that we’re finally achieving the x86 mainframe approach that has been directional in enterprise infrastructure for years.

 

All in all, I would have to say that I’m impressed with the way that HPE, which in past years has not always been seen as taking the most forward-thinking approaches, is making strategic maneuvers toward the betterment of the company. They’re deep in the market space, with R&D in practically every area of IT. OpenStack, all-flash arrays, hybrid storage, federated management, backup and recovery, new memory tech like 3D XPoint, and many other pieces of the technology landscape are being explored, utilized, and incorporated into the product set. I feel that this venerable company is proving that the startup world is not the only place where innovation occurs in IT, and that a large organization can accomplish some amazing things as well.

Today’s federal IT infrastructure is built in layers of integrated blocks, and users have developed a heavy reliance on well-functioning application stacks. App stacks are composed of application code and all of the software and hardware components needed to effectively and reliably run the applications. These components are very tightly integrated. If one area has a problem, the whole stack is affected.

 

It’s enough to frustrate even the most hardened federal IT manager. But don’t lose your cool. Instead, take heart, because there are better ways to manage this complexity. Here are areas to focus on, along with suggested methodologies and tools:

 

1. Knock down silos and embrace a holistic viewpoint.

 

Thanks to app stacks, the siloed approach to IT is quickly becoming irrelevant. Instead, managing app stacks requires realizing that each application serves to support the entire IT foundation.

 

That being said, you’ll still need to be able to identify and address specific problems when they come up. But you don’t have to go it alone; there are tools that, together, can help you get a grasp on your app stack.

 

2. Dig through the code and use performance monitoring tools to identify problems.

 

There are many reasons an app might fail. Your job is to identify the cause of the failure. To do that you’ll need to look closely at the application layer and keep a close eye on key performance metrics using performance monitoring tools. These tools can help you identify potential problems, including memory leaks, service failures and other seemingly minor issues that can cause an app to nosedive and take the rest of the stack with it.

 

3. Stop manually digging through your virtualization layers.

 

It’s likely that you have virtualization layers buried deep in your app stack. These layers probably consist of virtual machines that are frequently migrated from one physical server to another and storage that needs to be reprovisioned, reallocated and presented to servers.

 

Handling this manually can be extremely daunting, and identifying a problem in this environment can seem impossible. Consider integrating an automated VM management approach with the aforementioned performance monitoring tools to gain complete visibility into these key app stack components.

 

4. Maximize and monitor storage capabilities.

 

Storage is the number one catalyst behind application failures. The best approach here is to ensure that your storage management system helps monitor performance, automates capacity planning, and reports regularly so you can ensure applications continue to run smoothly.

 

You’ll be able to maintain uptime, leading to consistent and perhaps increased productivity throughout the organization. And you’ll be able to keep building your app stack – without the fear of it all tumbling down.

 

Find the full article on Government Computer News.


It's hard to contain my excitement or stop wishing away the hours until my flight tomorrow, as I prepare for DevOpsDays DC this week.

 

After all the discussions from my trip to Interop, I've got my convention go-bag ready. The venue is awesome. The schedule looks stellar. And we were even able to throw some free tickets your way so we could be sure SolarWinds was represented in the audience (looking at you, Peter!).

 

Whenever we venture out to a DevOps Days get together, the conversation among the team is, "What stories do we want to tell?". The truth is, each of us has our own personal stories that we gravitate to. I like to talk about the THWACK community and how the time has come for monitoring to become its own specialty within IT like storage, networking, or infosec. Connie has a great time discussing ways our monitoring tools can be leveraged for DevOps purposes. And, of course, Patrick spends his DevOps Days spreading the gospel truth about SWIS, SWQL, and the Orion SDK.

 

However, this week, the story is going to be NPM 12. It really couldn't be about anything else, when you think about it.

 

It's the story I want to tell because it fills a gap that I think will resonate with the DevOps community: as you strive to create ever more scalable systems that can be deployed quickly and efficiently - which almost by definition includes cloud and hybrid-cloud deployments - how can you be sure that everything in between you (and the user) and the system is operating as expected?

 

When everything was on-prem that question was, if not easy to ascertain, at least simple to define. A known quantity of systems all under the company's control had to be monitored and managed.


But how do we manage it now, when so much relies on systems that sit just past the ISP demarc?

 

As you know, NPM 12 and NetPath address exactly that question. And hopefully the attendees at DevOps Days DC will appreciate both the question and the answer we're providing.

 

For now, I'm biding my time, packing my bags, and looking wistfully at the door wondering if time will keep dragging along until I can get on the road and start telling my story.

This is the second installment of a three-part series discussing SD-WAN (Software Defined WAN), what current problems it may solve for your organization, and what new challenges it may introduce. Part 1 of the series, which discusses some of the drawbacks and challenges of our current WANs, can be found HERE.  If you haven’t already, I would recommend reading that post before proceeding.

 

Great!  Now that everyone has a common baseline on where we are now, the all-important question is…

 

Where are we going?

 

This is where SD-WAN comes into the picture.  SD-WAN is a generic term for a controller-driven and orchestrated Wide Area Network.  I say it’s generic because there is no strict definition of what does and does not constitute an SD-WAN solution, and, as can be expected, every vendor approaches these challenges from their own unique perspectives and strengths.  While the approaches have unique qualities, the reality is that they are all solving for the same set of problems and consequently have converged on a set of similar solutions.  Below we’ll take a look at these “shared” SD-WAN concepts and how these changes in functionality can solve some of the challenges we’ve been facing on the WAN for a long time.

 

Abstraction – This is at the heart of SD-WAN solutions, even though abstraction in and of itself isn't a solution to any particular problem.  Think of abstraction like you think about system virtualization.  All the parts and pieces remain, but we separate the logic/processing (VM/OS) from the hardware (server).  In the WAN scenario, we are separating the logic (routing, path selection) from the underlying hardware (WAN links and traditional routing hardware).

 

The core benefit of abstraction is that it increases flexibility in route decisions and reduces dependency on any one piece of underlying infrastructure.  All of the topics below build upon this idea of separating the intelligence (overlay) from the devices responsible for forwarding that traffic (underlay).  Additionally, abstraction reduces the impact of any one change in the underlay, again drawing parallels from the virtualization of systems architecture.  Changing circuit providers or routing hardware in our current networks can be a time-consuming, costly, and challenging task.  When those components exist as part of an underlay, migration from one platform to another, or one circuit provider to another, becomes a much simpler task.

 

Centralized Perspective - Unlike our current generation of WANs, SD-WAN networks almost universally utilize some sort of controller technology.  This centrally located controller is able to collect information on the entirety of the network and intelligently influence traffic based on analysis of the performance of all hardware and links.  These decisions then get pushed down to local routing devices to enforce the optimal routing policy determined by the controller.

 

This is a significant shift from what we are doing today, where each and every routing device makes decisions based on a very localized view of the network and is only aware of performance characteristics for the links it is directly connected to.  By being able to see trouble many hops away from the source of the traffic, a centralized controller can route around it at the most opportune location, providing the best possible service level for the data flow.

 

Application Awareness - Application identification isn't exactly new to router platforms.  What is new is the ability to make dynamic routing decisions based on specific applications, or even sub-components of those applications.  Splitting traffic between links based on business criticality and ancillary business requirements has long been a request of both small and large shops alike.  Implementing these policy-based routing decisions in the current generation of networks has almost always produced messy and unpredictable results.

 

Imagine being able to route SaaS traffic directly out to the internet (since we trust it and it doesn’t require additional security filtering), file sharing across your internet-based IPsec VPN (since performance isn’t as critical as for other applications), and voice/video across an MPLS line with an SLA (since performance, rather than overall bandwidth, is more important).  Now add 5% packet loss on your MPLS link… an SD-WAN solution will be able to dynamically shift your voice/video traffic to the IPsec VPN, since overall performance is better on that path.  Application-centric routing, policy, and performance guarantees are significant advancements made possible with a centralized controller and abstraction.

 

Real Time Error Detection/Telemetry – One of the most frustrating conditions to work around on today’s networks is a brown-out type condition that doesn’t bring down a routing protocol neighbor relationship.  While a visual look at the interfaces will tell you there is a problem, if the thresholds aren’t set correctly, manual intervention is required to route around such a problem.  Between the centralized visibility of both sides of the link and the collection/analysis of real-time telemetry data provided by a controller-based architecture, SD-WAN solutions have the ability to route around these brown-out conditions dynamically.  Below are three different types of error conditions one might encounter on a network and how current networks and SD-WAN networks might react to them.  This comparison assumes a branch with two unique uplink paths.

 

Black Out:  One link fully out of service.

Current Routers:  This is handled well by current equipment and protocols.  Traffic will fail over to the backup link and only return once service has been restored.

SD-WAN:  SD-WAN handles this in identical fashion.

 

Single Primary Link Brown Out:  Link degradation (packet loss or jitter) is occurring on only one of multiple links.

Current Routers: Traditional networks don't handle this condition well until the packet loss is significant enough for routing protocols to fail over.  All traffic will continue to use the degraded link, even with a non-degraded link available for use.

SD-WAN:  SD-WAN solutions have the advantage of centralized perspective and can detect these conditions without additional overhead of probe traffic.  Critical traffic can be moved to stable links, and if allowed in the policy, traffic more tolerant of brown out conditions can still use the degraded link.

 

Both Link Brown Out:  All available links are degraded.

Current Routers:  No remediation possible.  Traffic will traverse the best available link that can maintain a routing neighbor relationship.

SD-WAN:  Some SD-WAN solutions provide answers even for this condition.  Through a process commonly referred to as Forward Error Correction, traffic is duplicated and sent out all of your degraded links.  A small buffer is maintained on the receiving end and packets are re-ordered once they are received.  This can significantly improve application performance even across multiple degraded links.

 

Regardless of the specific condition, the addition of a controller to the network gives a centralized perspective and the ability to identify and make routing decisions based on real-time performance data.

 

Efficient Use of Resources - This is the kicker, and I say that because all of the above solutions solve truly technical problems.  This one hits home where most executives care the most.  Due to the active/passive nature of current networks, companies who need redundancy are forced to purchase double their required bandwidth capacity and leave 50% of it idle when conditions are nominal.  Current routing protocols just don't have the ability to easily utilize disparate WAN capacity and then fall back to a single link when necessary.

 

Is it better to pay for 200% of the capacity you need for the few occasions when you need it, or pay for 100% of what you need and deal with only 50% capacity when there is trouble?

 

To add to this argument, many SD-WAN providers are so confident in their solutions that they pitch being able to drop more expensive SLA-based circuits (MPLS/Direct) in favor of far cheaper generic internet bandwidth.  If you are able to procure 10 times the bandwidth, split across 3 diverse providers, would your performance be better than a smaller circuit with guaranteed bandwidth, even with the anticipated oversubscription?  These claims need to be proven out, but the intelligence that the controller-based overlay network gives you could very well negate the need to pay for provider-based performance promises.

 

Reading the above list could likely convince someone that SD-WAN is the WAN panacea we’ve all been waiting for.  But, like all technological advancement, it’s never quite that easy.  Controller orchestrated WANs make a lot of sense in solving some of the more difficult questions we face with our current routing protocols but no change comes without its own risks and challenges.  Keep a look out for the third and final installment in this series where we will address the potential pitfalls associated with implementing an SD-WAN solution and discuss some ideas on how you might mitigate them.

Another important tip from your friendly neighborhood dev

By Corey Adler, Professional Software Developer

 

Greetings again, Thwack! It is I, your friendly neighborhood dev, back again with another tip to help you with your code education. I appreciate all of the wonderful feedback that I’ve seen so far on my previous posts (So You’re Learning to Code? Let’s Talk, Still Learning to Code? Here’s a Huge Tip, and MOAR CODING TIP! AND LOTS OF CAPS!), and am grateful to be able to help you out in any way that I can. Although my previous posts have dealt with actually coding, this post is going to deal with something else. You see, I’ve been here trying to help you now for 3 posts, and I’d like to think that I’m doing fairly well on that score, but you should know that in programming, much like in other fields, there are people who mean well when they want to help you but will fail miserably when push comes to shove. The kinds of people who shouldn’t be allowed to teach at all, even when they are motivated and want to do so. When these people do come to your aid you’ll typically end up the worse for it. And so, to that end:

 

                6) STAY AWAY FROM BROGRAMMERS. VERY, VERY, VERY FAR AWAY.

 

So what is a brogrammer, you ask? According to Urban Dictionary, a brogrammer is defined as: “A programmer who breaks the usual expectations of quiet nerdiness and opts instead for the usual trappings of a frat-boy: popped collars, bad beer, and calling everybody ‘bro’. Despised by everyone, especially other programmers.”

 

Now, before we get any further on this I want to emphasize that I am not singling out specific people. This is not a personal attack on anyone (in case any of the ones that I’ve met for some reason end up on Thwack, which I highly doubt). This is entirely a professional rebuke of the whole brogrammer mindset and attitude. It’s become more and more commonplace amongst entry-level software developers, and is one that can be highly toxic to those projects or people that are associated with them.

 

There are, traditionally, 2 problem areas that a brogrammer fits right into:

  1. Sexism
  2. Bad Craft

 

Let’s tackle these in reverse order.

 

What do I mean by “bad craft”? Oftentimes with brogrammers you will find an unwillingness to spend the extra time and attention to detail needed to create good code. Brogrammers tend to care about coming in each day, doing their 9-to-5 (not a minute more), and leaving to go have fun with their friends. They tend not to try and better their craft unless it’s absolutely necessary for them to keep their jobs (and beer money). Let me illustrate with an example: There exists a well-known company (name withheld for probable legal reasons) about which a former employee wrote the following:

 

“[Company] is run on a tangled mess of homegrown tools, horrendously fragile code and the worst engineering practices I've ever seen from any company. There is no QA, code reviews aren't taken seriously, anyone can commit to master and push their code to production at any time…Brogramming is real and [Company] exemplifies it. It was the norm for bros to knowingly push buggy, incomplete, untested code into production after a few rounds of drinks then leave the problems for others while they moved onto another project.”

 

Or how about from this company:

 

“Once you're in, you'll be subject to one of the most cliquish offices I've had the "pleasure" to work in (I've been in IT for 8 years). The best word I can use to describe the existing staff: brogrammers. Definitely male-dominated. Every guy is an alpha male who probably rather go and lift weights, instead of writing solid code. The code base is a mess. Barely documented, sloppy, and basically changed on an ad hoc basis (if you ask about requirements documents you'll be laughed at). Everyone seems far too busy to do any kind of peer-review and coaching amounts to an irritated developer grudgingly taking a moment to help, and the help amounts to the developer doing it and basically saying, "there" and leaving.”

 

As someone who is starting out and trying to learn the basics of programming, it is incumbent on you to find the best role models: people who will help you learn and grow in the ways of the Force (i.e., programming). Brogrammers are the complete opposite of what you need. They don’t want to use good programming practices or have clean, understandable code. In such circumstances you would most likely be worse off after asking them for help.

 

Having said all that about “bad craft,” let me just mention that this problem is a drop in the bucket compared to the other problem: rampant sexism. Now, there is no possible way for me to fully cover all of the horror that is sharing an industry with these “people” in a short space such as this, but let me at least give you a taste of what I’m talking about. You see, the above stuff that I wrote about bad craft? I wrote that 6 months ago. This next part? It’s taken a while. It’s been difficult to put my feelings into words that I can type without dry heaving into the wastebasket next to me at the thought that these “people” use the same tools that I do, use the same languages that I do, and pretend that their jobs are the same as mine. Let’s see some of the horror stories that I found in my research, both online and from talking to my female co-workers.

 

  • Take, for example, a very famous social media company who decided to have an employee appreciation party. They even, it being an employee appreciation event, decided to let the employees vote and decide what type of party it was going to be. Not a bad idea, right? Pretty innocuous…until they voted to have it in the style of a frat party. The real cherry on top? Said company was embroiled in a lawsuit brought by a female former engineer for discrimination against women. How inclusive of them.
  • Or those times that my friend C, a fellow developer, would try to make suggestions to a client about the project she was working on and have them be rejected…only for said client to accept it once one of her male coworkers suggested it.
  • Or, as Leon sent me on Twitter, that time that someone spoke at SQL Saturday and received an evaluation form from someone that suggested that she could improve her speaking by wearing a negligee. (http://bit.ly/1TmgQ5a)
  • Or the engineer who was asked whether her job at the company was to “photocopy [stuff]” and that she was “too pretty to code.” (http://bit.ly/1CHmRqx)
  • Or the whole ball of putrid, stinking mess that is Gamergate.
  • Or the women who have gone to conferences…and been groped. (http://read.bi/1YGqcMI)
  • Or the developer conference in which a company sponsored a Women in Games lunch…and then had an after-party featuring scantily clad women dancing (http://bit.ly/1VqXgcl)
  • Or any of the tons of other stories that I found online while feeling my heart sink deeper and deeper, wondering how the heck we in programming and technology as a whole fell this low this fast.

 

The industry of Grace Hopper. Ada Lovelace. The ENIAC Programmers. Adele Goldberg. So much of the foundations for what we do every single working day of our lives we owe to these and other noble women, who labored to provide us, their future, with this amazing gift. And now? We get excited that big tech companies like Google and eBay have 17% of their developer force as women! 17%. Since when has 17% been a big number? 17% is only a big number when you’re a 3rd-party candidate running for public office. For myself, I went through college with only one computer science major class that had more than 3 women in it, and that was only because it was cross-listed with the art department, which 2 of the women were from. We once could proudly say that nearly 40% of our graduates were women. Now? It’s less than half of that. Sure, there are tons of factors that can play into that, but none of them hurts more, none of them is quite as much a punch to the gut, as brogrammers.

 

Because this is not who we are. This is not who we are supposed to be. We can and should be a hell of a lot better than this! We are the industry of outsiders. We are the ones who played D&D and Magic: The Gathering in high school instead of trying to be like everyone else. We are the ones who talked about Science Fiction until people’s ears fell off. We are the ones who, when other industries demanded suits, ties, and other more “professional” attire, said “No, thank you.” We were going to show all of them how we do things: our way. And yet we’ve fallen victim to this just like everyone else has. We let this happen to ourselves. We have no one else to blame. No one else that we should blame except ourselves.

 

The first step in solving a problem is recognizing that there is one. Brogrammers are a dollar store mustard stain on our industry, and one that seemingly keeps on growing. Their sexism, as well as their lack of care for the craft that we hold dear to our hearts, hurts all of us who strive for better every day and who think that these attitudes are obtuse, thick-headed, and generally uncouth. But even you, the programming novice, can help out. Even you can do just one tiny little thing to help us fight this scourge of human garbage: stay very far away from them. Don’t give them, with their pathetic, tiny little brains, the dignity of even acknowledging that they have some skill. And, if it’s not too much trouble, be sure to tell them all of this if they ever ask you why you don’t ask them for help. Because we don’t want them here. Don’t be quiet. Don’t be afraid to stand up against them. And don’t ever make the mistake of thinking that they have anything that we should learn from.

 

</rant>

 

Let me know how you feel in the comments below, and please feel free to share your stories about these and other tech troglodytes either here or by messaging me on Thwack. Until next time, I wish you good coding.

 

https://thwack.solarwinds.com/community/solarwinds-community/geek-speak_tht/blog/2015/12/01/moar-coding-tip-and-lots-of-caps

More often than not, application owners look to their vendors to provide a list of requirements for a new project, and the vendor forwards specifications that were developed around maximum load, back in the age of physical servers.  These requirements eventually make their way onto the virtualization administrator's desk: 8 Intel CPUs, 2GHz or higher, 16 GB memory.  The virtualization administrator is left to fight the good fight with both the app owner and the vendor.  We all cringe when we see requirements like these. We’ve worked hard to build out our virtualization clusters, pooling CPU and memory to present back to our organizations, and we constantly monitor our environments to ensure that resources are available to our applications when they need them – and then a list of requirements like the above comes across through a server build form or email.

 

So what's the big deal?

 

We have lots of resources, right? Why not just give people what they want?  Before you start giving away resources, let’s take a look at both CPU and memory and see just how VMware handles scheduling and overcommitment of each.

 

CPU

 

One of the biggest selling points of virtualization is that we can attach more vCPUs to VMs than we have physical CPUs on our hosts and let the ESXi scheduler take care of scheduling each VM's time on CPU.  So why don’t we just go ahead and give every VM 4 or 8 vCPUs?  You would think granting a VM more vCPUs would increase its performance – and it certainly can – but it can also hurt performance, not just on that single VM but on the other VMs running on the host as well.  Since the physical CPUs are shared, there may be times when the scheduler has to place CPU instructions on hold and wait for physical cores to become available.  For instance, a VM with 4 vCPUs has to wait until 4 physical cores are available before the scheduler can execute its instructions, whereas a VM with 1 vCPU only has to wait for 1 core.  As you can tell, having multiple VMs, each with multiple vCPUs, can end up producing a lot of queuing, waiting, and CPU ready time on the host, resulting in a significant impact on performance.  Although VMware has made strides in CPU scheduling by implementing “Relaxed Co-Scheduling,” it still only allows a certain skew between the execution of instructions across cores and does not completely solve the issues around scheduling and CPU ready.  It’s always best practice to right-size your VMs in terms of the number of vCPUs to avoid as many scheduling conflicts as possible.
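If you want to put a number on that waiting, a commonly used conversion is to express the CPU ready summation counter as a percentage of the sampling interval. A minimal sketch of the arithmetic, assuming vCenter's 20-second real-time sample and a made-up reading:

```python
def cpu_ready_percent(ready_summation_ms: float, interval_seconds: int = 20) -> float:
    """Convert a CPU ready summation value (milliseconds of wait time)
    into a percentage of the sampling interval."""
    return (ready_summation_ms / (interval_seconds * 1000.0)) * 100.0

# Example: a VM reporting 1,600 ms of ready time in a 20-second sample
# spent roughly 8% of that interval waiting for physical cores.
print(f"{cpu_ready_percent(1600):.1f}% CPU ready")
```

The same formula works for the historical rollups if you swap in the longer interval (for example, 1800 seconds for the 30-minute rollup).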

 

Memory

 

vSphere employs many techniques when managing virtual machine memory.  VMs can share memory pages with each other, eliminating redundant copies of the same pages.  vSphere can also compress memory and use ballooning, which essentially lets the hypervisor reclaim memory from one VM so it can be lent to another.  This built-in intelligence almost masks any performance issues we might see from overcommitting RAM to our virtual machines.  That said, memory is often one of the first resources we run out of, and we should still take precautions to right-size our VMs and prevent waste.  The first thing to consider is overhead: by assigning additional, unneeded memory to our VMs we increase the amount of overhead memory the hypervisor uses to run the virtual machine, which in turn takes memory from the pool available to other VMs.  The amount of overhead is determined by the amount of assigned memory as well as the number of vCPUs on the VM, and although this number is small (roughly 150MB for a 16GB RAM/2 vCPU VM), it begins to add up as our consolidation ratios increase.  Aside from wasted memory, overallocation also causes unnecessary waste on the storage side of things.  Each time a VM is powered on, a swap file is created on disk equal in size to the allocated memory (less any reservation).  Again, this may not seem like a lot of wasted space at the time, but as we create more and more VMs it can add up to quite a bit of capacity.  Keep in mind that if there is not enough free space available to create this swap file, the VM will not power on.
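To get a feel for how that adds up, here is a back-of-the-envelope sketch. The fleet numbers are invented; the .vswp sizing (allocated memory minus any reservation) follows vSphere's documented behavior:

```python
def vswp_size_gb(allocated_gb: float, reservation_gb: float = 0.0) -> float:
    """Size of the per-VM swap file created at power-on:
    configured memory minus any memory reservation."""
    return max(allocated_gb - reservation_gb, 0.0)

# Hypothetical VMs: (allocated GB, actively used GB, reservation GB)
vms = [(16, 6, 0), (32, 10, 0), (8, 7, 0)]

swap_gb = sum(vswp_size_gb(alloc, resv) for alloc, _, resv in vms)
overalloc_gb = sum(alloc - used for alloc, used, _ in vms)
print(f"Datastore capacity consumed by swap files: {swap_gb} GB")
print(f"Memory allocated beyond observed usage:   {overalloc_gb} GB")
```

Three VMs is trivial, but the same arithmetic across a few hundred oversized VMs is where the wasted datastore capacity and memory overhead become noticeable.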

 

Certainly these are not the only impacts that oversized virtual machines have on our environment.  They can also affect features such as HA, vMotion times, and DRS actions, but these are some of the bigger ones.  Right-sizing is not something that’s done once, either – it’s important to constantly monitor your infrastructure and go back and forth with application owners as things change from day to day and month to month.  There are plenty of applications and monitoring systems out there that can perform this analysis for us, so use them!  For the do-it-yourself crowd, a rough sketch of that kind of inventory pass follows below.
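A minimal pyVmomi sketch (pip install pyvmomi) of the idea: walk the inventory, compare configured resources with the quick stats, and flag candidates for a right-sizing conversation. The vCenter hostname, credentials, and the 25% threshold are all placeholders, and a real monitoring tool will do this with far more nuance:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="readonly@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
for vm in view.view:
    if vm.runtime.powerState != "poweredOn" or vm.config is None:
        continue
    hw = vm.config.hardware                     # configured vCPU and memory
    qs = vm.summary.quickStats                  # active memory, CPU demand, etc.
    mem_active_pct = 100.0 * qs.guestMemoryUsage / hw.memoryMB if hw.memoryMB else 0.0
    if mem_active_pct < 25:                     # arbitrary "possibly oversized" cut-off
        print(f"{vm.name}: {hw.numCPU} vCPU, {hw.memoryMB} MB allocated, "
              f"~{mem_active_pct:.0f}% active memory")

Disconnect(si)
```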

 

All that said, discovering the over- and under-sized VMs within our infrastructure is probably the easiest leg of the journey of reclamation.  Once we have some solid numbers and metrics in hand, we need to somehow present them to the business units and application owners and try to claw back resources – this is where the real challenge begins.  Placing a dollar figure on everything and utilizing features such as showback and chargeback may help, but again, it’s hard to take something back after it’s been given.  So the questions I’ll leave you with this time are as follows.  First, how do you ensure your VMs are right-sized?  Do you use a monitoring solution to do so, and if so, how often are you evaluating your infrastructure for right-sized VMs (monthly, yearly, etc.)?  Second, what do you foresee as the biggest challenges in trying to claw back resources from your business?  Do you find that it’s more of a political challenge or simply an educational challenge?

Does anyone else remember January 7th, 2000? I do.  That was when Microsoft announced the Secure Windows Initiative, an initiative dedicated to making Microsoft products secure.  Secure from malicious attack, with secure code at the forefront of all design practices, so that practitioners like us wouldn’t have to worry about patching our systems every Tuesday, or about an onslaught of viruses and malware and you name it!  That is not to say that prior to 2000 they were intentionally writing bad code, but it is to say that they made a hearty and conscious decision to ensure that code IS written securely.  So was born the era of SecOps.

 

Sixteen years have passed, and not a year has gone by in that time when I haven’t heard organizations (Microsoft included) say, “We need to write our applications securely!” as if it were some new idea they’d discovered for the first time.  Does this sound familiar for your organization, your business, and the processes you have to work with?  Buzzwords come out in the marketplace and make it into a magazine.  Perhaps new leadership or new individuals come in and say, “We need to be doing xyzOps! Time to change everything up!”

 

But to what end do we go through that?  There was a time when people adopted good, consistent, solid practices, educated and trained their employees, refined their processes to align with the business, and kept their technology from falling 10 years out of date so they could handle and manage their requirements.  But we could toss all of that out the window to adopt the flavor of the week, right?  May as well.

 

That said, some organizations, businesses, or processes receive nothing but the highest accolades and benefits by adopting a different or stricter regime for how they handle things.  DevOps, for the right application or organization, may be the missing piece of the puzzle that enables agility or abilities which truly were foreign prior to that point.

 

If you can, share which tools you found brought about your success, or which were less than useful, in realizing the dream ideal state of xyzOps.  I’ve personally found that having buy-in and commitment throughout the organization is the first step to success when it comes to adopting anything that touches every element of a transformation.

 

What are some of your experiences with organizational shifts, waxing and waning across technologies?  Where were they successful, and where were they wrought with failure, like adopting ITIL in full without considering what it takes to be successful?  Your experiences are never more important than now to show others what the pitfalls are, how to overcome challenges, and where things tend to work out well versus where they fall short.

 

I look forward to reading about your successes and failures so that we can all learn together!

In part one of this series we looked at the pain that network variation causes. In this second and final post we’ll explore how the network begins to drift and how you can regain control. 

 

How does the network drift?

It’s very hard to provide a lasting solution to a problem without knowing how the problem occurred in the first instance. Before we look at our defenses we should examine the primary causes of highly variable networks.

 

  • Time – The number one reason for shortcuts is that it takes too long to do it the ‘right way’.
  • Budget – Sure it’s an unmanaged switch. That means low maintenance, right?
  • Capacity – Sometimes you run out of switch ports at the correct layer, so new stuff is connected at the wrong layer. It happens.
  • No design or standards – The time, budget, and capacity problems are exacerbated by a lack of designs or standards.

 

Let’s walk through an example scenario. You have a de-facto standard of using layer-2 access switches, and an L3 aggregation pair of chassis switches. You’ve just found out there’s a new fifth-floor office expansion happening in two weeks, with 40 new GigE ports required.

 

You hadn’t noticed that your aggregation switch pair is out of ports so you can’t easily add a new access-switch. You try valiantly to defend your design standards, but you don’t yet have a design for an expanded aggregation-layer, you have no budget for new chassis and you’re out of time. 

 

So, you reluctantly daisy-chain a single switch off an existing L2 access switch using a single 1Gbps uplink. You don’t need redundancy; it’s only temporary. Skip forward a few months: you’ve moved on to the next crisis and you’re getting complaints of the dreaded ‘slow internet’ from the users on the fifth floor. Erm..

 

The defense against drift

Your first defense is knowing this situation will arise. It’s inevitable. Don’t waste your time trying to eliminate variation; your primary role is to manage the variation and limit the drift. Basic capacity planning can be really helpful in this regard, even something as simple as the sketch below.
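As an illustration only, here's what a bare-bones port-capacity check might look like. The switch names and the 80% threshold are invented; the point is simply to get warned before the aggregation layer is full, not after:

```python
# Toy capacity check: flag layers that are running out of ports.
inventory = {
    "agg-pair-A":   {"ports_total": 96, "ports_used": 94},
    "access-flr-4": {"ports_total": 48, "ports_used": 30},
    "access-flr-5": {"ports_total": 48, "ports_used": 47},
}

THRESHOLD = 0.80  # warn above 80% utilization

for switch, p in inventory.items():
    utilization = p["ports_used"] / p["ports_total"]
    if utilization >= THRESHOLD:
        print(f"{switch}: {utilization:.0%} of ports in use - plan expansion now")
```

In practice this data would come from your NMS rather than a hand-maintained dict, but even a spreadsheet-level view of port headroom would have flagged the aggregation pair before the fifth-floor request landed.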

 

Another solution is to use ‘generations’ of designs. The network is in constant flux but you can control it by trying to migrate from one standard design to the next. You can use naming schemes to distinguish between the different architectures, and use t-shirt sizes for different sized sites: S, M, L, XL. 

 

At any given time, you would ideally have two architectures in place, legacy and next-gen. Of course the ultimate challenge is to age-out old designs, but capacity and end-of-life drivers can help you build the business case to justify the next gen design.

 

But how do you regain control of that beast you created on the fifth floor? It’s useful to have documentation of negative user feedback, but if you can map and measure the performance of this network and show that impact, then you’ve got a really solid business case.

 

A report from a network performance tool showing loss, latency, and user pain, coupled with a solid network design, makes for a strong argument and justification for an upgrade investment.

Compared to logging, monitoring is nice and clean. In monitoring you are used to looking at data that is already normalized, so you basically have the same look and feel for statistics from different sources, e.g. switches, routers, and servers. Of course you will have different services and checks across the different device types, but interface statistics, for example, can be compared easily with each other. You are always looking at normalized data. In the logging world you are facing very different log types and formats. An interface that is down will look identical in the monitoring whether it is on the switch or on the connected server, but if you want to find the error message for that down interface in the logs of the switch and the server, you will find two completely different outputs. Even how you access the logs is different: on a switch it is usually an SSH connection and a show command, and on a Windows-based server an RDP session and the Event Viewer. Comparing the different logs with each other to find the root cause of a problem is mostly a manual and time-consuming process. Another problem is that many devices have only limited storage for logs or, even worse, lose all stored logs after a reboot. Sometimes, after an unexpected reboot of a device, you end up with nothing in your hands to figure out what caused the reboot.

 

We can do better by sending all the logs to a centralized logging server. It stores all log data independently of the origin, which also reduces the time needed for information gathering. Often, once all the logs are concentrated in one place, you will see that many devices have different timestamps on their log messages. To make the logs easily consumable, it is important that all log sources use the same time source and point to a synchronized NTP server. Once the centralization problem is solved, the biggest benefit comes from normalizing the log data into logical fields that are searchable. This is often done by a SIEM solution implemented to address the security aspect of logging, but I have seen a lot of SIEM projects where the centralized logging and normalization approach also improves troubleshooting capabilities significantly. With all the logs in the same place and format, you can find dependencies that are not visible in the monitoring. For example, I was facing periodic reboots on a series of modular routers. In the monitoring, all the performance graphs looked normal and the router answered all SNMP and ICMP based checks as expected, until it rebooted without any warning. So I looked into the log data and found that, 24 hours before each reboot, a "memory error" message showed up on all of the routers. Because the vendor needed some time to deliver a bug-fix release that addressed this issue, we needed proper alerting in the meantime. Every time the centralized logging server captured the memory error message that preceded the reboot, we created an alarm, so that we could at least schedule a manually initiated reboot in a time frame when it affected fewer users. That was a blind spot in the monitoring system, and sometimes we can improve alerting by combining logging with active checks. So, after you have found the root cause of your problem, ask yourself how you can prevent it from causing an outage the next time. When there is a possibility to achieve that with logging, it is worth the effort. You can start small and add more log messages over time that trigger events that are important to you.
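If your central log server has no built-in alerting, even a tiny watcher script can cover a gap like that while you wait for the vendor fix. Here is a minimal sketch of the idea; the file path, the regular expression, and the alert action are all assumptions, and in practice you would use your syslog/SIEM tool's native alerting instead:

```python
import re
import time

PATTERN = re.compile(r"memory error", re.IGNORECASE)   # hypothetical vendor message
LOGFILE = "/var/log/network/routers.log"               # centralized syslog destination

def follow(path):
    """Yield lines appended to a file, tail -f style."""
    with open(path, "r") as fh:
        fh.seek(0, 2)                                   # jump to the end of the file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(1.0)
                continue
            yield line

for line in follow(LOGFILE):
    if PATTERN.search(line):
        # Replace the print with an email, ticket, or webhook call in practice.
        print(f"ALERT: possible pre-reboot condition: {line.strip()}")
```

The value is not in the script itself but in the pattern: once a log message is known to precede an outage, turn it into an active alert instead of rediscovering it during the next post-mortem.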


I've been working with SQL Server since what seems like forever ++1. The truth is I haven't been a production DBA in more than 6 years (I work in marketing now, in case you didn't know). That means I will soon hit a point in my life where I will have been an ex-DBA for the same period of time as I was a production DBA (about seven years). I am fortunate that I still work with SQL Server daily and am consulted from time to time on various projects and performance troubleshooting. It helps keep my skills sharp. I also get to continue to build content as part of my current role, which is a wonderful thing, because one of the best ways to learn something is to try to teach it to others. All of this means that over the years I've been able to compile a list of issues that I would consider to be common with SQL Server (and other database platforms like Oracle; no platform is immune to such issues). These are issues that are often avoidable but not always easy to fix once they have become a problem. The trick for senior administrators such as myself is to help teams understand the costs, benefits, and risks of their application design options so as to avoid these common problems. So, here is my list of the top 5 most common problems with SQL Server.

 

Indexes

Indexes are the number one cause of problems with SQL Server. That doesn't mean SQL Server doesn't do indexes well; these days SQL Server does indexing quite well, actually. No, the issue with indexes and SQL Server has to do with how easy it is for users to make mistakes with regard to indexing. Missing indexes, wrong indexes, too many indexes, outdated statistics, or a lack of index maintenance are all common issues for users with little to no experience (what we lovingly call 'accidental DBAs'). I know, this area covers a LOT of ground. The truth is that with a little bit of regular maintenance a lot of these issues disappear. Keep in mind that your end-users don't get alerted that the issue is with indexing. They just know that their queries are taking too long, and that's when your phone rings. It's up to you to know and understand how indexing works and how to design proper maintenance, and even a simple scripted check like the sketch below is a start.
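A hedged sketch, in Python with pyodbc, of one such check: find nonclustered indexes that are written to far more often than they are read, using the index usage DMV. The connection string, server and database names, and the 10:1 threshold are placeholders, not recommendations:

```python
import pyodbc

QUERY = """
SELECT OBJECT_NAME(s.object_id)  AS table_name,
       i.name                    AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates            AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND i.index_id > 1;                       -- skip heaps and clustered indexes
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;DATABASE=Sales;"
    "Trusted_Connection=yes"
)
for table, index, reads, writes in conn.execute(QUERY):
    if writes > 0 and reads < writes / 10:  # arbitrary "rarely read" threshold
        print(f"Review {table}.{index}: {reads} reads vs {writes} writes")
```

Remember that these usage counters reset when the instance restarts, so treat the output as a conversation starter rather than a drop list.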

 

Poor design decisions

Everyone agrees that great database performance starts with great database design. Yet we still have issues with poor datatype choices, the use of nested views, lack of data archiving, and relational databases with no primary or foreign keys defined. Seriously. No keys defined. At all. You might as well have a bunch of Excel spreadsheets tied together with PowerShell, deploy them to a bunch of cluster nodes with flash drives and terabytes of RAM, and then market that as PowerNoSQL. You're welcome. It can be quite difficult to make changes to a system once it has been deployed to production, making poor design choices something that can linger for years. And that bad design often forces developers to make decisions that end up with...

 

Bad Code

Of course saying 'bad code' is subjective. Each of us has a different definition of bad. To me the phrase 'bad code' covers examples such as unnecessary cursors, incorrect WHERE clauses, and a reliance on user-defined functions (because T-SQL should work similar to C++, apparently). Bad code on top of bad design will lead to concurrency issues, resulting in things like blocking, locking, and deadlocks. Because of the combination of bad code on top of poor design there has been a significant push to make the querying of a database something that can be automated to some degree. The end result has been a rise in the use of...

 

ORMs

Object-Relational Mapping (ORM) tools have been around for a while now. I often refer to such tools as code-first generators. When used properly, they can work well. Unfortunately, they often are not used properly, with the result being bad performance and wasted resources. ORMs are so frequently a problem that it has become easy to identify them as the culprit. It's as if, instead of wiping their fingerprints from a crime scene, the ORM finds a way to leave fingerprints, hair, and blood behind, just to be certain we know it was them. You can find lots of blog entries on the internet regarding performance problems with ORMs. One of my favorites is this one, which provides a summary of all the ways something can go wrong with an ORM deployment.

 

Default configurations

It's easy to click 'Next, Next, OK' and install SQL Server without any understanding of the default configuration options. This is also true for folks who have virtualized instances of SQL Server, because there's a good chance the server admins also chose some default options that may not be best for SQL Server. Things like MAXDOP, tempdb configuration, transaction log placement and sizing, and default filegrowth are all examples of options that you can configure before turning the server over to your end users, and a few of them are easy enough to audit, as in the sketch below.
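A small sketch of that kind of audit, surfacing a few instance-level settings that are often left at their defaults. It reuses the same pyodbc connection style as the index example; which values are "right" depends entirely on your workload and hardware:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;Trusted_Connection=yes"
)
rows = conn.execute(
    "SELECT name, value_in_use FROM sys.configurations "
    "WHERE name IN ('max degree of parallelism', "
    "'cost threshold for parallelism', 'max server memory (MB)');"
)
for name, value in rows:
    print(f"{name}: {value}")

# A MAXDOP of 0 or a max server memory of 2147483647 MB usually means
# "still at the installer default" and deserves a conscious decision.
```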

 

The above list of five items is not scientific by any means; these are simply the problems that I find to be the most common. Think of them as buckets. When you are troubleshooting performance, or even reviewing a design, these buckets help you rule out the common issues and then sharpen your focus.

IT infrastructure design is a challenging topic. Experience in the industry is an irreplaceable asset to an architect, but closely following that in terms of importance is a solid framework around which to base a design. In my world, this is made clear by looking at design methodology from organizations like VMware. In the VCAP-DCD and VCDX certification path, VMware takes care to instill a methodology in certification candidates, not just the ability to pass an exam.

 

Three VCDX certification holders (including John Arrasjid, who holds the coveted VCDX-001 certificate) recently released a book called IT Architect: Foundation in the Art of Infrastructure Design, which serves exactly the same purpose: to give the reader a framework for doing high-quality design.

 

In this post, I’m going to recap the design characteristics that the authors present. This model closely aligns with (not surprisingly) the model found in VMware design material. Nonetheless, I believe it’s applicable to a broader segment of the data center design space than just VMware-centric designs. In a follow-up post, I will also discuss Design Considerations, which relate very closely to the characteristics that follow.

 

Design Characteristics

Design characteristics are a set of qualities that can help the architect address the different components of a good design. The design characteristics are directly tied to the design considerations which I’ll discuss in the future. By focusing on solutions that can be mapped directly to one (or more) of these five design characteristics and one (or more) of the four considerations that will follow, an architect can be sure that there’s actually a purpose and a justification for a design decision.

 

It’s dreadfully easy – especially on a large design – to make decisions just because they make sense at first blush. Unfortunately, things happen in practice that cause design decisions to have to be justified after the fact. And if it’s doing things correctly, an organization will require all design decisions to be justified before doing any work, so this bit is critical.

 

Here are the 5 design characteristics proposed by the authors of the book:

 

Availability – Every business has a certain set of uptime requirements. One of the challenges an architect faces is accurately teasing these out. Once availability requirements are defined, design decisions can be directly mapped to this characteristic.

 

For example, “We chose such and such storage configuration because in the event of a loss of power to a single rack, the infrastructure will remain online thus meeting the availability requirements.”

 

Manageability – This characteristic weighs the operational impacts that a design decision will have. A fancy architecture is one thing, but being able to manage it from a day-to-day perspective is another entirely. By mapping design decisions to Manageability, the architect ensures that the system(s) can be sustainably managed with the resources and expertise available to the organization post-implementation.

 

For example, “We chose X Monitoring Tool over another option Y because we’ll be able to monitor and correlate data from a larger number of systems using Tool X. This creates an operational efficiency as opposed to using Y + Z to accomplish the same thing.”

 

Performance – As with availability, all systems have performance requirements, whether they’re explicit or implicit. Once the architect has teased out the performance requirements, design decisions can be mapped to supporting these requirements. Here’s a useful quote from the book regarding performance: “Performance measures the amount of useful work accomplished within a specified time with the available resources.”

 

For example, “We chose an all-flash configuration as opposed to a hybrid configuration because the performance requirements mandate that response time must be less than X milliseconds. Based on our testing and research, we believe an all-flash configuration will be required to achieve this.”

 

Recoverability – Failure is a given in the data center. Therefore, all good designs take into account the ease and promptness with which the status quo will be restored. How much data loss can be tolerated is also a part of the equation.

 

For example, “Although a 50 Mbps circuit is sufficient for our replication needs, we’ve chosen to turn up a 100 Mbps circuit so that the additional bandwidth will be available in the event of a failover or restore. This will allow the operation to complete within the timeframe set forth by the Recoverability requirements.”

 

Security – Lastly – but certainly one of the most relevant today – is Security. Design decisions must be weighed against the impact they’ll have on security requirements. This can often be a tricky balance; while a decision might help improve results with respect to Manageability, it could negatively impact Security.

 

For example, “We have decided that all users will be required to use two-factor authentication to access their desktops. Although Manageability is impacted by adding this authentication infrastructure, the Security requirements can’t be satisfied without 2FA.”

 

Conclusion

I believe that although infrastructure design is as much an art as it is a science – as the name of the book I’m referencing suggests – leveraging a solid framework or lens through which to evaluate your design can help make sure there aren’t any gaps. What infrastructure design tools have you leveraged to ensure a high-quality product?

Think about your network architecture. Maybe it's something older that needs more attention. Or perhaps you're lucky enough to have something shiny and new. In either case, the odds are very good that you have a few devices in your environment that you just can't live without. Maybe it's some kind of load balancer or application delivery controller. Maybe it's an IP Address Management (IPAM) device that was built ages ago but hasn't been updated in forever.

The truth of modern networks is that many of them rely on devices like this as a lynchpin to keep important services running. If those devices go down, so too do the services that you provide to your users. Gone are the days when a system could just be powered down until the screaming started. Users have come to rely on the complicated mix of products in their environment to perfect their workflows to do the most work possible in the shortest amount of time. So how can these problem devices be dealt with?

Know Your Enemy

First and foremost, you must know about these critical systems. You need to have some kind of monitoring system in place that can see when these devices are performing at peak efficiency or when they aren't doing so well. You need to have a solution that can look outside simple SNMP strings and give you a bigger picture. What if the hard drive in your IPAM system is about to die? What about when the network interface on your VPN concentrator suddenly stops accepting 50% of the traffic headed toward it? These are things you need to know about ASAP so they can be fixed with a minimum of effort.

Your monitoring solution should help you keep track of these devices while giving you plenty of options for alerts. A VPN concentrator isn't going to be a problem if it's offline during the workday. But if it goes down the night before the quarterly reports are due, the CFO is going to be calling and needing answers. Make sure you can configure your device profiles with alert patterns that give you a chance to fix things before they become problems. Also make sure that the alerts help you keep track of the individual pieces of the solution, not just the up or down status of the whole unit.

Be Ready To Replace

The irony of being stuck with these "problem children" types of devices is that they are the ones that you want to replace more than anything but can't seem to find a way to remove. So how can you advocate for the removal of something so critical?

The problem with these devices is not that the hardware itself is indispensable. It's that the service the hardware (or software) provides is critical. Services can be provided in many different ways. So long as you know what service is being provided, you can create an upgrade path to remove hardware before it gets to the "problem child" level of annoyance.

Most indispensable services and devices get that way because no one is keeping track of who is using them or how they are being used. Workflows created to accomplish a temporary goal often end up becoming a permanent fixture. It's important to keep a record of all the devices in your network and know how often they are being used. Regularly update that list to know what has been recently accessed and for how long. If the device is something that is scheduled to be replaced soon, a preemptive email about the service change will often find a few laggard users that didn't realize they were even using the device. That will help head off any calls after it has been decommissioned and retired to a junk pile.

Every network has problem devices that are critical. The trick to keeping them from becoming real problems lies less in trying to do without them and more in knowing how they are performing and who is using them. With the right solutions in place to keep a wary eye on them, and a plan in place to replicate and eventually replace the services they provide, you can sleep a bit better at night knowing that your problem children will be a little less problematic.

Welcome to June! 2016 is almost half over so now might be a good time to go back to those resolutions from New Year's Day and make some edits so you won't feel as awful when December comes around.

 

Anyway, here is this week's list of things I find amusing from around the internet...

 

LinkedIn's Password Fail

Let's turn lemons into lemonade here and look at it this way: LinkedIn found a way to get everyone to remember LinkedIn still exists!

 

Massive Infographic of Star Wars A New Hope

WARNING: DO NOT CLICK UNLESS YOU HAVE TIME TO KILL AND TIME TO SCROLL!

 

Microsoft Cuts More Jobs In Troubled Mobile Unit

After reading this I had but one question: Microsoft still has a mobile unit? I've never met a non-Microsoft employee with a Windows Phone. Never.

 

Microsoft bans common passwords that appear in breach lists

This is wonderful to see, and a great use of taking data they are collecting and using it in a positive way for the benefit of their customers.

 

The Mind-Boggling Pace of Computing

The PS4 will have 150x the computing power of IBM's Deep Blue. Eventually there will come a time when technology will become obsolete the second it appears.

 

China's 'Air Bus' Rides Above The Traffic

I don't see this as a practical upgrade to public transportation, but I do give it high marks for creativity.

 

Ad blocking: reaching a tipping point for advertisers, publishers and consumers

I do not understand why anyone is surprised that consumers don't want to see annoying ads. The content creators should be finding ways to engage me in a way that I want to see their content.

 



 

Some of the best conversations I've had at trade shows started with a button.

 

People swing by our booth at VMWorld, MS:Ignite, or CiscoLive (psssst! We'll be there again this coming July 10-14), and their eye is caught by a button proclaiming:

[button image: I_am_root.jpg]

 

or a sticker that says

[sticker image: Your_F1.jpg]

 

...and they have to have it. And then they have to talk about why they have to have it. And before you know it, we're having a conversation about monitoring, or configuration management, or alert trigger automation.

 

And that's what conventions are all about, right? Finding common ground, connecting with similar experiences, and sharing knowledge and ideas.

 

It has gotten to the point where (and I swear I'm not making this up) people rush up to the booth to dig through our buckets of buttons and piles of stickers, looking for the ones they don’t have, yet.

 

Here’s a secret: One of my favorite things about working at SolarWinds is that we have whole meetings dedicated to brainstorming new phrases. I get a rush of excitement when something I suggested is going to be on the next round of convention swag.

 

And for those of us who have the privilege of going to shows and seeing people eagerly snatch up their new prize? The pride we feel when OUR submission is the thing that is giving attendees that burst of joy is akin to watching your baby take her first steps.

 

We want to share that experience with you, our beloved THWACK community.

 

Now this is nothing new. We’ve made requests for slogans before, most recently here and here. The difference is that we’re turning this into an ongoing “campaign” on Thwack.

 

Submit your slogan suggestions for buttons, stickers, or even t-shirts here: https://thwack.solarwinds.com/community/solarwinds-community/fun-and-geeky. Once a quarter, a top-secret panel of judges will convene, review the entries, and pick the top 5.

 

If one of your submissions is chosen to be immortalized, you will receive – along with immeasurable pride and boundless joy – 1,000 THWACK points. Not only that, but you will earn your slogan as a badge in your THWACK profile. Best of all, you can submit and win as many times as your creativity and free time at work will allow.

 

So let those creative juices flow and unleash your enthusiasm for all things IT and geeky. We can't wait to see what you come up with.
