In 2015, I introduced the DART Framework as a series of skills that virtualization administrators can leverage to master their virtual universe. In 2016, it’s time to take your IT career flight beyond the final frontiers of your virtual universe. Just do it with the SOAR Framework: [Updated with hyperlinks to SOAR articles.]

  • Secure - Govern, control data, app, stack, and user planes
  • Optimize - Run more efficiently & effectively
  • Automate - Scale IT
  • Report - Show & tell to the leadership team


SOAR-ing skill set

Security should be top of mind for every IT professional – we are all responsible for security ops, whether directly or indirectly. Securing IT delves into governance, compliance, and control of the data, application, stack, and user planes.


Optimization boils down to maximizing the return on investment (ROI) of IT. We all get that IT budgets are getting squeezed or being diverted into new investment areas. Optimizing allows IT professionals to do more with less in their data center environment. If done well, it highlights command and control over any given data center ecosystem and opens the door to many new career opportunities.


Automation is the skill that allows IT professionals to scale both their data center and their career aspirations. Whether it’s through scripts, workflows, templates or blueprints, automation is a skill that reclaims the most important resource for any IT pro – time.


Reporting is the least glamorous IT skill, but it’s the one that will most likely get you promoted. Essentially, it revolves around communicating how good a job you are doing managing your data center efficiently, or making your case for the tools you need to deliver what the business needs.


February S.O.A.R.s

Every Friday in February, I will publish an article on a specific SOAR skill, with an example of what good looks like in a virtual environment. P.S. These skills can be applied to any tech construct and tech domain. Time to shatter the shackles of silos!



If you follow Geek Speak, you're probably aware of the annual "Head Geek IT Pro-Dictions." This year, we thought it would be fun to have the Head Geeks revisit last year's predictions. Last year certainly offered countless technology trends and topics to discuss, deliberate, dissect, and debate.

But were the Geeks right or wrong with their predictions? Check out the video below, as the Head Geeks discuss whether they deserve cheers for their tech predictions or an unfortunate series of jeers. And if you haven't seen this year's IT Pro-Dictions, be sure to check them out at the end of the video to find out what the Geeks believe will be the top ten industry drivers of 2016.

As always, your comments are welcomed, so feel free to chime in and have some fun with the Head Geeks!







January 28 is Data Privacy Day (DPD). Observed since 2008, DPD brings awareness to the importance of data privacy and protection. According to the Verizon 2015 Data Breach Investigations Report, about 60% of cyber attackers are able to compromise an organization’s data within minutes. 2016 is going to be no different from a threats perspective, and data thefts are bound to happen. However, you can minimize the possibility of a cyberattack or data privacy incident by strengthening network security and following some simple security tips.


Centralize monitoring and control: Continuously monitor your network and get a centralized view of the hundreds of security incidents happening in real time. This is one of the most basic requirements if your organization must follow industry-standard compliance regulations like HIPAA, PCI DSS, etc.

Embrace data-driven forensics: Data-driven analysis of a suspicious event will result in better root cause analysis and forensics. A suspicious event can be as trivial as an increase in Web traffic from a known host during specific non-business hours over the last seven days, or repeat connection requests to critical assets (servers, databases, etc.) from an unknown host outside the network. Considering the worst case scenario that an attack has happened, you must be able to trace it back to the source, establish an audit trail, and document the findings and the action taken.
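
To make that concrete, here is a toy sketch of this kind of data-driven check. It scans a hypothetical CSV export of flow records (columns: hour, src_host, bytes) and flags off-hours transfers far above a host's norm; the file name and columns are assumptions, not any product's real format.

    import csv
    from collections import defaultdict
    from statistics import mean, stdev

    records = list(csv.DictReader(open("flows.csv")))  # hypothetical export

    # Build each source host's byte-count history
    history = defaultdict(list)
    for r in records:
        history[r["src_host"]].append(int(r["bytes"]))

    # Flag off-hours records more than 3 standard deviations above the host's mean
    for r in records:
        host, hour, sent = r["src_host"], int(r["hour"]), int(r["bytes"])
        samples = history[host]
        if 8 <= hour < 18 or len(samples) < 3:
            continue  # skip business hours and hosts with too little history
        if sent > mean(samples) + 3 * stdev(samples):
            print(f"investigate {host}: {sent} bytes at hour {hour}")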

Watch out for malicious software: A term we may see more often in 2016 is ransomware. Sensitive data is the main driver behind these types of malicious software penetrating the network, and a regular user can become an unsuspecting victim of such an attack, spreading it to other computers and applications inside the network. Though anti-virus and anti-malware software can be installed to protect systems, you should also put processes in place that will alert you to suspicious application and file activities. Remember that subtle file and registry changes are hard to detect without file integrity monitoring tools, and zero-day malware attacks exploit exactly that gap.
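
As a back-of-the-napkin illustration of the file integrity idea, here is a minimal baseline-and-compare sketch using only the Python standard library. Real file integrity monitoring tools do far more (registry watching, real-time hooks, tamper-resistant storage), so treat this purely as a sketch of the concept; the baseline file name is an arbitrary choice.

    import hashlib
    import json
    import os
    import sys

    BASELINE = "fim_baseline.json"  # arbitrary location for the stored hashes

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def snapshot(root):
        """Hash every file under root."""
        return {os.path.join(d, name): sha256(os.path.join(d, name))
                for d, _, files in os.walk(root) for name in files}

    # Usage: python fim.py baseline /etc   (record), later: python fim.py check /etc
    mode, root = sys.argv[1], sys.argv[2]
    if mode == "baseline":
        json.dump(snapshot(root), open(BASELINE, "w"))
    else:
        before = json.load(open(BASELINE))
        after = snapshot(root)
        for path in sorted(set(before) | set(after)):
            if before.get(path) != after.get(path):
                print("changed, added, or removed:", path)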

Educate your users/colleagues: Patient records and credit card information are obviously critical data. However, other data, such as social security numbers, ATM passcodes, and bank account names stored on an unprotected desktop or document, also creates a prime opportunity for private data leaks. Periodic mailers and knowledge sharing among peers and with users can measurably improve your organization’s security.


You can learn more about Data Privacy Day here.


Do you think it’s time to stop, think, and formulate an effective data privacy policy for your organization? What plans do you have to improve data privacy in your organization in 2016? What roadblocks do you foresee that will stop or slow you down from implementing some right away? Write in and let me know.

I read an interesting thread the other day about a network engineer who tried to use an automated tool to schedule a rolling switch upgrade. He realized after it completed and the switches restarted that he had used the wrong image for the devices, and they weren't coming back up. It was about fifty switches in total, which resulted in a major outage for his organization.

What struck me about the discussion thread was, first, that he wondered why the tool didn't stop him from making that mistake. The second was that the commenters responded that it wasn't the tool's job to sanity check his inputs. The end result was likely a severe disciplinary conversation for the original engineer.

Tools are critical in network infrastructure today. Things have become far too complicated for us to manage on our own. SolarWinds makes a large number of tools to help us keep our sanity. But is it the fault of the tool when it is used incorrectly?

Tools are only as good as their users. If you smash your fingers over and over again with a hammer, does that mean the hammer is defective? Or is the fact that you're holding it wrong in play? Tools do their jobs whether you're using them correctly or incorrectly.

2016 is the year when we should all stop blaming the tools for our problems. We need to reassess our policies and procedures and find out how we can use tools effectively. We need to stop pushing all of the strange coincidences and problems off onto software that only does what it's told to do. Software tools should be treated like the tools we use to build a house. They are only useful if used properly.

Best practices should include sanity checking things before letting the tool do the job. A test run of an upgrade on one device before proceeding to do twenty more. A last-minute write-up of the procedure before implementing it. Checking the inputs of a monitoring tool before swamping the system with data. Tapping the nail a couple of times with the hammer to make sure you can pull your fingers away before you start striking.
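
To make the first of those concrete, a pre-flight wrapper around an upgrade tool can be just a few dozen lines. Here is a minimal sketch of the idea; the inventory, model name, and known-good hash are hypothetical stand-ins for whatever your environment actually provides.

    import hashlib

    # Hypothetical known-good images: model -> SHA256 of the correct image file
    KNOWN_GOOD = {
        "model-x-48p": "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
    }

    def preflight(switches, image_path, target_model):
        """Refuse to start a rolling upgrade unless every sanity check passes."""
        digest = hashlib.sha256(open(image_path, "rb").read()).hexdigest()
        if digest != KNOWN_GOOD.get(target_model):
            raise SystemExit("image hash does not match the known-good image; stopping")
        wrong = [s["name"] for s in switches if s["model"] != target_model]
        if wrong:
            raise SystemExit(f"{len(wrong)} switches are not {target_model}; stopping")
        return switches[:1]  # canary: upgrade one switch, verify it boots, then continue

    inventory = [{"name": "sw01", "model": "model-x-48p"},
                 {"name": "sw02", "model": "model-x-48p"}]
    canary = preflight(inventory, "image.bin", "model-x-48p")
    print("safe to start with canary:", canary[0]["name"])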

It's a new year full of new resolutions. Let's make sure that we stop blaming our tools and use them correctly. What do you think? Will you resolve to stop blaming the tools? Leave me a comment with some of your biggest tool mishaps!

Will this be the year of Infrastructure-As-Code (Infra-As-Code or IAC) becoming more mainstream? Or is this just a wishful journey that will never catch on? Obviously, this is not a new thing, but how many companies have adopted this methodology? Better yet, how many companies have even begun discussing adopting it?

Do you currently write scripts, save them somewhere, and think, “Hey, I/we are doing Infra-As-Code already”? If so, not quite. Infra-As-Code is much larger and more dynamic than just writing scripts in a traditional, static way. But if you do currently utilize scripts for infrastructure-related tasks and configurations, you are still much better off than those who have not begun this journey at all. The reason is that an automated, programmatic approach to configuring your infrastructure is far more predictable and consistent than a manual, error-prone one.

These components can be of numerous types: servers, network routers and switches, storage components, and much more. For this series of posts, though, we will focus on the network components and how we can begin the journey toward Infra-As-Code.


Below are some of the topics that we will be covering over the series of posts.



So what does Infra-As-Code really mean? Let’s address that here in this post and build a good foundation of what it really does mean.


If you begin to treat your infrastructure as code, you can begin to develop processes that allow you to deliver a fully tested, repeatable, consistent, and deliverable methodology for configuration state in your environment. In doing this, you can begin looking at these items as a pipeline of continuous delivery, with automation tooling consistently bringing your infrastructure to the desired state you have defined. As you begin this journey, you will start with a baseline and then a desired state. Adopting automation tooling suited to your environment, along with version control, code review, and peer reviews, will allow for much more stable and speedy deployments, as well as easier rollbacks in the off chance that something does not go as planned. But the chance of having to roll back should be minimal, assuming proper testing of configurations has been followed throughout your delivery pipeline.
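
To ground the desired-state idea, here is a toy Python sketch. The VLAN data is invented purely for illustration; real tooling would read actual state from the device and push the delta through the device's API.

    # Desired state, as it might be declared in version control
    desired = {"vlan10": {"name": "users"}, "vlan20": {"name": "voice"}}

    # Actual state, as read back from a device (invented values)
    actual = {"vlan10": {"name": "users"}, "vlan30": {"name": "legacy"}}

    to_add = {k: v for k, v in desired.items() if k not in actual}
    to_remove = [k for k in actual if k not in desired]
    to_update = {k: v for k, v in desired.items()
                 if k in actual and actual[k] != v}

    # An automation tool applies only the delta, and the run is repeatable:
    # applying it a second time produces no further changes (idempotence).
    print("add:", to_add, "remove:", to_remove, "update:", to_update)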


I know this all sounds great (on paper) and feels a little unrealistic in many respects, but in the next post we will begin to discover how we can get started. Hopefully, by the end of this series, you will have a much better understanding and a more realistic view of how you, too, can begin the Infra-As-Code journey!

Remember, back in the day, when you’d go to a website and it was down? Yes, down. We’ve come a long way in a short time. Today it’s not just down that’s unacceptable: users get frustrated if they have to wait more than three seconds for a website to load.


In today’s computing environments, slow is the new down. A slow application in a civilian agency means lost productivity, but a slow military application in theater can mean the difference between life and death. Due to a constantly increasing reliance on mission-critical applications, the government must now meet, and in most cases surpass, the high performance standards being set by commercial industry, and the stakes continue to get higher.


Most IT teams focus on the hardware, after blaming and ruling out the network, of course. If an application is slow, the first thought is to add hardware to combat the problem: more memory, faster processors, an upgrade to SSD storage, etc. Agencies have spent millions of dollars throwing hardware at application performance issues without a good understanding of the bottleneck actually slowing down an application.

However, according to a recent survey on application performance management by research firm Gleanster, LLC, the database is the number one source of application performance issues; in fact, 88 percent of respondents cite the database as the most common challenge or issue with application performance.


Trying to identify database performance issues poses several unique challenges:

  • Databases are complex. Most people think of a database as a mysterious black box of secret information and are wary of digging too deep.
  • There are a limited number of tools that assess database performance. Tools normally assess the health of a database (is it working, or is it broken?) and don’t identify and help remediate specific database performance issues.
  • Database monitoring tools that do provide more information don’t go that much deeper. Most tools send information in and collect information from the database, with little to no insight about what happens inside the database that can impact performance.

To successfully assess database performance and uncover the root cause of application performance issues, IT pros must look at database performance from an end-to-end perspective.


In a best-practices scenario, the application performance team should be performing wait-time analysis as part of their regular application and database maintenance. A thorough wait-time analysis looks at every level of the database—from individual SQL statements to overall database capacity—and breaks down each step to the millisecond.
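
As one concrete illustration of where that data can come from, here is a sketch that pulls the top wait types, assuming a SQL Server instance and the pyodbc driver; other platforms expose similar views (Oracle's V$SYSTEM_EVENT, for example), and the server name is a placeholder.

    import pyodbc

    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=dbhost;DATABASE=master;Trusted_Connection=yes")
    rows = conn.execute("""
        SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
        FROM sys.dm_os_wait_stats
        WHERE wait_time_ms > 0
        ORDER BY wait_time_ms DESC""").fetchall()

    # The wait types that dominate total wait time are where tuning effort pays off
    for wait_type, ms, waits in rows:
        print(f"{wait_type:<35} {ms / 1000:>12.1f}s across {waits} waits")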


The next step is to look at the results, then correlate the information and compare. Maybe the database spends the most time writing to disk; maybe it spends more time reading memory.


Ideally, all federal IT shops should implement regular wait-time analysis as a baseline of optimized performance. Knowing how to optimize performance—and understanding that it may have nothing to do with hardware—is a great first step toward staying ahead of the growing need for instantaneous access to information.

Read an extended version of this article on GCN

With its ongoing effort toward a Joint Information Environment, the Defense Department is experiencing something that’s extremely familiar to the enterprise world: a merger. The ambitious effort to consolidate communications, services, computing and enterprise services into a single platform is very similar to businesses coming together and integrating disparate divisions into a cohesive whole. Unlike a business merger, however, JIE will have a major impact on the way DOD IT is run, ultimately providing better flow of, and access to, information that can be leveraged throughout all aspects of the department.


When JIE is complete, DOD will have a single network that will be much more efficient, secure and easier to maintain. IT administrators will have a holistic view of everything that’s happening on the network, allowing them to pinpoint how one issue in a specific area can not only be detrimental to that portion of the network but also how it impacts other areas.


The JIE’s standard security architecture also means that IT managers will be able to more easily monitor and corner potential security threats and respond to them more rapidly. The ability to do so is becoming increasingly important, as is evidenced by our recent survey, which illustrated the rise of cybersecurity threats.


As DOD kicks the JIE process into high gear, they are establishing Joint Regional Security Stacks (JRSS) which are intended to increase security and improve effectiveness and efficiency of the network. However, the network will still be handling data from all DOD agencies and catering to thousands of users, making manual network monitoring and management of JRSS unfeasible. As such, IT pros will want to implement Network Operations (NetOps) processes and solutions that help support the efforts toward greater efficiency and security.


The process should begin with an assessment of the current NetOps environment. IT pros must take inventory of the monitoring and management NetOps tools that are currently in use and determine if they are the correct solutions to help with deploying and managing the JIE.


Network managers should then explore the development of a continuous monitoring strategy, which can directly address DOD’s goals regarding efficiency and security.


Three key requirements to take into account in planning for continuous monitoring in JIE are:


  • Optimization for dual use. Continuous network monitoring tools, or NetOps tools, can deliver different views of the same IT data while providing insight into, and visibility of, network health and performance. When continuous monitoring is implemented with “dual use” tools, they can serve two audiences simultaneously.
  • Understanding who changed what. With the implementation of JIE, DOD IT pros will be responsible for an ever-expanding number of devices connected to the network; configuration management tools that track who changed what also enable bulk change deployment to thousands of devices.
  • Tracking the who, what, when and where of security events. Security information and event management (SIEM) tools are another particularly effective component of continuous monitoring, and their emphasis on security could make them an integral part of monitoring JRSSs. SIEM capabilities enable IT pros to gain valuable insight into who is logging onto DOD’s network and the devices they might be using, as well as who is trying to log in but being denied access.


Like any merger, there are going to be stumbling blocks along the way to the JIE’s completion, but the end result will benefit many – including overworked IT pros desperate for greater efficiency. Because while there’s no doubt the JIE is a massive undertaking, managing the network that it creates does not have to be.


To read an extended version of this article, visit Defense Systems

Well, here we are again, at the start of a new year. This is the time for everyone to list out their goals and resolutions, and to be reminded of how miserably they failed at such things over the previous year.

Here are my resolutions for 2016. You're welcome.

Lose wait

We all hate to wait.

We hate waiting for traffic, for the next episode of Sherlock, or for Cleveland to have a winning football team.

You know what else we hate waiting for? Data. We hate waiting for a database to return results from simple queries and reports. We hate not knowing why a report has not finished. And we hate waiting for the database administrator to fix things.

So this year I’m going to take the time to understand more about what my queries are waiting for. I’m going to take the time to learn about wait events, resource bottlenecks, and the options available to help tune the queries as needed.

Sort out my inbox

Application systems become more complex with each passing day, and they require additional monitoring along the way. The end result is an inbox full of alerts.

I’m tired of the clutter, and so are you.

So in 2016, I will find a way to make sure that I am only being alerted for things that require action. I will start by digging into the alert system to see if I can find out why the alert in question was generated. Then I can start being proactive in my work so that the amount of time needed to react to alerts is minimized.

I know that one hour spent weekly in a proactive way can save me up to three hours I now spend weekly in reactive mode. That’s a lot of extra time I can better spend looking at pictures of cats and arguing with strangers on the Internet.

Be nicer to my coworkers

Just putting the word DevOps and emojis into emails isn’t enough. In 2016, I’m going to find a way to use monitoring tools to help facilitate communications. I will start to communicate ideas using reports based on the data our tools are collecting.

And by doing so I will help reduce the number of “blamestorming” meetings that happen frequently right now.

Now, these are my resolutions for 2016; feel free to use them for yourself if you want. But I’d encourage you to think about your own resolutions for the upcoming year. Think about the ways you can make things better for yourself and for others, and then put those resolutions to work. How? By entering them in the 2016 IT Resolutions contest.

Meanwhile, I want to hear your thoughts about my list, your list, or your IT plans for 2016 in general, in the comments below.


Here in these early days of January, it feels the same way weekends did on Saturday mornings when I was 8 years old—a giant bowl of Sugar Frosted Choco-Bombs in my lap, cartoons on TV, and hour after hour of joyful opportunity spread out in front of me.


However, in the years since I was 8, I have learned a few things:

  • Don't turn up the volume on the TV before it turns on, or you'll wake up your parents.
  • 2 bowls of cereal is awesome; 4 is too many.
  • Carry the milk with both hands even when you are sure you can do it with one.
  • Make plans or all those hours disappear before you know it. Then it's Monday again and you are explaining to Mrs. Tabatchnik why the answer for all your math homework problems is 12.


In the spirit of that last point, making plans, now is the perfect time to set some goals. One might even call them "resolutions": things that should be on your 2016 bucket list. Here are 4 suggestions for yours.


Turn off the noise
My first 2016 resolution suggestion is pure #MonitoringGlory. Nobody wants to get an automated ticket, email, or text for something that isn't actually a problem, whether it comes in the middle of the day or at 2 am. Resolve to spend some quality time with your alert triggers and their results. Does the trigger logic identify a real, measurable, actionable problem, or is it an "FYI alert" that merely pesters an actual human to go check and see if something is ACTUALLY wrong? Now dig into the results over the last year. Did this alert generate storms of alerts? Almost none at all? What did people do when the alert came in?


All of these questions will help you create a better, more meaningful alert. This leads to the recipient of the alert believing it more, which leads to better responsiveness.
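
If your alerting system can export its history, even a crude analysis answers most of those questions. Here is a sketch in Python, assuming a hypothetical CSV export with alert_name, fired_at, and acknowledged columns:

    import csv
    from collections import Counter

    fired, acted_on = Counter(), Counter()
    with open("alert_history.csv") as f:  # hypothetical export
        for row in csv.DictReader(f):
            fired[row["alert_name"]] += 1
            if row["acknowledged"] == "true":
                acted_on[row["alert_name"]] += 1

    # Alerts that fire constantly but are almost never acted on are candidates
    # for better trigger logic (or deletion).
    for name, count in fired.most_common(20):
        rate = acted_on[name] / count
        flag = "  <- probable noise" if rate < 0.10 else ""
        print(f"{name:<40} fired {count:>5}x, acted on {rate:.0%}{flag}")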


Enable IPv6

In the 20 years since the protocol was released, IPv6 has netted only 10% adoption in the workplace. With the oncoming storm of SDN, IoT, and BYOE—not to mention the general growth of networks and network complexity—there are alarming security risks in NOT understanding what is and isn't IPv6-enabled in your environment (and what it's doing). Finally, with the not-so-modest gains to be made with IPv6 in the area of clustered servers, domain controllers, multicast, and more, this is the time to get in front of the curve and start planning, and even implementing, IPv6.


Commit to learning and testing now so you aren't under the gun when it's really crunch time.
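
One low-stakes way to start testing: check which of your hostnames actually publish IPv6 (AAAA) addresses. A standard-library Python sketch, with placeholder names:

    import socket

    hosts = ["www.example.com", "intranet.example.com"]  # placeholder names
    for host in hosts:
        try:
            infos = socket.getaddrinfo(host, 443, socket.AF_INET6, socket.SOCK_STREAM)
            addrs = sorted({info[4][0] for info in infos})  # unique v6 addresses
            print(f"{host}: AAAA -> {', '.join(addrs)}")
        except socket.gaierror:
            print(f"{host}: no IPv6 address published")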


Commit to security

In the same vein, your IT resolutions should include at least one security-related commitment. Maybe you make friends with the audit team for once. Maybe you scan your network device configurations and see if they meet SOX or DISA STIG standards. Maybe you use NetFlow or Deep Packet Inspection to identify the types of traffic on your network (as well as the source and destination of that traffic).


Heck, even just choosing and using a password manager for your own personal accounts would be a great start. If for no other reason than it would get you thinking about all the OTHER users in your organization and how they are managing their passwords. Which, as we saw throughout 2015, was the first line of defense to fail in every major breach.


Whatever it is, don't let security be someone else's responsibility this year.


Know the value of your monitoring

Coming back around to monitoring for my last point: commit to taking the time to understand what monitoring provides you. What I mean by that is, every time a specific alert triggers, what have you saved in terms of minutes of outage, staff time, and/or predictive vs. reactive repair costs?


Calculating this may be time consuming, but it's not complex, as I've described in the past.
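
The arithmetic itself fits in a few lines. Here is a sketch where every input is an assumption you would replace with your own numbers:

    # Yearly value of one alert vs. its upkeep; all inputs are assumptions.
    fires_per_year = 24              # times the alert caught a real problem
    outage_minutes_saved = 45        # outage minutes avoided per catch
    cost_per_outage_minute = 200.0   # dollars of lost productivity/revenue
    reactive_hours_saved = 1.5       # firefighting hours avoided per catch
    staff_rate = 60.0                # dollars per hour
    upkeep_hours = 10                # yearly hours to tune and maintain the alert

    value = fires_per_year * (outage_minutes_saved * cost_per_outage_minute
                              + reactive_hours_saved * staff_rate)
    cost = upkeep_hours * staff_rate
    print(f"value ${value:,.0f}/yr vs. cost ${cost:,.0f}/yr -> ROI {value / cost:.1f}x")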


Why should this be on your 2016 resolutions? Because it helps you identify which tools, monitors, and alerts cost your company (in time to create, maintain, and respond to) more than they are worth, and which have a high return on investment. Not only that, but doing this for existing monitors helps you evaluate which of your upcoming requests is most worth digging into.


Finally, having these numbers handy gives you the ammunition you need to face the bosses and bean counters when you request additional licenses—they need a justification.


Because I don't know about where you work, but it feels like my whole management and purchasing team is related to old Mrs. Tabatchnik.

Those are MY recommendations for what you should have on your 2016 IT resolutions, but your list probably looks a lot different. You really should put those resolutions to work. How? By entering them in the 2016 IT Resolutions contest. Meanwhile, I want to hear your thoughts about my list, your list, or your IT plans for 2016 in general, in the comments below.

Network monitoring has historically relied on SNMP as a primary means of gathering granular statistics. SNMP works on a pull model: the network monitoring station reaches out, pulls values from OIDs, and then reacts to that data. There are also monitoring options where a network device pushes statistics to data collectors such as network management stations, flow collectors, or syslog engines.
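
For the pull side, here is a minimal polling sketch, assuming the pysnmp package and a placeholder hostname; it fetches sysUpTime, a standard MIB-2 OID:

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    # One SNMP GET: the monitoring station asks, the device answers
    error, status, index, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),                # SNMPv2c
        UdpTransportTarget(("switch1.example.com", 161)),  # placeholder host
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),   # sysUpTime.0
    ))
    if error or status:
        print("poll failed:", error or status.prettyPrint())
    else:
        for oid, value in var_binds:
            print(oid.prettyPrint(), "=", value.prettyPrint())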

In researching Ethernet switches, I've run across the term telemetry being used to describe datasets coming from these devices. Vendors are positioning telemetry as if it were some new feature that you need to be on the lookout for.

So, is telemetry something new? In digging through vendor literature, watching presentations, and talking to one of my vendor contacts specifically, I’ve concluded that network telemetry represents both old and new forms of network statistics, and new ways of gathering and exposing data.

First, in presentations, it's clear that some networking vendors use the term “telemetry” generically. As they work through their demo, they display, for example, sFlow and syslog data. Those are not new data formats to network engineers. We know flow data in its various formats, including sFlow. We also know syslog well. And we also know that those formats typically contain information pushed to our data collectors in real-time or near real-time.

However, I do think that for some vendors, telemetry is more than a fancy way to describe the same old data. For instance, Juniper Networks shared several facts with me about their Junos Telemetry Interface that are a bit different than what network engineers might be used to. Here are the more relevant points:

  • Junos telemetry is streamed in a push model, like syslog or flow data.
  • Juniper uses Google's Protobuf message format to stream the data. Protobuf is interesting: the big idea, according to Google, is to define your message format and fields, and then compile code optimized to read that data stream. This means that Juniper doesn't have to shoehorn telemetry into a format that might be ill-suited to the data. They can build their own structured message format, optimize it however they like, and extend it as they go (see the sketch after this list).
  • Juniper is not exposing every conceivable value via their telemetry interface (which proprietary SNMP vendor MIBs tend to do). Rather, they’ve focused on performance management data: I/O & error counters, queue statistics, and so on.
  • The Junos telemetry interface is open to anyone that wants to parse the data. Therefore, any vendor that wishes to create a custom application for end users could work with Juniper, get the data format details, and go to town.
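
To make the Protobuf point concrete, here is a generic sketch of the idea, emphatically not Juniper's actual schema: you define the message once, compile it, and the generated code decodes the stream without any guessing.

    # telemetry.proto (compiled with: protoc --python_out=. telemetry.proto)
    #   syntax = "proto3";
    #   message IfStats {            // invented example message
    #     string interface = 1;
    #     uint64 in_octets = 2;
    #     uint64 out_octets = 3;
    #     uint64 errors = 4;
    #   }
    import telemetry_pb2  # module generated by protoc from the file above

    def handle(datagram: bytes):
        """Decode one pushed telemetry message and hand it to the collector."""
        msg = telemetry_pb2.IfStats()
        msg.ParseFromString(datagram)  # fast, schema-driven decode
        print(msg.interface, msg.in_octets, msg.out_octets, msg.errors)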

Other vendors that come up when talking about telemetry include Cisco with their ACI data center fabric, and Arista with the telemetry interface in their EOS operating system. While I don't have specific details on how Cisco and Arista telemetry interfaces might differ from the Junos telemetry interface, they all seem to emphasize the near real-time pushing of descriptive network data to a collector that can aggregate the data and present it to a network operator.

So whether the term telemetry is being used generically to mean "data from the network" or specifically to mean "pushing specific network metrics to a data collector," I believe it's a term we're going to see used more and more.

While the data gathered via telemetry might be familiar, I believe the method used to gather the data, as well as what's being done with that data, is where the magic lies.

This raises another question. Could network telemetry be the end of SNMP? While my crystal ball remains murky, I believe SNMP has a long run still ahead of it. To supplant the familiar and ubiquitous SNMP, vendors will need to get their heads together on just exactly what this new telemetry format should be.

From what I can tell looking at just three vendors — Cisco, Juniper, and Arista — network telemetry is implemented differently for each of them. Differences slow technology adoption, as the variant solutions place monitoring vendors in the unenviable position of having to pick and choose which telemetry solutions to align themselves with.

Whatever SNMP's shortcomings might be, all you have to do is sort out the OIDs. The industry has already agreed upon the rest.

As 2015 ends, businesses are busy closing deals, evaluating project success, and planning for the New Year. For IT professionals, this transitional period is crucial in building a foundation for success in the upcoming year. This year, I resolve to keep IT stupid simple. It’s guaranteed to KISS away all IT issues.


A Walk in the Clouds, containers and loosely coupled services

It’s easy to get lost in the myriad of new technologies and the associated vendor FUD (fear, uncertainty, and doubt) that fill the IT landscape. It certainly doesn’t make an IT professional’s job any easier when it’s hard to discern fact from fiction, especially when any given problem can be solved in so many different ways.

Ultimately, what’s the most efficient and effective method to troubleshoot and remediate problems?

Keep IT Stupid Simple

So let’s start with the obvious: keeping IT stupid simple. This means that if you don’t understand the ins and outs of your solution stack, then it shouldn’t be your solution. When an application or system slows down, breaks, or fails (and it will), your job is on the line to quickly find the root cause and resolve the issue. Keeping IT stupid simple makes troubleshooting and remediation much easier.

The USE Method

For performance bottlenecks, a great and simple framework to follow is the USE Method by Netflix’s Brendan Gregg.

USE stands for utilization, saturation, and errors. Each aspect is defined below:

  1. Utilization – the average time the resource was busy servicing work
  2. Saturation – the degree to which the resource has extra work it can’t service, often resulting in growing queue lengths
  3. Errors – the count of error events

Think of it as a checklist to keep things simple when troubleshooting and remediating issues. For every key resource, check utilization, saturation, and errors. These aspects are interconnected and provide different clues for identifying bottlenecks.
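
Here is a minimal host-side sketch of that checklist, assuming the psutil package is installed (and a Unix-like OS for the load average); the thresholds are illustrative only:

    import os
    import psutil

    # Utilization: how busy the CPU was over a one-second sample
    util = psutil.cpu_percent(interval=1)

    # Saturation: runnable work vs. capacity (Unix 1-minute load average)
    load1, _, _ = os.getloadavg()
    cores = psutil.cpu_count()

    # Errors: NIC error/drop counters standing in for the error dimension
    nic = psutil.net_io_counters()
    errors = nic.errin + nic.errout + nic.dropin + nic.dropout

    print(f"U: cpu {util:.0f}%  S: load {load1:.2f} on {cores} cores  E: {errors} NIC errors")
    if load1 > cores:
        print("saturated: more runnable work than cores, expect queueing")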

The complete picture

Utilization covers what’s happening over time but, depending on sampling rate and incident intervals, may not provide the complete picture. Saturation provides insight into overload conditions, which may not show up when viewing utilization metrics alone, for the reasons mentioned above. And errors give clues about operations that have failed and led to retries. Combining all three provides a clear view of what is happening during a performance bottleneck.

A proper monitoring tool will collect, aggregate, and visualize utilization while alerting on saturation conditions and logging and correlating errors.

A virtualization example of USE Method

Let’s walk through a simple graphical virtualization example utilizing the USE method focused on one resource metric.

  • Utilization – Let’s examine the host CPU utilization of bas-esx-02.lab.tex, which shows 98% utilization.


Figure 1

  • Saturation – Next, let’s dig in to verify whether a triggered alert indicates a potential saturation condition on the host CPU resource.


Figure 2

  • Errors – Finally, let’s see whether this saturation event had any bearing on the host’s availability. It appears that in a window 2 days and 8 hours before these screen captures were taken, the host server had some availability issues.


Figure 3

The USE method, paired with a proper virtualization tool, epitomizes the keep-IT-stupid-simple principle as a means of troubleshooting any potential bottleneck.


What is your resolution?

So in the upcoming year, what do you resolve to do to complete your data center picture, whether it resides on premises or in the cloud? And what tools will you use, or need, to get it done right? Please chime in below in the comments section.

And join the 2016 IT Resolutions Contest!

