
Geek Speak


I *may* have eaten my weight in turkey and stuffing last week. But the best part about the holiday was how I spent the better part of four days disconnected from just about everything. Disconnecting from time to time was the subject of a talk by adatole recently at a DevOpsDays event. Here's the video if you want to see Leon deliver a wonderful session on a topic that is important to everyone, inside and outside of IT.


Also, here's a bunch of other links I found on the Intertubz that you may find interesting. Enjoy!


Great. Now Even Your Headphones Can Spy on You

As if I needed more reasons to be paranoid, apparently even my tinfoil hat won't help stop this threat.


Madison Square Garden, Radio City Music Hall Breached

A. Full. Year.


Shift Your Point of View to When America Was “Better”

Because I love data visualizations, and so should you.


How Long Did It Take To Make Food in Ancient Times?

Pretty sure if I had to wait 20 days to make some coffee my head would explode.


Oracle Bare Metal Cloud: Top Considerations and Use-Cases

The more I read pieces like this, the more I think Oracle is a sinking ship. Take this quote, for example: "we begin to see that there is a market for public cloud consumption and the utilization of cloud services". Hey, Larry, it's 2016; most of us knew there was a market 10 years ago. And telling me that your cloud will be better than other clouds because...why exactly?


Ransomware Result: Free Ticket to Ride in San Francisco

Get used to seeing more attacks like this one, but more disruptive. It wouldn't take much to shut down the trains altogether in exchange for a quick payout.


Fake News Is Not the Only Problem

A bit long, and politics aside, the takeaway for me here is the sudden realization by many that the Internet may not be the best source of news.


When your team is stuck, rub a little DevOps on your process and everything will be fine.




I’ve stated often that great database performance starts with great database design. So, if you want a great database design, you must find someone with great database experience. But where does a person get such experience?


We already know that great judgment comes from great experience, and great experience comes from bad judgment. That means great database experience is the result of bad judgment repeated over the course of many painful years.


So I am here today to break this news to you. Your database design stinks.


There, I said it. But someone had to be the one to tell you. I know this is true because I see many bad database designs out in the wild, and someone is creating them. So I might as well point my finger in your direction, dear reader.


We all wish we could change the design or the code, but there are times when it is not possible to make changes. As database usage patterns push horrible database designs to their performance limits, database administrators are handed an impossible task: make performance better, but don’t touch anything.


Imagine that you take your car to a mechanic for an oil change. You tell the mechanic they can’t touch the car in any way, not even open the hood. Oh, and you need it done in less than an hour. Silly, right? Well, I am here to tell you that it is also silly to go to your database administrator and say: “We need you to make this query faster, and you can’t touch the code.”


Lucky for us, the concept of “throwing money at the problem” is not new, as shown by this ancient IBM commercial. Of course, throwing money at the problem does not always solve the performance issue. That happens when you don’t know what the issue is to begin with. You don’t want to be the one to spend six figures on new hardware to solve an issue with query blocking. And even after you order the new hardware, it takes time for it to arrive, be installed, and resolve the issue.


That's why I put together this list of things that can help you fix database performance issues without touching code. Use this as a checklist to research and take action upon before blaming code. Some of these items cost no money, but some items (such as buying flash drives) might. What I wanted to do was to provide a starting point for things you can research and do yourself.


As always: You’re welcome.


Examine your plan cache

If you need to tune queries, then you need to know which queries have run against your instance. A quick way to get such details is to look inside the plan cache. I’ve written before about how the plan cache is the junk drawer of SQL Server. Mining your plan cache for performance data can yield improvements such as optimizing for ad-hoc workloads, estimating the correct cost threshold for parallelism, or identifying which queries are using a specific index. Speaking of indexes…
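If you want a place to start digging, a query along these lines pulls the heaviest CPU consumers out of the plan cache. This is only a sketch built on the standard sys.dm_exec_query_stats and sys.dm_exec_sql_text DMVs; adjust the ORDER BY to chase logical reads or execution counts instead of CPU, depending on your bottleneck.

```sql
-- Top 10 cached statements by total CPU time (a starting point, not a verdict).
SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time AS total_cpu_time,
    qs.total_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)   -- -1 means "to end of batch"
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

Remember the plan cache only knows about queries that have run (and stayed cached) since the last restart or cache flush, so treat the results as a sample, not a census.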


Review your index maintenance

I assume you are doing this already, but if not, now is the time to get started. You can use maintenance plans, roll your own scripts, or use scripts provided by some Microsoft Data Platform MVPs. Whatever method you choose, make certain you are rebuilding, reorganizing, and updating statistics only when necessary. I’d even tell you to take time to review for duplicate indexes and get those removed.
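To see whether rebuilds or reorganizations are even warranted, you can ask the engine directly. A sketch follows; the 5 percent and 1,000-page filters are common rules of thumb rather than hard limits, so tune them for your environment.

```sql
-- Indexes in the current database with meaningful fragmentation.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 5
  AND ips.page_count > 1000   -- ignore tiny indexes; fragmentation there is noise
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

The usual guidance is to reorganize in the 5-30% range and rebuild above 30%, but measure before and after rather than taking those numbers on faith.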


Index maintenance is crucial for query performance. Indexes help reduce the amount of data that is searched and pulled back to complete a request. But there is another item that can reduce the size of the data searched and pulled through the network wires…


Review your archiving strategy

Chances are you don’t have any archiving strategy in place. I know because we are data hoarders by nature, and we are only now starting to realize the horrors of such things. Archiving data implies less data, and less data means faster query performance. One way to get this done is to consider partitioning. (Yeah, yeah, I know I said no code changes; this is a schema change to help the logical distribution of data on physical disk. In other words, no changes to existing application code.)


Partitioning requires some work on your end, and it will increase your administrative overhead. Your backup and recovery strategy must change to reflect the use of more files and filegroups. If this isn’t something you want to take on, then you may instead want to consider…


Enable page or row compression

Another option for improving performance is data compression at the page or row level. The tradeoff for data compression is an increase in CPU usage. Make certain you perform testing to verify that the benefits outweigh the extra cost. For tables that have a low number of updates and a high number of full scans, data compression is a decent option. The SQL Server 2008 Best Practices whitepaper on data compression describes in detail the different types of workloads and estimated savings.
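Before enabling anything, estimate the savings. The sketch below uses a hypothetical dbo.SalesHistory table; substitute one of your own large, scan-heavy tables.

```sql
-- Estimate how much space PAGE compression would save for this table.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'SalesHistory',
     @index_id         = NULL,    -- NULL = all indexes on the table
     @partition_number = NULL,    -- NULL = all partitions
     @data_compression = 'PAGE';

-- If the savings justify the CPU cost, enable it with a rebuild:
ALTER TABLE dbo.SalesHistory REBUILD WITH (DATA_COMPRESSION = PAGE);
```

The rebuild is an offline, size-of-data operation by default, so schedule it accordingly.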


But, if you already know your workload to that level of detail, then maybe a better option for you might be…


Change your storage configuration

Often this is not an easy option, if it is an option at all. You can’t just wish for a piece of spinning rust on your SAN to go faster. But technologies such as Windows Storage Spaces and VMware VSAN make it easy for administrators to alter storage configurations to improve performance. At VMworld in San Francisco I talked about how VSAN technology is the magic pixie dust of software-defined storage right now.


If you don’t have magic pixie dust then SSDs are an option, but changing storage configuration only makes sense if you know that disk is your bottleneck. Besides, you might be able to avoid reconfiguring storage by taking steps to distribute your I/O across many drives with…


Use distinct storage devices for data, logs, and backups

These days I see many storage admins configuring database servers to use one big RAID 10, or OBR10 for short. For a majority of systems out there, OBR10 will suffice for performance. But there are times you will find you have a disk bottleneck as a result of all the activity hitting the array at once. Your first step is to separate the database data, log, and backup files onto distinct drives. Database backups should be stored off the server entirely. Put your database transaction log files onto a different physical array. Doing so will reduce your chance of data loss. After all, if everything is on one array, then when that array fails you will have lost everything.
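Moving an existing log file onto its own array is mostly a metadata change, though it does require an outage. A sketch follows; the database name and drive path are hypothetical, so substitute your own.

```sql
-- Take the database offline so the file can be physically relocated.
ALTER DATABASE SalesDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Point SQL Server at the log file's new home on a separate array.
ALTER DATABASE SalesDB
MODIFY FILE (NAME = SalesDB_log, FILENAME = 'L:\SQLLogs\SalesDB_log.ldf');

-- Copy the .ldf file to the new path using the OS, then bring it back:
ALTER DATABASE SalesDB SET ONLINE;
```

Verify the logical file name first with sys.database_files; the NAME in MODIFY FILE must match it exactly.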


Another option is to break out tempdb onto a distinct array as well. In fact, tempdb deserves its own section here…


Optimize tempdb for performance

Of course this is only worth the effort if tempdb is found to be the bottleneck. Since tempdb is a shared resource amongst all the databases on the instance it can be a source of contention. But we operate in a world of shared resources, so finding tempdb being a shared resource is not a surprise. Storage, for example, is a shared resource. So are the series of tubes that makes up your network. And if the database server is virtualized (as it should be these days) then you are already living in a completely shared environment. So why not try…


Increase the amount of physical RAM available

Of course, this only makes sense if you are having a memory issue. Increasing the amount of RAM is easy for a virtual machine when compared to having to swap out a physical chip. OK, swapping out a chip isn’t that hard either, but you have to buy one, then get up to get the mail, and then bring it to the data center, and…you get the idea.


When adding memory to your VM, one thing to be mindful of is whether your host is using vNUMA. If so, adding more memory may result in performance issues for some systems. Be mindful of this and know what to look for.


Memory is an easy thing to add to any VM. Know what else is easy to add on to a VM?


Increase the number of CPU cores

Again, this is only going to help if you have identified that CPU is the bottleneck. You may want to consider swapping out the CPUs on the host itself if doing so gets you a boost in performance. But adding physical hardware such as a CPU, same as with adding memory, may take too long to complete. That’s why VMs are great: you can make these modifications in a short amount of time.


Since we are talking about CPUs, I should also mention the Windows power plan settings; the default Balanced plan is a known issue for database servers. But even with virtualized servers, resources such as CPU and memory are not infinite…


Reconfigure VM allocations

Many performance issues on virtualized database servers are the result of the host being over-allocated. Over-allocation by itself is not bad. But over-allocation leads to over-commit, and over-commit is when you see performance hits. You should be conservative with your initial allocation of vCPU resources when rolling out VMs on a host. Aim for a 1.5:1 ratio of vCPU to logical cores and adjust upwards from there, always paying attention to overall host CPU utilization. For RAM, you should stay below 80% total allocation, as that allows room for growth and migrations as needed.


You should also take a look at how your network is configured. Your environment should be configured for multi-pathing. Also, know your current HBA queue depth, and what values you want.



We’ve all had times when we’ve been asked to fix performance issues without changing code. The items listed above are options for you to examine and explore in your effort to improve performance before changing code. Of course, it helps if you have an effective database performance monitoring solution in place to help you make sense of your environment. You need to have performance metrics and baselines in place before you start turning any "nerd knobs"; otherwise, you won't know if you are having a positive impact on performance, no matter which option you choose.


With the right tools in place collecting performance metrics you can then understand which resource is the bottleneck (CPU, memory, disk, network, locking/blocking). Then you can try one or more of the options above. And then you can add up the amount of money you saved on new hardware and put that on your performance review.

Last week, we talked about monitoring the network from different perspectives. By looking at how applications perform from different points in the network, we get an approximation of the users' experience. Unfortunately, most of those tools are short on the details surrounding why there's a problem or are limited in what they can test.

On one end of our monitoring spectrum, we have traditional device-level monitoring. This is going to tell us everything we need to know that is device-specific. On the other end, we have the application-level monitoring discussed in the last couple of weeks. Here, we approximate how end users see their applications performing. The former gives us a hardware perspective and the latter a user perspective. The perspective of the network as a whole lies somewhere in between.

Using testing agents and responders on the network at varying levels can provide that intermediate view. They allow us to test against all manner of traffic, factoring in network latency and its variance (jitter).

Agents and Responders

Most enterprise network devices have built-in functions for initiating and responding to test traffic. These allow us to test and report on the latency of each link from the device itself. Cisco and Huawei have IP Service Level Agreement (SLA) processes. Juniper has Real-Time Performance Monitoring (RPM) and HPE has its Network Quality Analyzer (NQA) functions, just to list a few examples. Once configured, we can read the data from them via Simple Network Management Protocol (SNMP) and track their health from our favourite network monitoring console.

Should we be in the position of having an all-Cisco shop, we can have a look at SolarWinds' IP SLA Monitor and VoIP and Network Quality Manager products to simplify setting things up. Otherwise, we're looking at a more manual process if our vendor doesn't have something similar.


Observing test performance at different levels gives us reports of different granularity. By running tests at the organization, site and link levels, we can start with the bigger picture's metrics and work our way down to specific problems.


Organization-level tests will mostly be installed at the edge devices, or close to them. They will perform edge-to-edge tests against a device at the destination organization or cloud hosting provider. There shouldn't be too many of these tests configured.


Site-to-site tests will be configured close to the WAN links and will monitor overall connectivity between sites. The point of these tests is to give a general perspective on intersite traffic, so they shouldn't be installed directly on the WAN links. Depending on our organization, there could be none of these or a large number.


At the link level, each network device has a test for each of its routed links to other network devices to measure latency. This is where the largest number of tests are configured, but it is also where we will find the most detail.


Agent and responder testing isn't passive. There's always the potential for unwanted problems caused by implementing the tests themselves.


Agent and responder tests introduce traffic to the network for purposes of testing. While that traffic shouldn't be significant enough to cause impact, there's always the possibility that it will. We need to keep an eye on the interfaces and queues to be sure that there isn't any significant change.

Frequency and Impact

Running agents and responders on the network devices themselves is going to generate additional CPU load. Network devices as a whole are not known for having a lot of processing capacity, so the frequency of these tests may need to be adjusted to factor that in.

Processing Delay

Related to the previous paragraph, most networking devices aren't going to be performing these tests quickly. The results from these tests may require a bit of a "fudge factor" at the analysis stage to account for this.

The Whisper in the Wires

Having a mesh of agents and responders at the different levels can provide point-in-time analysis of latencies and soft failures throughout the network. But, it needs to be managed carefully to avoid having negative impacts to the network itself.

Thanks to Thwack MVP byrona for spurring some of my thinking on this topic.

Is anyone else building something along these lines?

For government agencies, network monitoring has evolved into something extremely important, yet unnecessarily complex. For instance, according to Gleanster Research, 62 percent of respondents use on average three separate monitoring tools to keep their networks safe and functioning properly.


Network monitoring tools have become an integral part of agencies’ IT infrastructures, as they allow administrators to more easily track overall network availability and performance. All of this can be handled in real-time and with accompanying alerts, making network monitoring a must for agencies seeking to bolster their security postures.


Below, we’ll break down three monitoring techniques that will help you get a handle on how effective network monitoring can solve numerous problems for your agency.


Slay Problems through IP SLA


IP SLA – short for Internet Protocol Service Level Agreement – sounds complex, but in reality its function is a simple one: ensuring the voice-over-IP (VoIP) environment is healthy. IP SLA allows IT administrators to set up certain actions to occur on a network device and have the results of those operations reported back to a remote server.


For example, the operation may include checking if a Web page or DNS server is responding, or whether a DHCP server is responding and handing out IP addresses. This is a huge asset because it uses the existing devices within the network infrastructure rather than requiring you to set up separate devices (or agents on existing PCs or servers) to run tests.
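On Cisco devices, such an operation is only a few lines of configuration. Here is a sketch of a DNS probe; the operation number, probe target, and name server address are hypothetical, and exact syntax varies by IOS version.

```
! Define IP SLA operation 10: resolve a hostname against a DNS server
! every 60 seconds, so resolution failures and latency are tracked.
ip sla 10
 dns www.example.com name-server 10.0.0.53
 frequency 60
! Start the operation now and run it indefinitely.
ip sla schedule 10 life forever start-time now
```

Once scheduled, the results are exposed via SNMP, so your monitoring console can poll and alert on them like any other metric.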


Trace the NetFlow of “Conversations”


NetFlow has the ability to capture network “conversations” for you. NetFlow data is captured by one or more routers operating near the center of the network.


Simply put, if DesktopComputer_123 is sending a file to Server_ABC via FTP, that is one conversation. The same PC browsing a webpage on the same server using HTTP is another conversation. NetFlow operates in the middle of these conversations to collect data so that the monitoring server can then aggregate, parse, and analyze the data.


Hook Into API Monitoring


Using a network monitoring Application Programming Interface (API) can be the murkiest of all of the techniques we’ve discussed. In essence, to understand how an API is used, you must realize that there are hooks built into applications that allow for data requests. Each time this type of request is received, a response is sent back to the monitoring software, giving you a better understanding of how your network is performing. Microsoft System Center Operations Manager (SCOM) is a proprietary example of a network monitoring API, while VMware’s API is published and generally available.


Make no mistake — maintaining network security in today’s environment is more complex and crucial than ever. Having the tools in place – and understanding what tools are out there for federal government agencies – is a must.  But the good news is that these tools do exist.  And with less work than you may have expected, you can quickly understand and appreciate what you can do to crack the case of network security.


Find the full article on our partner DLT’s blog, TechnicallySpeaking.

Over the past five postings, I’ve talked about some trends that we have seen happening and gaining traction within the cloud space. I’ve spoken of:


  • Virtualization – Established trends toward virtualization, particularly VMware, have been challenged by a variety of newcomers whose market share continues to grow. Most notable here is OpenStack. VMware has answered the threat of Azure, AWS, and OpenStack by embracing them with a series of APIs meant to incorporate the on-prem virtual data center with those peers in the hybrid space.


  • Storage – In the case of traditional storage, the trend has been toward faster: faster Ethernet or Fibre Channel as the interconnect, and, of course, solid state becoming the norm in any reasonably high-I/O environment. But the biggest sea change is the move to object-based storage. Object really is a different approach, with replication, erasure coding, and redundancy built in.


  • Software Defined Networking – SDN is eating quite drastically into the data center space these days. The complexities of routing tables and firewall rules are being addressed within the virtual data center by tools like ACI (Cisco) and NSX (VMware). While port reduction isn’t quite the play here, the ability to segment a network via these rules far surpasses any physical switch’s capacity. In addition, these rules can be rolled out effectively, accurately, and with easy rollback. I find these two pieces truly compelling for maintaining and enhancing the elegance of the network while reducing the complexities laid onto the physical switch environment.


  • Containers – In the new world of DevOps, containers, a way to disaggregate the application from the operating system, have proven yet another compelling way into the future. DevOps calls for the ability to update parts and pieces of an application, and containers allow you to scale the application, update it, and deploy it wherever and whenever you want.


  • Serverless and MicroServices – These also fall into the DevOps equation: small components, put together as building blocks, make up the entire application and keep it dynamic and modifiable. The “serverless” piece is somewhat a misnomer (any workload must still reside on some compute layer), but these services are dynamic and movable, with little dependence on any particular hypervisor or location.


So… What’s next in the data center infrastructure? We’ve seen tools that allow the data center administrator to easily deploy workloads into a destination wherever that may be, we’ve seen gateways that bridge the gap from more traditional storage to object based, we’ve seen orchestration tools which allow for the rapid, consistent, and highly managed deployment of containers in the enterprise/cloud space, and we’ve seen a truly cross-platform approach to serverless/MicroService type of architecture which eases the use of a newer paradigm in the data center.


What we haven’t seen is a truly revolutionary unifier. For example, when VMWare became the juggernaut it did become, the virtualization platform became the tool that tied everything together. Regardless of your storage, compute (albeit X86 particularly) and network infrastructure, with VMWare as a platform, you had one reliable and practically bulletproof tool with which to deploy new workloads, manage existing platforms, essentially scale it up or down as required, and all through the ease of a simple management interface. However, with all these new technologies, will we have that glue? Will we have the ability to build entire architectures, and manage them easily? Will there be a level of fault tolerance; an equivalent to DRS, or Storage DRS? As we seek the new brass ring and poise ourselves onto the platforms of tomorrow, how will we approach these questions?


I’d love to hear your thoughts.

Part 2 of a 3-part series, which is itself a longer version of a talk I give at conferences and conventions.

You can find part 1 here.

I'd love to hear your thoughts in the comments below!


In the first part of this series, I made a case for why disconnecting some times and for some significant amount of time is important to our health and career. In this segment I pick up on that idea with specific things you can do to make going offline a successful and positive experience.


Don’t Panic!

If you are considering taking time to unplug, you probably have some concerns, such as:

  • how often and for how long should you unplug
  • how do you deal with a workload that is already threatening to overwhelm you
  • how will your boss, coworkers, friends perceive your decision to unplug
  • how do you maintain your reputation as a miracle worker if you aren’t connected
  • how do you deal with pseudo medical issues like FOMO
  • what about sev1 emergencies
  • what if you are on-call


Just take a deep breath. This isn't as hard as you think.


Planning Is Key

"To the well-organized mind, death is but the next great adventure."

- Albus Dumbledore


As true as these words might be for Nicolas Flamel as he faces his mortality, they are even truer for those shuffling off the mortal coil of internet connectivity. Because, like almost everything else in IT, the decisions you make in the planning phase will determine the ultimate outcome. Creating a solid plan can make all the difference between experiencing boring, disconnected misery and relaxed rejuvenation.


The first thing to plan out is how long you want to unplug, and how often. My advice is that you should disconnect as often, and for as long per session, as you think is wise. Period. It's far more important to develop the habit of disconnecting and experience the benefits than it is to try to stick to some one-size-fits-most specification.


That said, be reasonable. Thirty minutes isn't disconnecting. That’s just what happens when you're outside decent cell service. You went offline for an hour? I call that having dinner with Aunt Frieda, the one who admonishes you with a “My sister didn't raise you to have that stupid thing out at the table." Haven't checked Facebook for two or three hours? Amateur. That's a really good movie, or a really, REALLY good date.


Personally, I think four hours is a good target. But that's just me. Once again, you have to know your life and your limits.


At the other end of the spectrum, unless you are making some kind of statement, dropping off the grid for more than a day or two could leave you so shell-shocked that you'll avoid going offline again for so long that you may as well have never done it.


One suggestion is to try a no-screens-Sunday-morning every couple of weeks, and see how it goes. Work out the bugs, and then re-evaluate to see if you could benefit from extending the duration.


It's also important to plan ahead to decide what counts as online for you. This is more nuanced than it might seem. Take this seemingly clear-cut example: you plan to avoid anything that connects to the outside world, including TV and radio. There are still choices. Does playing a CD count? If so, can you connect to your favorite music streaming service, since it’s really just the collection of music you bought? What about podcasts?


The point here is that you don’t need to have the perfect plan. You just need to start out with some kind of plan and be open-minded and flexible enough to adjust as you go.


You also need to plan your return to the land of the connected. If turning back on again means five hours of hacking through email, twitter feeds, and Facebook messages, then all that hard won rest and recharging will have gone out the window. Instead, set some specific parameters for how you reconnect. Things like:

  • Limit yourself to no more than 30 minutes of sorting through email and deleting garbage
  • Another 30 to respond to critical social media issues
  • Decide which social media you actually HAVE to look at (Do you really need to catch up on Pinterest and Instagram NOW?)
  • If you have an especially vigorous feed, decide how far back (in hours) that you will scroll


As I said earlier, any good plan requires flexibility. These plans are more contingencies than tasks, and you need to adhere to a structure, but also go with the flow when things don't turn out exactly as expected.


Preparation Is Key

Remember how I said that Shabbat didn't mean sitting in the dark eating cold sandwiches? Well, the secret is in the preparation. Shabbat runs from Friday night to Saturday night, but a common saying goes something like, "Shabbat begins on Wednesday.” This is because you need time to get the laundry done and food prepared so that you are READY when Friday night arrives.


An artist friend of mine goes offline for one day each week. I asked him what happens if he gets an idea in the middle of that 24-hour period. He said, "I make an effort all week to exhaust myself creatively, to squeeze out every idea that I can. That way I look at my day off as a real blessing. A day to recharge because I need it."


His advice made me re-think how I use my time and how I use work to set up my offline time. I ask myself whether the work I'm doing is the stuff that is going to tear my guts out when I'm offline if it's not done. I also use a variety of tools - from electronic note and to-do systems to physical paper - so that when it's time to drop offline, I have a level of comfort that I'm not forgetting anything, and that I'll be able to dive back in without struggling to find my place.


Good preparation includes communicating your intentions. I'm not saying you should broadcast it far and wide, but let key friends, relatives, and coworkers know that you will be “…out of data and cell range.”


This is exactly how you need to phrase it. You don’t need to explain that you are taking a day to unplug. That's how the trouble starts. Tell people that you will be out of range. Period.


If needed, repeat that phrase slowly and carefully until it sounds natural coming out of your mouth.


When you come back online, the opposite applies. Don't tell anyone that you are back online. Trust me, they'll figure it out for themselves.


In the next installment, I'll keep digging into the specifics of how to make going offline work for you. Meanwhile, if you have thoughts, suggestions, or questions, let me know in the comments below!


(image courtesy of Marvel)


...I learned from "Doctor Strange"

(This is part 1 of what will be a 4-part series. Enjoy!)


"When the student is ready, the teacher appears," is a well-known phrase, but I was struck recently by the way that sometimes the teacher appears in unexpected forms. It's not always the kindly and unassuming janitor, Mr. Miyagi, or the crazy old hermit, Ben Kenobi. Sometimes the teacher isn’t a person or a character, but an entire movie filled with lessons for ready students. 


I found myself in that situation recently, as I sat watching Dr. Strange, the latest installment in the Marvel cinematic universe.


There, hidden among the special effects, panoramic vistas, and Benedict Cumberbatch's cheekbones were some very real and meaningful IT career lessons, applicable to both acolytes and masters as they walk the halls of your own technological Kamar Taj. In fact, I discovered a series of lessons, more than I can fit into just one essay.


So, over the next couple of installments I'm going to share them with you, and I’d like to hear your thoughts and reactions in the comments below.


If it needs to be said, there are many spoilers in what follows. If you haven't seen the movie yet, and don't want to know what's coming, bookmark this page to enjoy later.


Know the essential tools of the trade

The movie introduces us to the concept of a sling ring, a magical device that allows a sorcerer to open a portal to another location. In the narrative arc of the movie, this appears to be one of the first and most basic skills sorcerers are taught. It was also the key to many of the plot twists and a few sight gags in the movie. In my mind, I equated the concept of the sling ring with the idea that all IT pros need to understand and master basic skills, such as IP subnetting, command line syntax, coding skills, and security.


Can you be a solid IT pro without these skills? Sure, but you'll never be a master, and odds are good that you'll find yourself hanging around the lower end of the career ladder far longer than you’d like.


Think creatively about how to use the technology you already have

In the movie, immediately after figuring out how to use a sling ring, we see the hero use it in non-standard ways. Instead of opening a portal for his whole body, he opens holes just big enough for his hands, so that he can borrow books from the library and avoid being detected by Wong the librarian. We see this again in the use of the Eye of Agamotto during Doctor Strange's face-off against Dormammu.


The great thing about essential IT skills is that they can be used in so many ways. Understanding network routing will allow you to build stronger and more secure environments in the cloud. A grasp of regular expressions will help you in coding, in using various tools, and more. Understanding the command line, rather than being trapped in the GUI all the time, allows you to automate tasks, perform actions more quickly, and extend functionality.


It's worth noting that here at SolarWinds we place great stock in enabling our users to think outside the box. We even have a SolarWinds User Group (SWUG) session on doing just that – called “Thinking Outside the Box”.


Don't let your desire for structure consume you

In the movie, Mordo began as an ally, and even friend, of Stephen Strange, but displayed certain issues throughout. When he claimed he had conquered his demons, the Ancient One replied, "We never lose our demons. We only learn to live above them."


Mordo’s desire to both protect the natural order and remain steadfastly within its boundaries proved his undoing: he abandoned the sorcerers of Kamar Taj when he found that both the Ancient One and Doctor Strange had bent the rules in order to save the world.


I find this relevant when I see seasoned IT pros forcing themselves to operate within constraints that don't exist, except in their own minds. When I hear IT pros proclaim that they would never run (name your operating system, software package, or hardware platform) in their shop, it's usually not for any sound business reason. And when those standards are challenged, I have watched more than a few seasoned veterans break rather than bend. It's not pretty, and it's also not necessary.


There are never too many sorcerers in the world

Mordo's reaction is extreme. He begins hunting down other practitioners of the magical arts and taking their power, proclaiming, "There are too many sorcerers in the world!"


There are times in IT when it feels like EVERYONE is trying to become a (again, fill in your technology or specialty here) expert. And it's true that when a whole crop of new folks come into a discipline, it can be tiresome watching the same mistakes being made, or having to explain the same concepts over and over.


But the truth is that there are never enough sorcerers, or in our case, specialists, in the world. There's plenty of work to go around. And the truth is that not everyone is cut out for some of these specialties, and they soon find themselves overwhelmed and leave – hopefully to find an area of IT that suits them better.


While I don't expect that anyone reading this will magically extract the IT power from their peers, I have watched coworkers shoot down or even sabotage the work of others just so they can maintain their own privileged status. I'm happy to say that this tactic rarely works, and never ends well.


Persistence often pays off

At one point in the movie, the Ancient One sends Strange on a trip through alternate dimensions, then asks, "Have you seen that at a gift shop?" When Strange begs her to teach him, her response is a firm “no.” Hours later, Strange is wailing at the door, begging to be let in.


At some point in your career, you may have an epiphany and realize that your career goals point you toward a certain technology or discipline. And, just your luck, there's a team that specializes in exactly that! So you go to the manager or team lead and ask if you can join up.


Your first request to join the team may fall on deaf ears. And your second. You may need to hang around them, like a sad puppy dog, in the lunchroom or at the water cooler for a while. Unlike Doctor Strange, it may take weeks or even months of persistence, rather than a few hours. But that doesn't mean it's not worth it.


Did you find your own lesson when watching the movie? Discuss it with me in the comments below. And keep an eye out for parts 2-4, coming in the following weeks.

The series is a general interest piece and is not related to SolarWinds products in any way, nor will it be used to promote SolarWinds products.


It will be hosted on the free, open user community for monitoring experts.



Tomorrow is Thanksgiving here in the USA. I have much to be thankful for but these days I am most thankful that jennebarbour continues to let me write this series each and every week.


So, in that spirit, here's a bunch of links I found on the Intertubz that you may find appetizing, enjoy!


AOL is laying off 500 employees in a restructuring with focus on mobile, data and video

My first thought was, "AOL still has employees?"


How to eat as much food as humanly possible this Thanksgiving

For those of us in IT who don't already know how to eat way more than necessary, here's a list to help.


Nothing Personal but I'm Taking Your Job

I've said it before, and I will say it again: If you aren't trying to automate your job away, someone else will do it for you.


6 links that will show you what Google knows about you

If you were curious to see yourself as Google sees you.


How To Ask A Question At A Conference

After a busy event season, this is a nice reminder on how to be polite when asking questions during a session. Nobody wants to see you strut (HT to datachick).


Live Streaming Web Cam Views from Around the World

I'm wondering how many of these webcams are meant to be public, and how many are simply the result of the owner having no idea.


Eat, Fry, Love

If you haven't seen this video yet, you should. It's a great PSA about the dangers of deep-frying a turkey.


It won't happen this year, but that won't stop me from dreaming about this:



Happy Thanksgiving!

Last week we talked about application-aware monitoring. Rather than placing our focus on the devices and interfaces, we discussed getting data that approximates our users' experiences. These users are going to be distributed around the organization at the very least; they may even be scattered around the Internet, depending on the scope of our application. We need to examine application performance from different perspectives to get a complete picture.

Any way we look at it, we're going to need active remote probes/agents to accomplish what we're looking for. Those should be programmable to emulate application behaviour, so that we can get the most relevant data. At the least, having something that can measure basic network performance from any point on the network is necessary. There are a few options.


Last week, I was invited to Tech Field Day 12 as a delegate and had the opportunity to sit in on the first session of Networking Field Day 13 as a guest. Coincidentally, SolarWinds was the first presenter. Even more coincidentally, they were showing off the NetPath feature of Network Performance Monitor (NPM) 12. This product, while not yet fully programmable to emulate specific applications, provides detailed hop-by-hop analysis from any point at which an agent/probe can be placed. In addition, it maintains a performance history for those times when we get notification of a problem well after the fact. For those of you working with NPM 12, I'm going to recommend you have a very close look at NetPath as a beginning for this sort of monitoring. One downside of the NetPath probes is the requirement to have a Windows Professional computer running at each agent location. This makes it a heavier and more costly option, but well worth it for the information that it provides. Hopefully, the SolarWinds folks will look into lightweight options for the probe side of NetPath in the future. We're only at 1.0, so there's a lot of room for growth and development.

Looking at lighter, though less full-featured, options, we have a few. They're mostly roll-your-own solutions, but this adds flexibility at the cost of ease.

Lightweight VMs and ARM Appliances

If there's a little bit of room on a hypervisor somewhere, that's enough space to install a lightweight VM. Regular application performance probes can be run from these and report directly to a monitoring station via syslog or SNMP traps. These custom probes can even be controlled remotely by executing them over SSH.

In the absence of VM space, the same sort of thing can be run from a small ARM computer, like a Raspberry Pi. The probe device itself can even be powered by the on-board USB port of another networking device nearby.
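As a sketch of what such a roll-your-own probe might look like (hostnames, probe names, and the message layout here are all invented for illustration), a few lines of Python can measure TCP connect latency to a service and report the result to a monitoring station as a syslog message over UDP:

```python
import socket
import time

def measure_tcp_latency(host, port, timeout=2.0):
    """Return TCP connect latency to host:port in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def format_syslog(probe_name, target, latency_ms, facility=16, severity=6):
    """Build an RFC 3164-style syslog message; PRI is facility * 8 + severity."""
    pri = facility * 8 + severity
    status = "unreachable" if latency_ms is None else f"{latency_ms:.1f}ms"
    return f"<{pri}>{probe_name}: target={target} latency={status}"

def report(collector, message, port=514):
    """Send one probe result to the monitoring station over UDP syslog."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (collector, port))

# Typical cron job at a probe site (hostnames are illustrative):
#   msg = format_syslog("austin-probe", "o365:443",
#                       measure_tcp_latency("outlook.office365.com", 443))
#   report("monitor.example.com", msg)
```

Cron (or remote invocation over SSH, as described above) handles the scheduling; the monitoring station only needs to listen for syslog.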

Going back to NetPath for a moment, one option for SolarWinds is to leverage Windows Embedded and/or Windows IoT as a lightweight option for NetPath probes. This is something I think would be worth having a look at.

On-device Containers

A few networking companies (Cisco's ISR 4K line, for example) have opened up the ability to run small custom VMs and containers on the device itself. This extends the availability of agents/probes to locations where there are no local compute resources available.

Built-in Router/Switch Functions

Thwack MVP byrona had a brilliant idea with his implementation of IP SLA in Cisco routers and having Orion collect the statistics, presumably via SNMP. This requires no additional hardware and minimal administrative overhead. Just set up the IP SLA process and read the statistics as they're generated.
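To sketch what the collection side of that idea might look like (the OID, index, and output parsing assume net-snmp's snmpget and the CISCO-RTTMON-MIB, so treat the specifics as assumptions to verify against your own gear rather than a definitive recipe):

```python
import re
import subprocess

# rttMonLatestRttOperCompletionTime from CISCO-RTTMON-MIB: the latest
# round-trip completion time, in milliseconds, for a given IP SLA entry.
# Verify the OID and your entry index against your own devices.
RTT_OID = "1.3.6.1.4.1.9.9.42.1.2.10.1.1"

def parse_snmp_value(line):
    """Extract the integer value from a net-snmp 'Gauge32: N' / 'INTEGER: N' line."""
    match = re.search(r"(?:Gauge32|INTEGER):\s*(\d+)", line)
    return int(match.group(1)) if match else None

def poll_ip_sla(router, community, entry_index):
    """Read one IP SLA entry's latest completion time (ms) via snmpget."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, router, f"{RTT_OID}.{entry_index}"],
        capture_output=True, text=True, check=True,
    )
    return parse_snmp_value(result.stdout)
```

A collector like Orion does this for you; the point is only that the raw data is a single SNMP read away once the IP SLA entry exists.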

The Whisper in the Wires

NetPath is looking like a very promising approach to monitoring from different points of view. For most other solutions, we're unfortunately still mostly at the roll-your-own stage. Still, we're seeing some promising solutions on the horizon.

What are you doing to get a look at your application performance from around the network?



I wanted to share some of the things I heard and saw during the incredible two days I spent with 300+ attendees at DevOps Days Ohio.


First, I have to admit that after more than a year of attending DevOpsDays around the country, I'm still working on my own definition of what DevOps is, and how it compares and contrasts with more traditional operations practices. But this event helped gel a number of things for me.


What I realized, with the help of this article (which came out while I was at the conference), is that my lack of clarity is okay, because sometimes the DevOps community is also unclear on what they mean.


One of the ongoing points of confusion for me is the use of words I think I know, but in a context that tells me it means something else. Case in point: configuration management. In my world, that means network device configurations, specifically for backing up, comparing, auditing, and rolling out. But then I hear a pronouncement that, "Config management is code," and, "If you are working on configs, you are a developer now." And most confusingly, "To do config management right, you need to be on Git."

If this has ever struck you as strange, then you (and I) need to recognize that to the DevOps community, the server (and specifically the virtualized server) is king, and the config management they're talking about is the scripted creation of a new server in on-premises or cloud-based environments.
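A toy illustration of that mindset (the package names and the tool itself are entirely hypothetical): config management in the DevOps sense means declaring the desired state of a server and letting a tool compute the changes, rather than backing up whatever happens to be there:

```python
def plan_changes(desired, actual):
    """Compare a desired server state against the actual one and return the
    actions a config-management tool would need to take."""
    actions = []
    for pkg, version in desired.items():
        if pkg not in actual:
            actions.append(("install", pkg, version))
        elif actual[pkg] != version:
            actions.append(("update", pkg, version))
    for pkg in actual:
        if pkg not in desired:
            actions.append(("remove", pkg, None))
    return sorted(actions)

# Declaring state, not steps: applying the same plan twice is a no-op.
desired = {"nginx": "1.10", "openssl": "1.0.2"}
actual = {"nginx": "1.9", "telnetd": "0.17"}
```

Because the desired state is just text, it lives naturally in Git, which is where the "config management is code" pronouncements come from.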


This led to some hilarious interactions for me, including a side conversation where I was talking about on-call emergencies and the other person said, "I don't know why on-call is even a thing any more. I mean, if a system is having a problem, you should just delete it and rebuild it from code, right? Humans don't need to be involved at all."


To which I replied, "Interesting idea, but to my knowledge it's very difficult to delete and rebuild a router with a bad WIC using nothing but code."


The reply? "Oh, well, yeah, there's that."


The point of this is not that DevOps-focused IT pros are somehow clueless to the realities of the network, but that their focus is so intensely trained on optimizing the top end of the OSI model, that we monitoring experts need to allow for that, and adjust our dialogue accordingly.


I was honestly blown away to learn how deeply DevOps culture has made inroads, even into traditionally risk-averse environments such as banking. I worked at a bank between 2006 and 2009, right in the middle of the home mortgage crisis, and I could never have imagined something like DevOps taking hold. But we heard from folks at Key Bank who spoke openly about the concerns, challenges, and ultimately successes that their shift to DevOps has garnered them, and I saw the value that cloud, hybrid IT, micro-services, and agile development hold for businesses that are willing to consider them within the context of their industry and implement them rationally and thoughtfully.


I was also heartened to hear that monitoring isn't being overlooked. One speaker stated flat out that having monitoring in place is table stakes for rolling out micro-services. This shows an appreciation for the skills we monitoring engineers bring to the table, and presages a potential new avenue for people who simply have monitoring as a bullet item on their to-do list to make the leap into a sub-specialization.


There is a lot of work to do, in the form of education, for monitoring specialists and enthusiasts. In one-on-one conversations, as well as in OpenSpace discussions, I found experienced DevOps folks conflating monitoring with alerting; complaining about alerts as noise, while demonstrating a lack of awareness that alerts could be tuned, de-duplicated, or made more sophisticated, and therefore more meaningful; and overlooking the solutions of the past simply because they believed new technology was somehow materially different. Case in point, I asked why monitoring containers was any harder or even different from monitoring LPARs on AIX, and got nervous chuckles from the younger folks, and appreciative belly laughs from some of the old timers in the room.


However, I came to the realization that DevOps does represent a radical departure for monitoring engineers in its "Cattle, not Pets" mentality. When an entire server can be rebuilt in the blink of an eye, the best response to a poorly behaving service truly may not be to fix the issue, but to destroy the instance and redeploy it from code. That attitude alone may take time to absorb for those of us mired in biases from the old days of bare-metal hardware and servers we named after the Brady Bunch or Hobbit dwarves.


Overall, I am excited for the insights that are finally gelling in my mind, and look forward to learning more and becoming a more fluent member of the DevOps community, especially during my upcoming talk at DevOpsDays Tel Aviv!


One final thing: I gave an Ignite talk at this conference and found the format (five minutes, 20 slides that auto-advance every 15 seconds), to be both exhilarating and terrifying. I'm looking forward to my next chance to give one.

Staying one step ahead of hackers trying to infiltrate an IT environment is challenging. It can be nearly impossible if those tasked with protecting that environment don’t have visibility across all of the systems and infrastructure components. Using unified monitoring software gives integrated cross-domain visibility and a solid view of the whole environment.


Let’s take a look at an attack scenario

Perhaps a hacker gains access through a web application with a SQL injection attack against a database server. The attack compromises the database and exfiltrates data or harvests credentials.


With access to the local database or server, the attacker can drop malware that could reverse an administrative session and gain access to other parts of the infrastructure, including routers, switches and firewalls. Attack evidence would likely be found in various places within the environment; such evidence might not trigger an alert, but taken together, these events clearly signal a problem.
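To make the entry point concrete, here's a minimal sketch of the injection itself (the in-memory SQLite table, names, and schema are invented for illustration): a query built by string interpolation falls to attacker input, while a parameterized query does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def find_user_unsafe(name):
    # Building SQL by string interpolation: attacker input becomes SQL.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Parameterized query: the driver binds input as data, never as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
assert find_user_unsafe(payload) == [("alice",)]  # injection matched every row
assert find_user_safe(payload) == []              # bound parameter matched nothing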


Visibility leads to quick resolution

With comprehensive monitoring tools, clear insight, and consistent education throughout the IT team and all agency personnel, the task of defending the environment can seem less daunting.


The tools

First, make sure monitoring tools are in place to provide deep visibility. These include the following:


  • Endpoints: User device tracking will provide information about where devices are located, how they connect to the network, and who uses them.
  • Data: Make sure you have monitoring in place that will detect and block malicious file transfer activity, along with software designed to securely transfer and track files coming into and going out of the agency.
  • Patching: In large environments, something always needs to be updated, so it is important to use software that automatically patches servers and workstations.
  • Servers and applications: Always monitor server and application performance. This will help you find service degradation that could indicate an intrusion.
  • Databases: Create performance baselines for databases to ensure that any anomalies are registered.
  • Systems: Deep visibility into virtual machines and storage devices can provide insight into the root cause of any performance change.
  • Networks: Traffic analysis, firewall and router monitoring, and configuration compliance and optimization are all critical to ensuring the integrity of a network.
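As one illustration of the database-baseline idea (the threshold, sample data, and units below are invented), a simple standard-deviation check against a recorded baseline is enough to register an anomaly:

```python
from statistics import mean, stdev

def is_anomalous(baseline, sample, threshold=3.0):
    """Flag a sample that sits more than `threshold` standard deviations
    away from the baseline's mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return sample != mu
    return abs(sample - mu) / sigma > threshold

# A week of query response times (ms) forms the baseline; a sudden 80 ms
# reading registers as an anomaly, while 12.3 ms does not.
baseline = [12.0, 11.5, 12.2, 11.8, 12.1, 12.4, 11.9]
```

Real monitoring products do this with more sophistication, but the principle is the same: no baseline, no anomaly detection.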


The knowledge

Once these tools are monitoring what they should, the resulting data needs to be fed into a consolidated view where it can be correlated and analyzed as a whole. Doing so lets IT pros quickly and decisively identify potential threats and take action where needed.
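A minimal sketch of that correlation step (the event shape and the two-domain rule are my own invention for illustration): collect timestamped events from each tool, bucket them by host and time window, and flag hosts where multiple monitoring domains report trouble at once:

```python
from collections import defaultdict

def correlate(events, window=300):
    """Group (timestamp, host, domain) events into per-host time buckets and
    flag any host where two or more monitoring domains fired in one window."""
    buckets = defaultdict(set)
    for ts, host, domain in events:
        buckets[(host, ts // window)].add(domain)
    return sorted({host for (host, _), domains in buckets.items() if len(domains) >= 2})

events = [
    (1000, "db01", "database"),   # query latency spike
    (1100, "db01", "network"),    # odd outbound traffic from the same host
    (1150, "web03", "server"),    # lone CPU alert elsewhere
]
# correlate(events) flags db01: two domains reported within one window
```

No single event above would trigger an alert on its own; it's the consolidated view that turns them into a signal.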


The training

Finally, it is important to make sure that the people who work on the network receive detailed security training. Making everyone aware of the seriousness of an attack and the role each worker plays in practicing good cyber hygiene—from the IT team to finance and public affairs—can go a long way in creating a more secure agency.


There is no one-size-fits-all solution when it comes to security, and attacks are becoming harder to prevent. That said, implementing the right tools, combining insights across domains and providing in-depth, regular training can improve detection and response capabilities.


Find the full article on Signal.

For the last couple of years, the single hottest emerging trend in technology, the biggest buzzword, and a key criterion for designing both hardware and application bases has been the concept of containers.


At this point, we have approaches from Docker, Google's Kubernetes (k8s), Mesos, and, notably, Project Photon from VMware. While there are differentiations on all fronts, the concept is quite similar: the container, regardless of the flavor, typically contains the packaged, migratable, complete or component parts of the application. These containers work as workloads in the cloud, and allow you to take that packaged piece and run it practically anywhere.


This is in direct contrast to the idea of virtual machines. While VMs can in some ways accomplish the same tasks, they haven't got the portability to reside as-is on any platform. A VMware-based virtual machine can only reside on a VMware host; likewise, Hyper-V, KVM, and OpenStack-based VMs are limited to their native platforms. Processes for migrating these VMs to alternate platforms do exist, but the procedures are somewhat intensive. Ideally, you'd simply place your workload VMs in their target environment and keep them there.


That model is still necessary for many older types of application workloads. Many more modern environments, however, pursue a more granular and modular approach to application development. These approaches lend themselves to a microservices style of design, allow these container-based functions to be packaged and repackaged, and let deployments be relocated essentially at will.


In a truly cloud-based environment, orchestration becomes the issue. As adoption grows, the management of many containers becomes a bit clumsy, or even overwhelming. The tools from Kubernetes (originally a Google project, later donated to the Cloud Native Computing Foundation) make the management of its "Pods" (basic scheduling units) a bit less of a difficulty. These tools are regularly expanded and their functionality grows as part of the open source codebase: the community can access the primitives via tools like GitHub and add to, optimize, and enhance them, and these added functionalities are constantly being folded back in.


Open source is a crucial piece of the equation. If your organization is not pursuing the agile approach, the crowdsourced model of IT, then this concept is really not for you (a stance that, in my opinion, is closed-minded). But if you have begun delivering your code in parts and pieces, then you owe it to yourself to pursue a container approach. Transitions can present their own challenges, but the cool thing is that these new-paradigm approaches can be adopted gradually, the learning curve can be tackled, there is no real outlay for the software, and from a business perspective, the potential benefits on the journey to cloud, cloud-native, and agile IT are very real.


Do your research. This isn't necessarily the correct approach for every IT organization, but it may be for yours. Promote the benefits, get on board, and begin learning how your organization can change its methods to embrace this approach to IT management. You will not be sorry you did.


Some considerations must be addressed before making the decision to move forward:

  • Storage: Does your storage environment support containers? In the storage world, object-based storage is truly important.
  • Application: Is your app functional in a microservices/container-based form? Many legacy applications are much too monolithic to be supportable, while many new DevOps-type applications are far more amenable.


I'm sure that there are far more considerations.

This is a longer version of a talk I give at conferences and conventions. I would love to hear your responses, thoughts, and reactions in the comments below.


Do You Care About Being Constantly Connected?

For the next few minutes, I dare you to put down your phone, close up your laptop, and set aside your tablet. In fact, I double dog dare you. I've got $20 on the table that says you can't print this, find a quiet corner, and read it, away from any electronic interruptions in the form of beeps, pings, or tweets.


I. Triple. Dog. Dare. You.


Separating ourselves from our devices, and, more broadly, from the internet that feeds the devices that have become a lifeline for most of us, has been a topic of conversation for some time now. Recently, columnist Andrew Sullivan wrote a column for New York Magazine about coming to terms with his self-described addiction to technology. In "My Distraction Sickness: Technology Almost Killed Me", Sullivan provides sobering data for those of us who spend some (or most) of our days online:

  1. In just one minute, YouTube users upload 400 hours of video
  2. Tinder users swipe profiles over a million times
  3. Facebook users generate 1 billion likes every day
  4. In their regular SnapChatting career, a typical teen will post, receive, or re-post between 10,000 and 400,000 snaps
  5. A study published last year found that participants were using their phones for up to five hours a day…
  6. ... 85 separate times
  7. ... with most interactions lasting fewer than 30 seconds
  8. ... where users thought they picked up their phones half as often as they actually did
  9. Forty-six percent of the study subjects said they couldn't live without their phone


It’s important to recognize that we've arrived at this point in less than a decade. The venerable iPhone, the smartphone that launched a thousand other smartphones, debuted in 2007. Four years later, one-third of all Americans owned one. Today, that number is up to two-thirds. If you only count young adults, that figure is closer to eighty-five percent.


This all probably comes as a surprise to no one reading this (likely on a smartphone, tablet, or laptop). Equally unsurprising is that, intellectually, we know what our screens are pulling us away from.




In his essay "7 Important Reasons to Unplug and Find Space", columnist Joshua Becker wrote:


"Life, at its best, is happening right in front of you. These experiences will never repeat themselves. These conversations are unfiltered and authentic."


In that same article, Mr. Becker quotes the progenitor of smartphones and the digital patron saint of many IT pros, Steve Jobs, who said,


“We’re born, we live for a brief instant, and we die. It’s been happening for a long time. Technology is not changing it much – if at all.”


But it doesn't stop there. We should already understand what the studies are showing.


I want to be clear, though: this article is NOT about how bad it is to be connected. It would be disingenuous for me, someone who spends the majority of his day in front of a screen, to write only about how bad it is to be connected. Not to mention, it wouldn't be particularly helpful.


My goal is to make it clear why disconnecting, at times and for a significant amount of time, is measurably important to each of us and can have a very real impact on the quality of our life, both online and off.


The Secret Society

You've probably read essays suggesting you take a technology cleanse or a data diet, as if the bits and packets of your network have gotten impacted and are now backing up the colon of your brain.


If you have heard such suggestions, you may have responded with, "What kind of crazy wing nut actually does that?"


Now I’d like to share a little secret with you: I belong to a whole group of wing nuts who do this every week. We call this crazy idea Shabbat, or the Sabbath in English, and it is observed by Jews across the world.


(image courtesy Yehoshua Sofer)


Before I go any further, you should know that Judaism is not big on converting people so I'm not going to try to get anyone to join the tribe. I'm also not going to ask you to sign up for Amway.


On Shabbat, which begins at sundown Friday night and ends at sundown Saturday, anything with an ON switch is OFF limits. It can't be touched, moved, or changed. Yes, leaving the television set to SportsChannel 24 and just happening to walk past it every 10 minutes is cheating. And no, you don't sit in the dark and eat cold sandwiches. But I’ll talk more about that later.


But Shabbat only comes into play if you are one of the roughly 600,000 Jews in the United States (or 2.2 million worldwide) who fully observe the Sabbath. Which begs the question: if I'm not going to try to get YOU to be Jewish, where am I going with this?


In addition to being part of that crazy group of wing nuts, I've also worked in IT for 30 years. For almost a decade now, I've disconnected every single week, rain or shine, regardless of my job title, company, or on-call rotation. That has given me a unique perspective on tips, tricks, workarounds, and pitfalls.


So this is less of a you-should-unplug lecture, and more of a here’s HOW to unplug and not lose your job (or your marriage, or your mind) conversation.


Remember, this is just part one of a 3-part series. I'm looking forward to hearing your thoughts, suggestions, and ideas in the comments below!

The View From Above: James (CEO)


Another week, another network problem. On Tuesday morning I received an angry call from our CFO, Phyllis, who was visiting our Austin, TX site. The whole network is a mess, she told me, nothing is working properly and I can't do my job. I asked for more detail, but she just said the network was a nightmare and she couldn't even send emails. Great start to the day, especially as Austin is our main manufacturing plant, and if the network was as bad as Phyllis said it was, we were in for a bad week with our supply chain getting out of sync, which could negatively impact both our cashflow and our production output.


I called our new Senior Network Manager, Amanda, to let her know that the Austin office was down. She sounded surprised; apparently she had just been talking to the Inventory Management team, and they had been telling her that they were quite pleased with the performance of the company's inventory tool, especially given that it is based out of our data center in Raleigh, NC. I put her in touch with Phyllis and told her to figure out what was going on, because clearly things in Austin weren't going as great as she thought they were.


The View From The Trenches: Amanda (Sr Network Manager)


Two weeks have passed since I installed SolarWinds' Network Performance Monitor, and so far things have been good. I should have guessed that the quiet wouldn't last long, however. I got a call from James around 10AM on Tuesday, and he was mad. Apparently Phyllis was on site in Austin, TX and told him that the network was broken. I knew it wasn't; I was just talking to the Inventory Management team about a project to implement handheld (WiFi) scanners, and they've been testing their old wired scanners in parallel with the WiFi scanners, and both have been working just great, so hopefully both the wireless and wired networks are functioning ok. Still, if Phyllis is upset, it's more than my job's worth to ignore her.


Phyllis is without question good at her job, but I get the impression that she would be happier using a large paper ledger and a pot of ink (and maybe even a feather quill pen). Computers are, in her eyes, an irritation, and trying to troubleshoot her problems over the phone is challenging to say the least. However, after a while I did manage to figure out what the problem was. It turns out that "everything is down" actually meant "my email is working intermittently." About 9 months ago we moved our email to Microsoft's Office365, so the mail servers are now accessed via the Internet. I confirmed with Phyllis that she was able to access our intranet without issue, which confirmed that our site network was not the problem (I knew it!), but when she tried accessing the Internet -- including Office365 -- she was having problems. It wasn't a total loss of connectivity, but things were slow, and she would sometimes lose her connection to the server altogether. It sounded like an Internet issue, but what, and where?


Time to fire up a browser and open NPM. I checked the basics, but all the network hardware seemed fine, including our Internet routers and edge firewalls, so maybe it was something on the Internet itself. Unfortunately, I know how these things work; if I can't prove where the problem is, the assumption is still that it's the network at fault. As I stared at the screen, the phone rang; Phyllis was on the line. "I don't know why it took so long," she said, "but it looks like whatever you did worked. Finally I can get on with my day's work." And she hung up. Had she stayed on the line, I'm not sure I would have admitted that I'd done nothing, but at least the immediate pressure seemed to be off. But what caused the problem? And worse, now that the problem had cleared itself up, there weren't really any tests I could run to troubleshoot. At this point, I remembered NetPath.


When I installed NPM, I installed a bunch of probes and set up some monitoring of a number of services to see what it would look like. My idea was that I'd be able to monitor network performance from a few sites, but I got so consumed with setting up device monitoring I pushed that aside for a bit. In the background however, the probes had been faithfully gathering data for me about their connectivity to a number of key sites including -- by incredible good fortune -- the email service. I started off by checking what the NetPath traffic graph looked like right now, when data was successfully flowing to Office365. NetPath had identified that traffic seemed to pass through one of three potential service providers between our Austin site's internet provider and the Office365 servers on the Internet, with the vast majority (around 80%) likely to be sent through TransitCo, a large provider in Texas and the South Central states. At the bottom of the screen was the Path History bar, and it was clear to see that while everything was now green, there was a large chunk of red showing on the timeline for both availability and latency. Time to wind the clock back.


When I clicked one of the red blocks, the NetPath display updated and ... whoa ... OK, that explained it. TransitCo's router was lit up in red (along with some attached links), and NetPath was reporting 90% packet loss through that path, along with extremely high latency. No wonder Phyllis was having trouble staying connected! Data in hand, I called TransitCo to ask about the service interruption, and they confirmed that an interface had gone bad but that the routing engine had, for some reason, kept pumping traffic down that link. They had completed a reboot and an interface replacement around 30 minutes earlier, and service was restored. Amazing. Our own Internet provider wouldn't have reported this because it wasn't their problem, and there was no way we could sign up for alerts from every other provider just to keep abreast of outages. If we hadn't had this tool, I'd still be scratching my head wondering what on earth had happened that morning. While I figure out a way to get a better handle on upstream provider problems, at least I can now go back and report on the cause and scope of the outage. And maybe I can sell my VP on funding a secondary Internet link out of Austin from another provider, in case something like this happens again.


I've not even had it installed for a month, but SolarWinds NPM saved the day (or my reputation, at least). I think I'll be checking out what other products they have.



>>> Continue reading this story in Part 3

A successful help desk seeks to solve incidents quickly, find resolutions to persistent problems, and keep end-users happy. The help desk is the first line of defense, triaging tickets and working directly with end-users to fix their technical problems, and this is no easy task.


In order to keep ticket queues low and morale high, help desk managers should consider these three key principles:


1) Dedicated People

2) Established Processes

3) Centralized Information


Dedicated people is the first key principle.


A help desk doesn’t necessarily need senior-level engineers with advanced degrees and 10 years’ experience. Instead, a solid first line of defense requires a team of hard workers who know how to locate information in internal repositories and how to Google solutions to weird Windows and printer issues. The key here is hard work and dedication. I don’t mean dedication to showing up on time, necessarily, though that’s certainly important. I mean dedication to getting the issue at hand resolved.


For example, during my first year in IT, I worked on a help desk serving a large government agency. We had hundreds of new tickets in the queue every day. My co-worker, Don, made it his simple goal to close as many tickets per week as he could. Don was already in his 30s and had changed careers from restaurant management, so he didn’t have decades of experience along with advanced computer science degrees and industry certifications. What he did have was a sheer determination to figure out an issue and get the problem fixed. Our end-users loved him and often asked for him specifically. He browsed through our internal wikis and Googled his life away looking for a way to fix an issue, and nearly every time he eventually figured it out.


This is what a good help desk needs: people who know how to do basic online research and are dedicated to sticking with an issue until it’s resolved.


Having clear, established processes is the second key principle.


My friend Don would have had a much harder time resolving tickets without processes in place to enable him to get the job done. For example, a service desk manager must determine how tickets will be logged and organized, how they will be triaged, how they will be escalated, and how to get information to help desk technicians quickly so they can solve new tickets as they come in.

In my experience, this means first finding the right ticket management system. Whether it’s in the cloud or on local servers, a solid ticket management system makes it easy for end-users to submit tickets and for the service desk to organize, triage, and resolve them. I personally prefer a single source of truth, in which the ticketing system is not only a way to organize tickets but also an information repository and a method of communicating with end-users. That way, technicians can log in to one system and find everything they need to get the job done. Navigating multiple systems and many windows is a sure-fire way to forget (or ignore) tickets and to spend far too much time looking up simple information such as license keys or asset locations.


Another important part of clear, established help desk processes is accountability. It must be built into the processes, not just assumed. Tickets get lost, and sometimes they’re ignored. That may be because the help desk is dealing with a huge number of tickets and too few people, but I’ve seen many tickets ignored because they were difficult, long-winded, or because the end-user was a well-known jerk.


Rather than have tickets come from end-users into a general queue, consider having them all go first to a help desk manager or team lead who can quickly triage each one and assign it to the appropriate technician. I have seen struggling service desks go from zero to hero by implementing just this one simple process.
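As a rough illustration, the triage step boils down to a routing table that maps ticket categories to owners, with a human fallback for anything unrecognized. This is a minimal sketch, not any particular product's workflow; the categories and technician names are invented:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category-to-technician routing table -- adjust to your own team.
ROUTING = {
    "password_reset": "don",
    "printer": "alice",
    "network": "bob",
}

@dataclass
class Ticket:
    subject: str
    category: str
    assignee: Optional[str] = None

def triage(ticket: Ticket, fallback: str = "team_lead") -> Ticket:
    """Assign the ticket to whoever owns its category, falling back
    to the team lead for anything unrecognized."""
    ticket.assignee = ROUTING.get(ticket.category, fallback)
    return ticket

# A printer ticket lands with its owner; an unknown category goes
# to the team lead for a human decision.
print(triage(Ticket("Can't print to HR printer", "printer")).assignee)  # alice
print(triage(Ticket("Weird billing app error", "billing")).assignee)    # team_lead
```

The point of the fallback is that nothing ever sits unassigned: every ticket has a named owner from the moment it arrives.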


A decent ticketing system will have escalation timers, auto-responders, and many other built-in tools to automate workflow, but don’t rely on the software alone to maintain order. This is a top-down process that begins with help desk managers and team leads.
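Under the hood, an escalation timer is just a comparison of a ticket's age against an SLA window for its priority. A minimal sketch of the idea, with invented priorities and windows (not taken from any particular ticketing product):

```python
from datetime import datetime, timedelta

# Hypothetical SLA windows per priority -- tune these to your own help desk.
ESCALATE_AFTER = {
    "high": timedelta(hours=1),
    "normal": timedelta(hours=8),
    "low": timedelta(days=3),
}

def needs_escalation(opened_at: datetime, priority: str, now: datetime) -> bool:
    """True if a ticket has sat unresolved past its priority's SLA window.
    Unknown priorities are treated as 'normal'."""
    window = ESCALATE_AFTER.get(priority, ESCALATE_AFTER["normal"])
    return now - opened_at > window

now = datetime(2016, 12, 5, 12, 0)
# A two-hour-old high-priority ticket is past its 1-hour window...
print(needs_escalation(datetime(2016, 12, 5, 10, 0), "high", now))    # True
# ...while the same age is fine for a normal-priority ticket.
print(needs_escalation(datetime(2016, 12, 5, 10, 0), "normal", now))  # False
```

A real system would run this check on a schedule and notify the team lead, which is exactly where the human, top-down part of the process takes over.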


Maintaining a centralized, updated information repository is the last key principle.


Let’s face it, most companies use Windows computers for their end-users. Yes, I know there are exceptions, but even Apple devices and various flavors of Linux are not custom-built operating systems that no one has ever heard of. That means many end-user issues are not unique to any one company. What is unique is the company-specific knowledge.


What are the IP addresses of the domain controllers? Where is the installation file for the billing software kept? Does the new branch office use a Windows DHCP server or is it running off their core switch?


Having a centralized repository of information is priceless to a help desk technician. Better yet is when the repository is also the ticket management system, and better still is when it also contains documentation on how to solve recurring issues or how to install weird company software.


In my first job as a network engineer, I worked next to the service desk, whose team sat in the adjacent cubicle area. As the number of customers grew, so did the number of technicians, and so did the amount of information needed to resolve tickets. We used a great ticket management system and kept as much information as possible in it. We also used an internal wiki, but to get to it you had to follow a link embedded in the ticketing system.


That team was able to support several thousand end-users with only three technicians and one service desk manager. So important were these principles that if anyone discovered that information that should have been in the database wasn’t, whoever was responsible for getting it there had to bring in donuts for the entire office. Yes, I brought in donuts a couple of times, and so did our service desk manager and even the owner of the company.


Volumes could be written on how to provide successful end-user support, and these three principles are broad; I’ve seen them implemented in very different ways. But so long as you have dedicated people, clear processes, and an up-to-date information repository, the help desk will be the successful first line of defense every CIO and Director of IT dreams of.





