The Private Cloud

Posted by arjantim Jul 30, 2016

In a private cloud model, your IT department controls a secure, dedicated cloud environment in which to manage your resources. The difference from public cloud is that the pool of resources is accessible only to you, which makes management much easier and more secure.


So, if you require dedicated resources, whether for performance, control, security, compliance or any other business reason, a private cloud might just be the right solution for you.


More and more organisations are looking for the flexibility and scalability of cloud solutions. But many of them believe that business and regulatory requirements keep them from being good candidates for public or private cloud offerings.


It may be that you work within a highly regulated environment that is not suitable for public cloud, and you don't have the internal resources to set up or administer a suitable private cloud infrastructure. On the other hand, it might just be that you have industry-specific performance requirements that the public cloud can't yet meet.


In those cases, a private cloud can be a great alternative to the public cloud. A private cloud enables the IT department, as well as the applications themselves, to access IT resources as they are required, while the datacentre itself runs in the background. All services and resources used in a private cloud are defined in systems that are accessible only to the user and are secured against external access. The private cloud offers many of the advantages of the public cloud while minimising the risks. Unlike in many public clouds, the criteria for performance and availability in a private cloud can be customised, and compliance with these criteria can be monitored to ensure that they are achieved.


For a cloud or enterprise architect, a couple of things are very important in the cloud era. You should know your application (stack) and the way it behaves. By knowing what your application needs, you can determine which parts of the application could be placed where: private or public. A good way to make sure you know your application is to use the DART principle:


  • Discover - Show me what is going on
  • Alert - Tell me when it breaks or is going bad
  • Remediate - Fix the problem
  • Troubleshoot - Find the root cause
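As a sketch, the DART cycle can be read as a simple monitoring loop. Everything below (metric names, thresholds, actions) is illustrative, not taken from any particular tool:

```python
def discover(app):
    """Discover: show me what is going on - collect current metrics."""
    return {"cpu_pct": app.get("cpu_pct", 0), "latency_ms": app.get("latency_ms", 0)}

def alert(metrics, thresholds):
    """Alert: tell me when it breaks or is going bad."""
    return [name for name, value in metrics.items()
            if value > thresholds.get(name, float("inf"))]

def remediate(breaches):
    """Remediate: map each breach to a (hypothetical) corrective action."""
    actions = {"cpu_pct": "scale out", "latency_ms": "restart slow service"}
    return [actions.get(b, "escalate") for b in breaches]

def troubleshoot(breaches):
    """Troubleshoot: note what to investigate for each breach."""
    return {b: "collect traces and logs" for b in breaches}

app = {"cpu_pct": 95, "latency_ms": 40}
metrics = discover(app)
breaches = alert(metrics, {"cpu_pct": 80, "latency_ms": 200})
print(remediate(breaches))
```

The point isn't the code itself, but that each of the four steps forces you to state what "normal" looks like for your application before you decide where to run it.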



If you run the right tools within your environment, it should be easy to discover what is going on, where certain bottlenecks are, how your application is behaving, and what its requirements are. Once you know that, the step to hybrid is much easier to make. But that's for another post; first I'll dive a little further into public cloud next time.

Hybrid IT is used to cover all manner of IT-isms, especially those that span services an IT organization is delivering and services being delivered by someone outside of the IT organization. The technology constructs that are present in the current IT state, where services are continually delivered, integrated, and consumed on any device at any given time, are giving rise to hybrid IT adoption. The challenge for IT professionals is to unlock the potential of Hybrid IT without getting caught up in the churn and burn scenario of tech hype and tech debt. IT rigor and discipline must be part of the equation. And this is where monitoring as a discipline comes into play.


At BrightTALK’s Cloud and Virtualization Summit, I presented on monitoring as a discipline as the key to unlocking hybrid IT’s potential. The recording is available to view on BrightTALK’s website and it's hyperlinked below.




Let me know what you think of it in the comment section below.


(Zen Stones by Undeadstawa on DeviantArt)


Over the years, I've observed that despite running multiple element and performance management systems, most organizations still don't truly understand their IT infrastructure. In this post I'll examine how it's possible to have so much information on hand yet still have a large blind spot.




What does discovery mean to you? For most of us I'm guessing that it involves ICMP pings, SNMP community strings, WMI, login credentials and perhaps more in an attempt to find all the manageable devices that make up our infrastructure: servers, hypervisors, storage devices, switches, routers and so forth. We spin up network management software, perhaps a storage manager, virtualization management, performance management, and finally we can sleep safely knowing that we have full visibility and alerting for our compute, storage and networking infrastructure.


At this point I'd argue that the infrastructure discovery is actually only about 50% complete. Why? Because the information gathered so far provides little or no data that can be used to generate a correlation between the elements. By way of an analogy you could say that at this point all of the trees have been identified, labeled and documented, but we've yet to realize that we're standing in the middle of a forest. To explain better, let's look at an example.


Geographical Correlation

Imagine we have a remote site at which we are monitoring servers, storage, printers and network equipment. The site is connected back to the corporate network using a single WAN link, and—horrifyingly—that link is about to die. What do the monitoring systems tell us?


  • Network Management: I lost touch with the edge router and six switches.
  • Storage Management: I lost touch with the storage array.
  • Virtualization Management: I lost touch with these 15 VMs.
  • Performance Management: These elements (big list) are unresponsive.


Who monitors those systems? Do the alerts all appear in the same place, to be viewed by the same person? If not, that's the first issue, as spotting the (perhaps obvious) relationship between these events requires a meat-bag (human) to realize that if storage, compute and network all suddenly go down, there's likely a common cause. If this set of alerts went in different directions, in all likelihood the virtualization team, for example, might not be sure whether their hypervisor went down, a switch died, or something else, and they may waste time investigating all those options in an attempt to access their systems.

Centralize your alert feeds

Suppressing Alerts

If all the alerts are coming into a single place, the next problem is that in all likelihood the router failure event led to the generation of a lot of alerts at the same time. Looking at it holistically, it's pretty obvious that the real alert should be the loss of a WAN link; everything else is a consequence of losing the site's only link to the corporate network. Personally in that situation, I'd ideally like the alert to look like this:


2016/07/28 01:02:03.123 CRITICAL: WAN Node <a.b.c.d> is down. Other affected downstream elements include (list of everything else).


This isn't a new idea by any means; alert suppression based on site association is something that we should all strive to achieve, yet so many of us fail to do so. One of the biggest challenges with alert monitoring is being overwhelmed by a large number of messages; a poor signal-to-noise ratio makes it impossible to see the important information. This is a topic I will come back to, but let's assume it's a necessary evil.
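The kind of suppression described above can be sketched with a simple site-to-WAN-node dependency map. The element names, sites, and message format below are made up for illustration, not drawn from any specific product:

```python
def suppress(alerts, site_of, wan_nodes):
    """alerts: set of element names that are down.
    site_of: element -> site; wan_nodes: site -> its single WAN edge node.
    Collapse each unreachable site into one root-cause alert."""
    out, suppressed = [], set()
    for site, wan in wan_nodes.items():
        if wan in alerts:                      # WAN edge down: everything behind it is noise
            suppressed.add(site)
            downstream = sorted(e for e in alerts
                                if site_of[e] == site and e != wan)
            out.append(f"CRITICAL: WAN Node {wan} is down. "
                       f"Affected downstream elements: {', '.join(downstream)}")
    for e in sorted(alerts):                   # anything else alerts normally
        if site_of[e] not in suppressed:
            out.append(f"ALERT: {e} is down")
    return out

site_of = {"rtr-edge": "remote1", "sw1": "remote1",
           "array1": "remote1", "core-sw": "hq"}
wan_nodes = {"remote1": "rtr-edge"}
print(suppress({"rtr-edge", "sw1", "array1"}, site_of, wan_nodes))
```

Three raw alerts become one actionable one; the downstream elements are still listed, just not shouting on their own.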

Suppress unnecessary alert noise

Always On The Move

In addition to receiving several hundred alerts from the devices impacted by the WAN failure, now it seems the application team is troubleshooting an issue with the e-commerce servers. The servers themselves seem fine, but the user-facing web site is generating an error when trying to populate shipping costs during the checkout process. For some reason the call to the server calculating shipping costs isn't able to connect, which is odd because it's based in the same datacenter as the web servers.


The security team is called in and begins running a trace on the firewall, only to confirm that the firewall is correctly permitting a session from the e-commerce server to an internal address on port tcp/5432 (postgres).


The network team is called in to find out why the TCP session to shipsrv01.ecomm.myco.corp is not establishing through the firewall, and they confirm that the server doesn't seem to respond to ping. Twenty minutes later, somebody finally notices that the IP returned for shipsrv01.ecomm.myco.corp is not in the local datacenter. Another five minutes later, the new IP is identified as being in the site that just went down; it looks like somebody had moved the VM to a hypervisor in the remote site, presumably by mistake, when trying to balance resources across the servers in the data center. Nobody realized that the e-commerce site had a dependency on a shipping service that was now located in a remote site, so nobody associated the WAN outage with the e-commerce issue. Crazy. How was anybody supposed to have known that?
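A small sanity check along these lines would have caught the move in minutes: resolve the service's hostname and verify the address actually sits in the local datacenter's range. The subnets below and the resolver stand-in are assumptions for illustration (in real life you'd call `socket.gethostbyname`):

```python
import ipaddress

# Assumed address plan, purely illustrative
LOCAL_DC = ipaddress.ip_network("10.1.0.0/16")
REMOTE_SITES = {"remote1": ipaddress.ip_network("10.9.0.0/16")}

def locate(hostname, resolve):
    """resolve: hostname -> IP string; returns where that address lives."""
    addr = ipaddress.ip_address(resolve(hostname))
    if addr in LOCAL_DC:
        return "local"
    for site, net in REMOTE_SITES.items():
        if addr in net:
            return f"remote site {site}"
    return "unknown"

# Simulated answer: the VM moved, so DNS now returns a remote-site address.
print(locate("shipsrv01.ecomm.myco.corp", lambda h: "10.9.3.17"))
```

Run periodically against a list of known service dependencies, a check like this turns "nobody realized the shipping service moved" into an alert instead of a twenty-minute hunt.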


It seems that despite having all those management systems I'm still a way from having true knowledge of my infrastructure. When I post next, I'll look at some of the things I'd want to do in order to get a better and more holistic view of my network so that I can embrace the inner peace I desire so much.

Just when you thought 2016 couldn't get crazier you wake up to find that Verizon has bought Yahoo and that you are more interested in reading about the drone that delivered a Slurpee. Welcome to my world.


Here are the items I found most amusing from around the Internet. Enjoy!


Verizon to Purchase Yahoo’s Core Business for $4.8 Billion

I'm shocked Yahoo is worth even that much. I'm also hoping that someone will give me $57 million to do nothing.


Canadian Football League Becomes First Pro Football Organization To Use Sideline Video During Games

As our technology advances at an ever increasing pace, and is applied in new situations, it is up to someone in IT to make it all work. It's all about the data folks, as data is the most valuable asset any company (or team) can own.


Nearly Half of All Corporate Data is Out of IT Department’s Control

Honestly, I think that number is much higher.


GOP delegates suckered into connecting to insecure Wi-Fi hotspots

I am certain the GOP leaders were tech savvy enough not to fall for this trick, right?


Snowden Designs a Device to Warn if Your iPhone’s Radios Are Snitching

Showing what he's been doing with his free time while living in exile, Snowden reveals how our phones have been betraying us for years.


Status Report: 7 'Star Trek' Technologies Under Development

With the release of the new Star Trek movie last week I felt the need to share at least one Star Trek link. But don't get your hopes up for warp drive or transporters anytime soon.


I wanna go fast: HTTPS' massive speed advantage

"If you wanna go fast, serve content over HTTPS using HTTP/2."


Watch The First Slurpee Delivery By Drone

Because who doesn't love a Slurpee in the summertime?


Meanwhile, in Redmond:


I need to deep fry a turbaconducken.


This isn't a want, no. This is a primal need of mine.


I feel so strongly about this that it's on my bucket list. It is positioned right below hiring two private investigators to follow each other, and right above building an igloo with the Inuit.


Deep frying a turkey is a dangerous task. You can burn your house down if you are not careful. Why take the risk? Because the end result, a crispy-juicy turkey bathed in hot oil for 45 minutes, is worth the effort. Or so I've been told. Like I said, it's on my bucket list.


Being the good data professional that I am, I started planning out how to prepare for the day that I do, indeed, deep fry my own turkey. As I laid out my plans it struck me that there was a lot of similarity between an exploding turkey and the typical "database is on fire" emergency many of us know all too well.


So here's my list for you to follow for any emergency, from exploding turkeys to databases catching fire and everything in between. You're welcome.


Don't Panic


People who panic are the same people who are not prepared. A little bit of planning and preparation go a long way to helping you avoid "panic mode" in any emergency situation. Whenever I see someone panicking (like ripping out all their network cables just because their mouse isn't working) it is a sure sign that they have little to no practical experience with the situation at hand.


Planning will keep you from feeling the need to panic. If your database is on fire you can recover from backups, because you prepared for such a need. And if your turkey explodes you can always go to a restaurant for a meal.


Rely on all your practice and training (you have practiced this before, right?). Emergency response people train in close-to-real-life situations, often. In fact, firefighters even pay people to burn down their spare barns.


You do have a checklist, right? And a process to follow? If not, you may find yourself in a pile of rubble, covered in glitter.


Assess the Situation


Since you aren't panicking you are able to calmly assess the situation. A turkey on fire inside your oven would require a different response than a turkey that explodes in a fireball on your deck and is currently burning the side of your house. Likewise, an issue with your database that affects all users will require a different set of troubleshooting steps than an issue affecting only some users or queries.


In order to do a proper assessment of the situation you will be actively gathering data. For database servers you are likely employing some type of monitoring and logging tools. For turkeys, it's likely a thermometer to make certain it has completely thawed before you drop it into the hot oil.


You also need to know your final goal. Perhaps your goal is to stop your house from being engulfed in flames. Perhaps your goal is to get the systems back up and running, even if it means you may have some data loss.


Not every situation is the same. That's why a proper assessment is necessary when dealing with emergencies...and you can't do that while in a panic.


Know Your Options


Your turkey just exploded after you dropped it into a deep fryer. Do you pour water on the fire quickly? Or do you use a fire extinguisher?


Likewise, if you are having an issue with a database server should you just start rebooting it in the hopes that it clears itself up?


After your initial assessment is done you should have a handful of viable options to explore. You need to know the pros and cons of each of these options. That's where the initial planning comes in handy, too. Proper planning will reduce panic and allow you to assess the situation; then you can understand all your viable options along with their pros and cons. See how all this works together?


It may help for you to phone a friend here. Sometimes talking through things can help, especially when the other person has been practicing and helping all along.


Don't Make Things Worse


Pouring water on the grease fire on your deck is going to make the fire spread more quickly. And running 17 different DBCC commands isn't likely to make your database issue any better, either.


Don't be the person that makes things worse. If you are able to calmly assess the situation, and you know your options well, then you should be able to make an informed decision that doesn't make things worse. Also, don’t focus on blame. Now isn't the time to worry about blame. That will come later. If you focus on fault, you aren’t working on putting out the fire right now. You might as well grab a stick and some marshmallows for making s’mores while your house burns to the ground.


Also, a common mistake here, specifically with database issues, is trying to do many things at once. If you make multiple changes you may never know what worked, or the changes may cancel each other out, leaving you with a system that's still offline. Know the order of the actions you want to take and do them one at a time.


And it wouldn't hurt you to take a backup now, before you start making changes, if you can.


Learn From Your Mistakes


Everyone makes mistakes, I don't care what their marketing department may tell you. Making mistakes isn't as big of a concern as not learning from your mistakes. If you burned your house down the past two Thanksgivings, don't expect a lot of people showing up for dinner this year.


Document what you’ve done, even if it is just a voice recording.  You might not remember all the details afterwards, so take time to document events while they are still fresh in your memory.


Review the events with others and gather feedback along the way as to how things could have been better or avoided. Be open to criticism, too. There's a chance the blame could be yours. If that's the case, accept that you are human and lay out a training plan that will help you to avoid making the same mistake in the future.


I'm thankful that my database server isn't on fire. But if it was, I know I'd be prepared.



Many agencies are already practicing excellent cyber hygiene; others are still in implementation phases. Regardless of where you are in the process, it is critical to understand that security is not a one-product solution. Having a solid security posture requires a broad range of products, processes and procedures.


Networks, for example, are a critical piece of the security picture; agencies must identify and react to vulnerabilities and threats in real time. You can implement automated, proactive security strategies that will increase network stability and have a profound impact on the efficiency and effectiveness of the overall security of the agency.


How can agencies leverage their networks to enhance security? Below are several practices you can begin to implement today, as well as some areas of caution.


Standardization. Standardizing network infrastructure is an often-overlooked method of enhancing network performance and security.


Start by reviewing all network devices and ensure consistency across the board. Next, make sure you’ve got multiple, well-defined networks. Greater segmentation will provide two benefits: greater security, as access will not necessarily be granted across each unique segment, and greater ability to standardize, as segments can mimic one another to provide enhanced control.


Change management. Good change management practices go a long way toward enhanced security. Specifically, software that requires a minimum of two unique approvals before changes can be implemented can prevent unauthorized changes. In addition, make sure you fully understand the effect changes will have across the infrastructure before granting approval.


Configuration database. It’s important to have a configuration database for backups, disaster recovery, etc. If you have a device failure, being able to recover quickly can be critical; implementing a software setup that can do this automatically can dramatically reduce security risks. Another security advantage of a configuration database is the ability to scan for security-policy compliance.
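As a minimal sketch of that idea, here's a toy configuration database that keeps per-device snapshots and flags drift against the last backup. A real tool would pull configs from devices over SSH or SNMP; here they're plain strings, and the device name and config lines are invented:

```python
import difflib

class ConfigDB:
    def __init__(self):
        self.history = {}            # device name -> list of config snapshots

    def backup(self, device, config):
        """Store a new snapshot for a device."""
        self.history.setdefault(device, []).append(config)

    def drift(self, device, current):
        """Unified diff between the last backup and the running config.
        An empty result means no unauthorized (or unrecorded) changes."""
        last = self.history.get(device, [""])[-1]
        return list(difflib.unified_diff(last.splitlines(), current.splitlines(),
                                         "backup", "running", lineterm=""))

db = ConfigDB()
db.backup("rtr-edge", "hostname rtr-edge\nsnmp-server community public")
changes = db.drift("rtr-edge", "hostname rtr-edge\nsnmp-server community s3cret")
print(changes)
```

The same snapshot store is what makes the other two benefits possible: fast restore after a device failure, and a single place to scan every config for policy compliance.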


Compliance awareness. Compliance can be a complicated business. Consider using a tool that automates vulnerability scanning and FISMA/DISA STIG compliance assessments. Even better? A tool that also automatically sends alerts of new risks by tying into the NIST NVD, then checking that information against your own configuration database.


Areas of caution:

Most security holes are related to inattention to infrastructure. In other words, inaction can be a dangerous choice. Some examples are:


Old inventory. Older network devices inherently have outdated security. Invest in a solution that will inventory network devices and include end-of-life and end-of-support information. This also helps forecast costs for new devices before they quit or become a security liability.


Not patching. Patching and patch management are critical to security. Choose an automated patching tool to be sure you're staying on top of this important task.


Unrestricted bring-your-own-device policies. Allow BYOD, but with restrictions. Separate the unsecure mobile devices on the network and closely monitor bandwidth usage so you can make changes on the fly as necessary.


There is no quick-and-easy solution, but tuning network security through best practices will not only enhance performance, but will also go a long way toward reducing risks and vulnerabilities.


Find the full article on Government Computer News.

In my previous post, I listed some best practices for help desk IT pros to follow to save time resolving issues. The responses I received from that post made me realize that the best solution for one IT organization may not necessarily be the same for another. An organization’s size, business model, functional goals, organizational structure, etc. create unique challenges for those charged with running the help desk function, and these factors directly affect IT support priorities.


With this knowledge in mind, I decided to take a different approach for this post. Below, I have listed some of the easy ways that help desk organizations – irrespective of their differences – can improve their help desk operations through automation to create a chaos-free IT support environment.


  1. Switch to centralized help desk ticketing
    Receiving help desk requests from multiple channels (email, phone, chat, etc.), and manually transferring them onto a spreadsheet creates a dispersed and haphazard help desk environment. Switching to a centralized help desk ticketing system will help you step up your game and automate the inflow of incidents and service requests.
  2. Automate ticket assignment and routing
    Managing help desk operations manually can lead to needless delays in assigning tickets to the right technician, and potential redundancy if you happen to send the same requests to multiple technicians. To avoid this, use a ticketing system that helps you assign tickets to technicians automatically, based on their skill level, location, availability, etc.
  3. Integrate remote support with help desk
    With more people working remotely, traditional help desk technicians have to adapt and begin to resolve issues without face-to-face interactions. Even in office settings, IT pros tend to spend about 30% of their valuable time visiting desks to work on issues. By integrating a remote support tool into your help desk, you can resolve issues remotely, taking care of on- and off-site problems with ease.
  4. Resolve issues remotely without leaving your desk
    A recent survey by TechValidate states that 77% of surveyed help desk technicians feel that using remote support decreased their time-to-resolution of trouble tickets. Using the right remote support tool helps you easily troubleshoot performance issues and resolve complex IT glitches without even leaving your desk.
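One way to picture the automatic-assignment rule from step 2: pick an available technician with the required skill and the fewest open tickets. The field names and sample data below are assumptions for illustration, not any product's schema:

```python
def assign(ticket, technicians):
    """Return the name of the best technician for a ticket, or None to escalate."""
    candidates = [t for t in technicians
                  if t["available"] and ticket["skill"] in t["skills"]]
    if not candidates:
        return None                          # no match: route to a queue manager
    # Least-loaded wins, so work spreads evenly instead of piling on one person
    return min(candidates, key=lambda t: t["open_tickets"])["name"]

techs = [
    {"name": "Ana",  "skills": {"network", "vpn"}, "available": True,  "open_tickets": 4},
    {"name": "Ben",  "skills": {"network"},        "available": True,  "open_tickets": 1},
    {"name": "Cruz", "skills": {"network"},        "available": False, "open_tickets": 0},
]
print(assign({"skill": "network"}, techs))   # → Ben
```

Real ticketing systems layer location and priority on top of this, but the core rule is the same: match on skill, filter on availability, balance on load.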


These are some of the simple yet powerful ways that organizations can create a user-friendly help desk.

Are you managing your help desk the hard way or the easy way?

(Infographic: Four Easy Ways to Create a Chaos-free Help Desk)


To download this infographic, click here. Share your thoughts on how you reduce workload and simplify help desk support in the comments section.

"Sore throat from talking, sore feet from walking, sore face from smiling. Must be @CiscoLive."


Before I dig in to what I saw, as well as what I think about what I saw, I have to take a moment to shout out my gratitude and thanks to the amazing SolarWinds team that assembled for this convention. In fact, I had so much shouting to do that I wrote a whole separate post about it that you can read here. I hope you'll take a moment to share in the sheer joy of being part of this team.


But remember to come back here when you're done.


Okay, you're back now? Let's dive in!


CLUS is About Connections

As much as you may think Cisco Live is about networking (the IT kind), technology, trends, and techniques, the reality is that much of the attraction for a show like this is in making personal connections with groups of like-minded folks. While that's true of most conventions, Cisco Live, with 27,000 attendees this year, offers a larger quantity and wider variety of people with the same range of experience, focus, and background as you. You can sit at a table full of 33-year-old voice specialists who started off as Linux® server admins. It might be a bit of a trick to find them in the sea of humanity, but they are there. Usually they are attending the same sessions you are; you just have to look around.


Beyond the birds-of-a-feather aspect, Cisco Live gives you a chance to gather with people who share your particular passion - whether it's for a brand, technology, or technique - and learn what's in store in the coming months.


And what would geek-based IT culture be if all of that social interaction didn't include some completely off-the-wall goofiness? An offhand joke last year blossomed into the full-fledged #KiltedMonday, with dozens (if not hundreds) of attendees sporting clan colors and unshaved legs to great effect.


Speaking of legs, many people's legs were also festooned with a riot of colors, as #SocksOfCLUS, also started to take hold. You might even say it got a leg up on the convention this year. (I'll be here all week, folks.)



The Rise of DevNet

During the show, news broke that cloud-based development environment Cloud9 had been acquired by Amazon Web Services® (AWS), which prompted my fellow Head Geek Patrick Hubbard to tweet:

Next, @awscloud grabs @Cloud9. One day all will be #AWS and #Azure. Learn #DevOps fellow geeks.


That truth was already clearly being embraced by the folks at Cisco®.


Over the last three Cisco Live events I've attended, the area devoted to DevNet - the network engineer-flavored DevOps - has grown substantially. It's expanded from a booth or two at Cisco Live 2015, to a whole section of the floor in Berlin 2016, to a huge swath of non-vendor floor space last week. Two dozen workstations were arranged around a model train, and attendees were encouraged to figure out ways to code the environment to change the speed and direction of the train. Fun!


Meanwhile, three separate theaters ran classes every hour on everything from programming best practices to Python deep dive tutorials.


I found this much more engaging and effective than the usual statements that network engineers need to learn to code because of the pressure of a SPECIFIC technology (say, SDN). While that might be true, I much prefer that the code be presented FIRST, and then let IT pros figure out what cool things we want to do with it.




Still Figuring IT Out

It is clear that the convention is both emotionally and financially invested in the evolving trends of SDN, IoT, and cloud/hybrid IT. Vast swaths of the show floor are dedicated to them, and to showing off the various ways they might materialize in real, actual data centers and corporate networks.


But the fact is that none of those things are settled yet. Settling, perhaps. Which is fine. I don't need a convention (much less a single vendor) to announce, "And lo, it was good" about any technology that could change the landscape of infrastructures everywhere.


For things like SDN and IoT, Cisco Live is the place you go once a year to check in, see how the narrative has changed, and elbow the person next to you at the session or display and say, "So, are you doing anything with this? No? Me, neither.”


The View from the Booth

Back at Starship SolarWinds (aka our booth), the story was undeniably NetPath™. People sought us out to see it for themselves, or were sent by others (either back at the office or on the show floor) to come check it out. The entire staff demonstrated the belle of the NPM 12 ball constantly throughout the day, until the second day, when we had booth visitors start demo-ing it THEMSELVES to show OTHER visitors (who sometimes turned out to be people they didn't even know). The excitement about NetPath was that infectious.


We also witnessed several interactions where one visitor would assure another that the upgrade was painless. We know that hasn’t always been the case, but seeing customers brag to other customers told us that all of our due diligence on NPM 12, NCM 7.5, NTA 4.2, and SRM 6.3 (all of which came out at the beginning of June) was worth all the effort.


Not that this was the only conversation we had. The new monitoring for stacked switches was the feature many visitors didn't know they couldn't live without, but left texting their staff to schedule the upgrade for. The same goes for Network Insight - the AppStack-like view that gives a holistic perspective on load balancers like F5®s and the pools, pool members, and services they provide.


We also had a fair number of visitors who were eager to see how we could help them solve issues with automated topology mapping, methods for monitoring VDI environments, and techniques to manage the huge volume of trap and syslog messages that larger networks generate.


And, yes, those are all very network-centric tools, but this is Cisco Live, after all. That said, many of us did our fair share of showing off the server and application side of the house, including SAM, WPM, and the beauty that is the AppStack view.


Even more thrilling for the SolarWinds staff were the people who came back a second time, to tell us they had upgraded THAT NIGHT after visiting us and seeing the new features. They didn’t upgrade to fix bugs, either. They couldn’t live another minute without NetPath, Network Insight (F5 views), switch stack monitoring, NBAR2 support, binary config backup, and more.


We all took this as evidence that this was one of the best releases in SolarWinds history.


CLUS vs SWUG, the Battle Beneath the Deep

In the middle of all the pandemonium, several of us ran off to the Shark Reef to host our first ever mini-SWUG, a scaled-down version of the full-day event we've helped kick off in Columbus, Dallas, Seattle, Atlanta, and, of course, Austin.


Despite the shortened time frame, the group had a chance to get the behind-the-scenes story about NetPath from Chris O'Brien; to find out how to think outside the box when using SolarWinds tools from Destiny Bertucci (giving them a chance to give a hearty SWUG welcome to Destiny in her new role as SolarWinds Head Geek); and hear a detailed description of the impact NetPath has had in an actual corporate environment from guest speaker Chris Goode.


The SolarWinds staff welcomed the chance to have conversations in a space that didn't require top-of-our-lungs shouting, and to have some in-depth and often challenging conversations with folks that had more than a passing interest in monitoring.


And the attendees welcomed the chance to get the inside scoop on our new features, as well as throw out curveballs to the SolarWinds team and see if they could stump us.



Cisco Live was a whirlwind three days of faces, laughter, ah-ha moments, and (SRSLY!) the longest walk from my room to the show floor INSIDE THE SAME HOTEL that I have ever experienced. I returned home completely turned around and unprepared to get back to work.


Which I do not regret even a little. I met so many amazing people, including grizzled veterans who’d earned their healthy skepticism, newcomers who were blown away by what SolarWinds (and the convention) had to offer, and the faces behind the twitter personas who have, over time, become legitimate friends and colleagues. All of that was worth every minute of sleep I lost while I was there.


But despite the hashtag above, of course I have regrets. Nobody can be everywhere at once, and a show like Cisco Live practically requires attendees to achieve a quantum state to catch everything they want to see.

  • I regret not getting out of the booth more.
  • Of course, THEN I'd regret meeting all the amazing people who stopped in to talk about our tools.
  • I regret not catching Josh Kittle's win in his Engineering Deathmatch battle.
  • I regret not making it over to Lauren Friedman's section to record my second Engineers Unplugged session.
  • And I regret not hearing every word of the incredible keynotes.


Some of these regrets I plan to resolve in the future. Others may be an unavoidable result of my lack of mutant super-powers allowing me to split myself into multiple copies. Which is regrettable, but Nightcrawler was always my favorite X-Man, anyway.


#SquadGoals for CLUS17:

Even before we departed, several of us were talking about what we intended to do at (or before) Cisco Live 2017. Top of the list for several of us was to be ready to sit for at least one certification exam at Cisco Live.


Of course, right after that our top goal was to learn how to pace ourselves at the next show.


Somehow I think one of those goals isn't going to make it.




Congratulations! You are our new DBA!


Bad news: You are our new DBA!


I'm betting you got here by being really great at what you do in another part of IT.  Likely you are a fantastic developer. Or data modeler.  Or sysadmin. Or networking guy (okay, maybe not likely you are one of those…but that's another post).  Maybe you knew a bit about databases from having worked with data in them, or you knew a bit because you had to install and deploy DBMSs.  Then the regular DBA left. Or he is overwhelmed with exploding databases and needs help. Or he got sent to prison (true story for one of my accidental DBA roles). I like to say that the previous DBA "won the lottery" because that's more positive than LEFT THIS WONDERFUL JOB BEHIND FOR A LIFE OF CRIME.  Right?


I love writing about this topic because it's a role I have to play from time to time, too.  I know about designing databases, but can I help with installing, managing, and supporting them?  Yes. For a while.


Anyway, now you have a lot more responsibility than just writing queries or installing Oracle a hundred times a week.  So what sorts of things must a new accidental DBA focus on to become a great data professional?  Most people want to get right into performance tuning all those slow databases, right?  Well, that's not what you should focus on first.


The Minimalist DBA


  1. Inventory: Know what you are supposed to be managing.  Often when I step in to fill this role, I have to support more servers and instances than anyone realized were being used.  I need to know what's out there to understand what I'm going to get a 3 AM call for.  And I want to know that before the 3 AM call. 
  2. Recovery: Know where the backups are, how to get to them, and how to do test restores. You don't want that 3 AM call to result in you having to call others to find out where the backups are. Or to find out that there are no backups, really.  Or that they are actually backups of the same corrupt database you are trying to fix.  Test that restore process.  Script it.  Test the script.  Often.  I'd likely find one backup and attempt to restore it on my first day on the job.  I want to know about any issues with backups right away.
  3. Monitor and Baseline: You need to know BEFORE 3 AM that a database is having a problem. In fact, you just don't want any 3 AM notifications.  The way you do that is by ensuring you know not only what is happening right now, but also what was happening last week and last month.  You'll want to know about performance trends, downtime, deadlocks, slow queries, etc.  You'll want to set up the right types of alerts, too.
  4. Security: Everyone knows that ROI stands for return on investment.  But it also stands for risk of incarceration.  I bet you think your only job is to keep that database humming.  Well, your other job is to keep your CIO out of jail.  And the CEO.  Your job is to love and protect the data.  You'll want to check to see how sensitive data is encrypted, where the keys are managed and how other security features are managed.  You'll want to check to see who and what has access to the data and how that access is implemented.  While you are at it, check to see how the backups are secured.  Then check to see if the databases in Development and Test environments are secured as well.
  5. Write stuff down: I know, I know.  You're thinking "but that's not AGILE!"  Actually, it is.  That inventory you did is something you don't want to have to repeat.  Knowing how to get to backups and how to restore them is not something you want to be tackling at 3 AM.  Even if your shop is a "we just wing it" shop, having just the right amount of modeling and documentation is critical to responding to a crisis.  We need the blueprints for more than just building something. 
  6. Manage expectations: If you are new to being a DBA, you have plenty to learn, plenty of things to put in place, plenty of work to do.  Be certain you have communicated what things need to be done to make sure that you are spending time on the things that make the most sense.  You'll want everyone to love their data and not even have to worry that it won't be accessible or that it will be wrong.
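The inventory and recovery points above lend themselves to a small script you can run every morning. Here is a minimal sketch in Python; the server names, instance names, and 24-hour threshold are hypothetical placeholders, and in practice you would populate the inventory from your CMDB or by scanning the network rather than hard-coding it.

```python
from datetime import datetime, timedelta

# Hypothetical inventory: in practice, pull this from your CMDB or by
# scanning the network, and write it down (see point 5 above).
INVENTORY = [
    {"server": "db-prod-01", "instance": "orders",
     "last_backup": datetime(2016, 7, 30, 2, 0)},
    {"server": "db-prod-02", "instance": "billing",
     "last_backup": datetime(2016, 7, 25, 2, 0)},
]

def stale_backups(inventory, now, max_age_hours=24):
    """Return the instances whose most recent backup is older than max_age_hours."""
    cutoff = now - timedelta(hours=max_age_hours)
    return [item for item in inventory if item["last_backup"] < cutoff]

if __name__ == "__main__":
    now = datetime(2016, 7, 30, 3, 0)  # the dreaded 3 AM
    for item in stale_backups(INVENTORY, now):
        print(f"ALERT: {item['server']}/{item['instance']} has no recent backup")
```

A check like this only tells you a backup file exists and is recent; it is no substitute for actually running a test restore, which is the part you should script and rehearse most often.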


These are the minimal things one needs to do right off the bat.  In my next post, I'll be talking about how to prioritize these and other tasks.  I'd love to hear what other tasks you think should be the first things to tackle when one has to jump into an accidental DBA role.

I'm back from the family London vacation and ready to get to work. I was unplugged for much of last week, focusing on the world around me. What I found was a LOT of healthy discussions about #Brexit, David Cameron leaving, and if you can eat a Scotch egg cold (HINT: You can, but you shouldn't, no matter what the clerk at Harrod's tells you.)


With almost 2,000 unread items in my RSS reader, I had lots of material to root through looking for some links to share with you this week. Here are the ones I found most amusing from around the Internet. Enjoy!


PokemonGO and Star Trek TNG's 'The Game'

Happy to see I wasn't the only one who made this connection, but what I'd like next is for someone to make an augmented reality game for finding bottlenecks in the data center.


Security Is from Mars, Application Delivery Is from Venus

I liked the spin this article took on the original book theme. Looking forward to the follow-up post where the author applies the business concepts of cost, benefits, and risk to their marriage.


Microsoft Wins Landmark Email Privacy Case

Reversing a decision from 2014 where the US government thought it was OK to get at data stored outside our borders, this ruling is more in line with current technology advances and, let's be honest, common sense.


How boobytrapped printers have been able to infect Windows PCs for over 20 years

I seem to recall this being a known issue for some time now, so I was shocked to see that the patch was only just released.


6 Workplace Rules that Drive Everyone Crazy

"Underwear must be worn at all times." Good to know, I guess, but is this really a problem for an office somewhere?


Microsoft Apologizes for Inviting "Bae" Interns to Night of "Getting Lit" in Lame Letter

Another public misstep for Microsoft with regards to social values. This letter is more than just poorly worded; it underlines what must be considered acceptable behavior for Microsoft employees.


High traffic hits the operations team

The Olympics are just around the corner!


Wonderful view for King Charles I last week, looking from Trafalgar Square down to Big Ben glowing from the setting sun:


As network engineers, administrators, architects, and enthusiasts, we are seeing a trend of relatively complicated devices that all strive to provide unparalleled visibility into the inner workings of applications or security. Inherent in these solutions is a level of complexity that challenges network monitoring tools. It seems that in many cases vendors are pitching proprietary tools that are capable of extracting the maximum amount of data out of a specific box. Just this afternoon I sat on a vendor call in which we were doing a technical deep dive, with a customer, of a next-generation firewall with a very robust feature set. Inevitably the pitch was made to consider a manager of managers that could consolidate all of this data into one location. While valuable in its own right for visibility, this perpetuates the problem of many “single panes of glass”.


I couldn’t help but think, what we really need is the ability to follow certain threads of information across many boxes, regardless of manufacturer—these threads could be things like application performance or flows, security policies, etc. Standards-based protocols and vendors that are open to working with others are ideal as it fosters the creation of ecosystems. Automation and orchestration tools offer this promise, but add on additional layers of intricacy in the requirements of knowing scripting languages, a willingness to work with open source platforms, etc.


Additionally, any time we abstract or simplify a layer, we lose something in the process; this is known as generation loss. Compounding that loss across many devices or layers of management tends to result in data that is incomplete or, worse, inaccurate, yet this is the data we intend to use to make our decisions.


Is it really too much to ask for simple and accurate? I believe this is where the art of simplicity comes into play. The challenge of creating an environment in which the simple is useful and obtainable requires creativity, attention to detail, and an understanding that no two environments are identical. In creating this environment, it is important to address what exactly will be made simple and by what means. With a clear understanding of the goals in mind, I believe it is possible to achieve these goals, but the decisions on equipment, management systems, vendors, partners, etc. need to be well thought through and the right amount of time and effort must be dedicated to it.

IT pros face a near-constant deluge of trouble tickets every day, which leaves very little time to analyze where the workday actually goes. This post gives you a glimpse into a critical part of a day in the life of a support professional, where service request management and resolution take place.




There are small organizations where ticketing management is pretty much nonexistent. In this instance, the IT admin ends up juggling multiple requests received through various disparate channels (phone, email, chat, in-person requests, etc.), trying to multitask and solve them all. This may sound superhuman, but everyone who’s been there and done that knows it’s extremely time-consuming and difficult. Without a system in place for managing and tracking these service requests, it takes a ridiculous amount of time to simply prioritize and tackle all the tickets at hand. At the end of the day, it’s just grappling with SLA delays and incomplete service fulfilment, dealing with irate customers, and being that overstretched IT pro who is lost in a maze of uncategorized ticket anarchy. This, of course, leads to technician and customer dissatisfaction in most cases.


Without proper IT service management processes and techniques in place, hiring additional staff to assist won’t help much. They will still get swamped with tickets and end up having to put out fires all over the place, too.


Where time is lost:

  • Incident management
  • Service request tracking
  • Ticket prioritization and categorization
  • Technician assignment and escalation
  • Communication and follow-up with end-users via multiple channels



Even if you have proper ticketing management practices in place, you could still be wasting time on actual problem resolution if there isn’t technology assistance for the support staff. When handling Level 2 and Level 3 support, all while attending to desktop support requests, IT pros usually lose a lot of time visiting end-user workstations and resolving issues there. This has been the traditional IT way, and it’s fairly simple in small companies with few employees. But, as your network and user base grows, your visit-and-assist method will prove less productive. When support agents aren’t equipped with the right tools for remote troubleshooting, you can definitely expect less productivity, specifically with the number of tickets closed per day.

Instead, consider implementing a self-service system to reduce the more frustrating requests, such as unlocking end-user accounts or resetting passwords. In these cases, even the least tech-savvy user can access a centralized knowledge management and self-service portal that empowers them to help themselves, freeing the IT pro to spend time on more important tickets.


Where time is lost:

  • Physically visiting end-user desks to resolve issues.
  • Repeatedly fixing simple and recurring tickets.


Time is definitely of the essence for the support department. Until we understand the value of process and technology, and implement steps to enhance both, we will continue to be burdened with productivity issues (for technicians), satisfaction issues (for customers), or worse, end up wasting time.


Help desk and remote support tools address both these challenges. For smaller organizations, where budgetary constraints compound the time management issues, simple and cost-effective help desk and remote support tools can help you avoid burning a hole in your pocket while optimizing time for both effective service management and efficient resolution.

Application performance management continues to be so hard because applications are becoming increasingly complex, relying on multiple third-party components and services. All of these are added to the already-complicated modern application delivery chain: the applications, the backend IT infrastructure that supports them, plus all the software, middleware, and extended infrastructure required for performance. This modern application stack, or AppStack, is often managed the old-fashioned way — in silos, without holistic visibility of the entire stack at once.


When the cloud gets added to the mix, none of the traditional AppStack goes away, but it’s no longer hosted by the agency in an on-premises datacenter. Nevertheless, IT is still accountable for application performance. So, what can application administrators do to ensure consistent top performance in a hybrid cloud environment? Here are a few things to consider:


Add a step. In a hybrid cloud system, instead of trying to pinpoint where and what the problem is, the first step is quickly and definitively determining who owns the problem -- IT or the cloud vendor.


Get involved in the development phase. Cloud monitoring and management decisions need to be made when the hybrid cloud environment is being created. Don’t get stuck trying to manage applications that weren’t designed for the cloud. By getting involved in the development phase, administrators can ensure they have control over application performance.


Manage cloud and on-premises performance data to determine the root of the application issue. IT needs to determine whether the application performance issue is with the software or the configuration. To achieve full visibility into the performance issue, a monitoring system must be in place to definitively determine where the problem lies.


Plan for the worst-case scenario to prevent it from happening. Early on, administrators should think through and plan for the worst-case scenarios in a hybrid cloud environment to spot problems before they arise and to be prepared should they actually occur. Make sure, for example, that critical systems can failover to standby systems and data is backed-up and archived according to agency policies.


And let’s not forget the network’s role. Application delivery is only as good as the network, and cloud or hybrid apps need to traverse a path across your network to the cloud provider and back. Visibility of the network path can assist in troubleshooting.


Every technological shift comes with a unique set of complexities and challenges – and the need for new understanding. The hybrid cloud is no different. However, gaining visibility into application performance can help agencies reap the benefits of hybrid cloud environments and ensure application performance remains consistently strong.


Find the full article on Government Computer News.

"Sore throat from talking, sore feet from walking, sore face from smiling. Must be @Cisco Live!"


I have a whole post about what I saw at Cisco Live and what I think about what I saw, which will be posted in the next few days, but as I started writing that article, I realized that I had a lot of ink to spill about our amazing SolarWinds team and the experience I had working with them during the convention this year. I realized that it deserves its own dedicated space, which is what this post is about.


I hope you'll take the time to read this preface, share my gratitude, and maybe even leave thoughts of your own in the comments below if you were able to experience any of our shows.


The first thing that plucked at my heartstrings was the booth. Veteran event specialists Cara Prystowski and Helen Stewart executed masterfully, including a completely new (and jaw-droppingly awesome) booth design with the largest team ever assembled. Words cannot do it justice, so here are some pictures:



The team this year boasted an incredible mix of skills and perspectives. We had our usual complement of engineers and geeky types, including NPM Product Managers Chris O'Brien and Kevin Sparenberg; Sales Engineers Sean Martinez, Andrew Adams, Miquel Gonzalez, and David Byrd; Product Marketing Managers Robert Blair and Abigail Norman; and even some of our depth technical leads, including Product Strategist Michal Hrncirik, Principal Architect (and resident booth babe) Karlo Zatylny, and Technical Lead Developer Lan Li.


This was a powerhouse of a team, and there wasn't a single question that people brought to the booth that couldn't be addressed by someone on staff.


Of course, Head Geeks Patrick Hubbard and Destiny Bertucci (not to mention Yours Truly) were there, as well, adding a unique voice and vision for where technology is heading. I can’t stress this enough: Destiny, our newest addition to the Head Geek team, kicked serious geek butt, in the booth, on the trade show floor, and at our SWUG. She is a juggernaut.


But the awesome didn't stop there.


Wendy Abbott was working the crowd like the THWACK rock star she is, helping people understand the value of our online community, as well as spreading the word about our upcoming THWACKcamp, happening September 14th and 15th. She also split her time between the booth and helping the inimitable Danielle Higgins run our first-ever "mini-SWUG": a SolarWinds User Group for Cisco Live attendees and Las Vegas locals. It was a total blast.


Our Social Media Ninja Allie Eby was also on hand, helping direct visitors and field questions while capturing and promoting the ideas and images swirling around our booth on Twitter, Facebook, and LinkedIn.


Support can be an unsung and under-developed area in any tech organization. But Jennifer Kuvlesky was on hand to show off the new Customer Success Center, helping existing customers and people new to our company understand that we are committed to providing in-depth technical information about our products and the problems they solve.


We even had senior UX designer Tulsi Patel with us, to capture ideas and reactions about our new Orion UI, as well as the features and functions of our products. How's that for responsive? We actually were able to take user feedback from the booth as people were interacting with our products!


Jonathan Pfertner and Caleb Theimer, crack members of our video team, were tirelessly capturing video footage, testimonials, customer reactions, and action shots of the show. Look for some of those images coming soon.


Even our management team was firing on all cylinders. Jenne Barbour (Sr. Director of Marketing and un-official Head Geek wrangler) and Nicole Eversgerd (Sr. Director of Brand and Events) put in as many hours, logged as many carpeted miles, and radiated as many geek-friendly smiles as the rest of the team.


Finally, we had a top notch onsite setup crew. Patrick mentioned that it was the first time he was able to just walk in and turn on the PCs. No small thanks for that goes to Kong and his help provisioning systems last weekend.


It was, as Patrick mentioned on Twitter, the "best #CLUS ever.”


Now that you have an appreciation for just how awesome the SolarWinds crew was, sit tight for my thoughts about what I saw this year, and what it means for us as IT pros.

With the pace of business today, it’s easy to lose track of what’s going on. It’s also becoming increasingly difficult to derive value from data quickly enough that the data is still relevant. Oftentimes companies struggle with a situation where, by the time the data has been crunched and visualized in a meaningful way, the optimal window for taking action has already come and gone.


One of the strategies many organizations use to make sense of the vast amounts of helpful data their infrastructure generates is to collect all the logging information from various infrastructure components, crunch it by correlating time stamps and using heuristics that take relationships between infrastructure entities into account, and present it in a report or dashboard that brings the important metrics to the surface.


ELK Stack is one way modern organizations choose to accomplish this. As the name (“stack”) implies, ELK is not actually a tool in itself, but rather a useful combination of three different tools – Elasticsearch, Logstash, and Kibana – hence ELK. All three are open source projects maintained by Elastic. The descriptions of each tool from Elastic on their website are great, so I have opted not to re-write them. Elastic says they are:


  • Elasticsearch: A distributed, open source search and analytics engine, designed for horizontal scalability, reliability, and easy management. It combines the speed of search with the power of analytics via a sophisticated, developer-friendly query language covering structured, unstructured, and time-series data.
  • Logstash: A flexible, open source data collection, enrichment, and transportation pipeline. With connectors to common infrastructure for easy integration, Logstash is designed to efficiently process a growing list of log, event, and unstructured data sources for distribution into a variety of outputs, including Elasticsearch.
  • Kibana: An open source data visualization platform that allows you to interact with your data through stunning, powerful graphics. From histograms to geomaps, Kibana brings your data to life with visuals that can be combined into custom dashboards that help you share insights from your data far and wide.


Put simply, the tools respectively provide fast searching over a large data set, collect and distribute large amounts of log data, and visualize the collected and processed data. Getting started with ELK stack isn’t too difficult, but there are ways that community members have contributed their efforts to make it even easier. Friend of the IT community Larry Smith wrote a really helpful guide to deploying a highly available ELK stack environment that you can use to get going. Given a little bit of determination, you can use Larry’s guide to get a resilient ELK stack deployment running in your lab in an evening after work!
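To make that division of labor concrete, here is a rough Python sketch of the kind of work Logstash performs before handing documents to Elasticsearch: parsing a raw log line into structured fields (what the grok filter does) and rendering documents in the newline-delimited format Elasticsearch's bulk API expects. The log format and index name here are hypothetical examples, not anything ELK prescribes.

```python
import json
import re

# Hypothetical log format: "2016-07-30T03:14:07 ERROR web-01 Connection refused"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<level>\w+)\s+(?P<host>\S+)\s+(?P<message>.*)"
)

def parse_line(line):
    """Turn one raw log line into a structured document (grok's job in Logstash)."""
    match = LOG_PATTERN.match(line)
    if match:
        return match.groupdict()
    # Logstash tags unparseable events similarly so they aren't silently lost.
    return {"message": line, "tags": ["_parsefailure"]}

def to_bulk(docs, index="logs-2016.07.30"):
    """Render documents as an Elasticsearch bulk-index request body:
    alternating action and source lines, newline-delimited."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    raw = ["2016-07-30T03:14:07 ERROR web-01 Connection refused"]
    print(to_bulk([parse_line(line) for line in raw]))
```

In a real deployment you would never hand-roll this; a few lines of Logstash configuration wire the grok filter to the Elasticsearch output, and Kibana then visualizes whatever lands in the index.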


Alternatively, if you’re looking to get going on an enterprise-class deployment of these tools and don’t have time for fooling around, you could consider whether hosted ELK stack services would meet your needs. Depending on your budget and skills, it could make sense to let someone else do the heavy lifting, and that’s where services like Qbox come in. I’ve not used the service myself and I’m not necessarily endorsing this one, but I’ve seen managed services like this one be very successful in meeting other pressing needs in the past.


If you check this out and ELK Stack doesn’t meet your data insight requirements, there are other awesome options as well. There’s also the ongoing debate about proprietary vs. open source software and you’ll find that there are log collection/search/visualization tools for both sides of the matter. If you’re looking for something different, you may want to consider:



In a network far, far away, on a sunny Friday afternoon, something was broken badly and nobody was there to repair it. I was called in as an external consultant to help the local IT team fix the problem. I had never seen that network before, and the description of the error said only "outage in the network". When I arrived at the car park, a lot of sad-looking employees were leaving the office building.

The first thing that I always do in these situations is ask for the network documentation. There was an uncomfortably long silence after I asked the question. Finally somebody said: "Yeah, our documentation is completely outdated, and the guy who knew all the details about the network has just left the company..." The monitoring looked like an F1 race win in Monza: everything was blinking red. The monitoring would really have helped, but unfortunately it was also down.

When you don't know what to look for, it is like looking for a needle in a haystack. In an outage situation like this, proper documentation and working monitoring would have helped by reducing the time to find the actual problem. Instead of debugging the actual problem, you spend an enormous amount of time exploring the network, desperately trying to find out in which general direction you should do further troubleshooting. You will probably also get sidetracked by minor misconfigurations, bad network designs, and other odd details: things that have been there for many years but are not causing the problem you are trying to fix right now. It is hard to figure out what the actual problem is in these situations. You also get constant pressure from management during the outage; while you are still exploring the network, you have to report what could have caused it.

To summarize: without valid documentation and working monitoring, you face some time-consuming challenges in bringing the network back to life. It is not an ideal position, and you should do everything you can to avoid it.



Lessons to be learned


Having up-to-date documentation helps a lot. I know keeping the documentation up to date is a lot of work. Many network engineers are constantly firefighting and under pressure to roll out new boxes for a "high priority" project. It helps to build monitoring and documentation into the rollout workflow, so that no device can be added without being documented and monitored.

In my little example from the beginning, somebody trying to troubleshoot the initial problem had, with the trial-and-error method, disconnected the production virtualization host on which the monitoring was running. To avoid these situations, it makes sense to run the monitoring system on separate infrastructure that works independently even when there is an outage in the production environment.

For the documentation, sometimes less is better. You need a solid ground level: for example, a good diagram that shows the basic network topology. Because documentation is outdated the moment somebody finishes it, it is better to look at live data when possible. I am always unsure whether a possibly outdated document is showing the correct switch types or interfaces. In the monitoring, you can be sure this information has been live-polled and is automatically updated if somebody makes even a minor change like a software update.

Some problems can be fixed fast, and some are more of a long-term effort: for example, the mysterious "performance problem". These tickets circulate around all the IT departments, and nobody can find anything. Here it helps to lay out the complete picture: find out all the components that are involved and their dependencies on each other. This can be a very time-consuming job, but sometimes it is the only way to figure out what is really causing the "performance problems". With that knowledge integrated into the monitoring, you get live data for the involved systems. I have had great success fixing long-term problems with that method, and afterwards I have the ability to keep monitoring the issue just in case it shows up again.

Still in London, and enjoying every minute of time well spent with family. Even on vacation I still find time to read the headlines.


So, here is this week's list of things I find amusing from around the Internet. Enjoy!


Savvy Hackers Don't Need Malware

Interesting article on how companies will be incorporating analytics into their security measures, which makes me wonder, "Wait, why weren't you already doing this?"


A.I. Downs Expert Human Fighter Pilot In Dogfight Simulation

Forget self-driving cars; we are about to have self-flying fighter jets. I'm certain this won't lead to the end of humanity.


Security insanity: how we keep failing at the basics

Wonderful article by Troy here, showing more examples of basic security failures from some large companies and websites.


The U.S. MiseryMap of Flight Delays

Valuable data for anyone that travels, ever.


Here’s How Much Energy All US Data Centers Consume

It is surprisingly low, IMO. I'd like to see some data on how much capacity and throughput these data centers are providing over time as well.


CRM and Brexit: teachable moment

I enjoyed how the author broke down #Brexit and related it to software. We've all seen both sides of this equation.


Is Your Disaster Recovery Plan Up-To-Date?

I've always found summertime as a good time to review your DR plans, and this article is a nice reminder.


Why yes, Lego Daughter loves Harry Potter, why do you ask?


Most large organizations have IT change management processes sorted. The approval cycle, implementation, backout steps, review and closure form a fine-tuned machine with many moving parts. This is day-to-day stuff. You can’t manage a large infrastructure without having a structured, consistent approach to managing changes.


For others, change management is more like the Post-it note chaos at the start of The Phoenix Project (once they’d moved past email storms and actually started using Post-its). It’s like the crazy traffic around the Champs-Élysées in Paris. I’m sure there are some rules in there somewhere, and most people avoid hitting each other, but that’s more through good luck than good management.


With either scenario, someone within your organisation is the change initiator. They need to change an infrastructure component or application for a reason that will benefit your organization, whether it’s a new feature, a compliance request, a security patch or a network hardware upgrade. Even if a vendor is supplying the changed component, someone in your company has decided to run with it and implement it.


This isn’t always the case with your Software-as-a-Service (SaaS) products. And that frightens a lot of people. One of the main benefits of SaaS is the fact that you’re always being kept up to date. You’re no longer at the mercy of long release cycles, as the vendor can guarantee that the entire user base has applied the previous release, so everything works for everybody with just one more incremental upgrade. The downside is that you’re at the mercy of a vendor who will release changes whether you want them or not, whenever they want to.


There aren't many SaaS companies that offer a 'beta' approach to their changes. Microsoft's Office 365 platform does, with their update channels. First Release lets you receive the changes before they go mainstream, so you can configure this for your IT team, your test labs, and even a pilot group of early-adopter power users. The Deferred Channel receives updates every four months instead of monthly. As an aside, they've introduced a similar concept for Windows 10 updates, including a Long-Term Servicing Branch that doesn't see updates for 1-3 years.


Applying this to the real world is going to involve some planning.

  • Who is going to be in the First Release group?
  • When are you going to schedule the testing of these changes?
  • What co-dependencies are there within your organization that will need to be tested against (e.g., what other systems rely on Word integration)?
  • Does the scheduling of any other changes to any other systems or infrastructure have an impact on the scheduling of your SaaS change testing?
  • Will the scheduled SaaS changes have any impact on other integrations you have with other SaaS products, particularly if there’s a change to their APIs?


If your SaaS provider doesn't pre-release updates, then you really have to be on your game. Your best bet then is to stay on top of the product roadmap, so you know (as best you can) what's coming up in the future and when. This might mean monitoring a blog or subscribing to a particular mailing list.


Remember, SaaS changes don't just impact your systems; they can also impact your people. If an update changes a user interface, you might get some helpdesk calls. Make sure your helpdesk staff have seen the new interface or 'what's new' splash screens too.


The relationship you have with your vendor is really important here. Hopefully you’ve established your change management needs during the buying process so you’re not caught out after implementation.


From another perspective, maybe your SaaS use isn't that integral to your infrastructure after all? If it's a bolt-on product like Slack, do you really care what changes they make and when? Again, this assessment of the impact of future SaaS changes on your organization should have been addressed during the buying process. You might actually have assessed this and tagged it as a low risk.


There are a lot of considerations if you are adopting Software as a Service. We’ve explored a few things to watch out for and some tips to make your SaaS adoption a success, from an IT Operations perspective.


I’d love to hear your thoughts on what steps (if any) you’ve taken to include SaaS as part of your change management processes.





I have watched Deadpool about a dozen times now, mostly on airplanes as I travel to various places around the world and Canada. Most of these viewings have been for research purposes. That's right, I *had* to watch Deadpool again and again in order to write this post. You're welcome.


Anyway, I didn't need to see it that many times to know that the writers of Deadpool MUST have had a SysAdmin in mind when they started writing. Don't believe me? Well, let's break it down a bit.


Deadpool is a love story - Much like our careers, we love what we do. Otherwise, we wouldn't be doing it.

Language, please - There is profanity throughout the movie, much like any ordinary day as a SysAdmin.

It's OK to ask for help - There are times when you need to swallow your pride and head over to another cubicle and ask for help.

Sometimes you need to wear your brown pants - Because when it happens, you'll know.

Timelines are confusing - Anyone that has had to answer the question "what happened last night?", or "what changed?" knows this to be true.

Maximum effort - Always needed, and always given.

Ease up on the bedazzling - Simple solutions often work best. There is no need to reinvent the wheel, or to rewrite your scripts when a new language hits.

Pizza improves any situation - Having lunch with coworkers, especially those from other teams, is a great way to break through silos.

You can erase stuff written in pencil - When you think you've done something great and won a prize that turns out to be not so great.

Life is an endless series of train wrecks - And DR exercises, planned and unplanned.

'Make it Big' is the album where they earned the exclamation point - Sometimes new releases of software, like SQL 2016, get us excited.

You sound like an infomercial - When vendors email you blindly to sell you solutions you don't need.

We need to subject you to extreme stress - Yeah, much like every day of your SysAdmin life.

Enjoy your weekend - That feeling when you are on call by yourself for the first time.

Your year long plan ends with the wrong guy getting dismembered - When everything goes wrong and you end up in worse shape than when you started.

IKEA doesn't assemble itself - Neither does the network, or an ETL process. It's a team effort.

Time to make the chimichangas - When there is work to be done, and you know it won't be easy.

Writing notes to others - When you leave comments in your code. These are notes to Future You.

I'm totally on top of this - When you are not on top of anything and everything around you is falling apart.

Four or five moments - What it takes to be a hero.

Don't drone on - People will stop listening, or stop reading posts like this one.


Learning to Learn

Posted by SomeClown Jul 12, 2016

In network engineering, and really in all of the information technology world, a plurality of people didn't learn their craft in the traditional manner. In most professions, we go to school, select a degree, then start at the bottom in the career we have chosen. While that is increasingly becoming the case in IT as well, there are a lot of practitioners who learned their craft on the fly, and have gone on to be successful despite the lack of a formal degree in computer science, or often with no completed degree at all. That is a result of the nascent industry we all saw in the '80s and onward, as the world of IT grew and became mainstream in almost every corner of the world. That has, however, brought with it some unique challenges.


One of those challenges is that we all tend to want to learn and study the things we find the most enjoyable about a subject. As an example in my own life, and this is a bit of a segue to be sure, I have taught myself four different programming languages over the course of my existence on this planet. Yet I find myself skipping the same subjects in each language—repeating the mistakes I have made in the past, carrying them with me into the present. Maybe it's that I don't enjoy the basics of mathematics and loops, so I jump straight into the more complex topics. We'll get back to why that is a problem in a second.


So if the free-form learning style of going it alone and deciding how you want to approach learning has its weaknesses, so too do traditional college classes. While college forces a certain amount of rigor, and foists on us a discipline we might otherwise lack, the information you learn tends to be out of date by the time you graduate. That leaves the degree as merely a barrier to entry into the career world, without necessarily giving you the current skills you need to be truly competitive. You will have a degree upon which your future endeavors may rely for stability, but you'll likely have to hit the path of learning on your own nevertheless.


This is not an indictment of either path, only an observation on a couple of differences. Ultimately, if we are in the IT world as a generality, and network engineering in specificity, we have already overcome whatever barriers there might have been in our learning methodologies. Learning is a lifetime skill, and you will never become competitive, nor remain so, without constantly refreshing your knowledge. I would argue that everyone in every endeavor in their lives, whether professional or personal, should accept this… but the subject of personal development is a rich topical area and plenty of books have already been written. I’ll leave that for someone else to take on.


Now that we have established that both of the paths most of us in the field take to learn our craft have flaws, where does that leave us? To answer that, I'll tell you a story.


A couple of years ago I was brought in as a consultant to help an IT team solve some nagging problems in their network. They had tried for some time to remediate the problems they were facing, but had only succeeded in slightly reducing the impact. The network was unstable, brittle, and management had not only noticed, but had become less and less confident in the leadership and structure of the various departments involved. To troubleshoot the network, and to ultimately solve the problems, would require cooperation from multiple disciplines in IT, but also a new approach.


In my first meeting with the team I listened to the problems they were having, and the impacts on the network and the business. When you remediate network issues you frequently have to operate within very confined maintenance windows, which can be hard to come by. You also put tremendous pressure on your staff, and lower the overall quality of your services to the organization, as you focus an ever-increasing amount of time on troubleshooting. This eventually leads to things being missed, daily tasks and housekeeping not getting done, and an even greater potential impact on the network just from things being overlooked. So I like to spend some time getting a sense of the room, understanding that just by bringing in an outside consultant, longtime staff often feel threatened.


Next I moved on to asking questions about what had been done to this point, what had been tried so far. And this is where I really began to see the trouble. The tasks that had been tried were, in some cases, great, and in other cases not much more than throwing darts at the wall—wild-eyed guessing at best. There was no structured methodology to the troubleshooting, no documentation of what had been tried, and only a vague sense of what equipment had been changed in the process. Additionally, their logging was spotty at best, and only marginally useful to anyone outside the operations team in the NOC.


Without boring you with additional details, let's just leave it at this: I worked with their team to apply some best practices, and we eventually found and remediated the problem. But this illustrates the challenges we face in IT with either of the learning methods we discussed above. We can end up learning outdated information in order to get a degree that is stale as soon as we get it, or we can teach ourselves what we need to know but end up with sizable, and critical, gaps in our knowledge. Either way can lead to the kind of fundamental mistakes, and lack of a disciplined approach, which contributed to the problems with the network I described.


The way we can overcome this as professionals is to develop a passion for learning. We must constantly strive to not only stay ahead of the industry as a whole, but to better ourselves in our craft. We also have to be honest with ourselves about our own weaknesses, and without letting them negatively drive us, work constantly on incremental improvement. It’s not always a perfect recipe, and we all have setbacks, but if we don’t strive for better, we’ll end up in situations like what I’ve described here. All the logging tools in the world are useless if we don’t know how to use them properly.

Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task! With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long – but without a complete understanding of them, those numbers will prove quite useless for troubleshooting. Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.




Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it. High CPU usage on a VM may be an indicator that more vCPUs are required, or a sign of something that has gone awry within the OS. Chronic high CPU usage on the host may indicate the need for more resources, in terms of either additional cores or more ESXi hosts within the cluster.


Ready (%RDY) - CPU Ready (%RDY) is a very important metric that is brought up in nearly every blog post dealing with VMware and performance. Put simply, CPU Ready measures the amount of time that a VM is ready to run on a physical CPU but is waiting for the ESXi CPU scheduler to give it time to do so. Normally this is caused by other VMs competing for the same resources. VMs experiencing a high %RDY will definitely see performance implications; it may indicate the need for more physical cores, or can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.


Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time the VM incurred delay due to the ESXi CPU scheduler – the difference being that Co-Stop only applies to VMs with multiple vCPUs, while %RDY can also apply to VMs with a single vCPU. A high number of VMs with high Co-Stop may indicate the need for more physical cores within your ESXi host, too high a consolidation ratio, or, quite simply, too many multi-vCPU VMs.
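To make this concrete, here's a minimal Python sketch that flags VMs whose ready or co-stop time suggests scheduler contention. The VM names, sample values, and thresholds are illustrative assumptions (rules of thumb often cited in the community), not VMware guidance – tune them for your own environment:

```python
# Hypothetical esxtop samples: VM name -> (%RDY, %CSTP); values are invented.
SAMPLES = {
    "web01": (2.1, 0.0),
    "db01": (14.7, 4.2),   # clearly contending for physical CPU time
    "app02": (6.3, 0.5),
}

# Commonly cited rules of thumb (assumptions -- tune for your environment):
RDY_WARN = 10.0   # sustained %RDY around 10% per vCPU usually hurts
CSTP_WARN = 3.0   # sustained %CSTP above ~3% suggests too many multi-vCPU VMs

def flag_cpu_contention(samples):
    """Return (vm, reason) pairs for VMs showing CPU scheduler contention."""
    flagged = []
    for vm, (rdy, cstp) in samples.items():
        reasons = []
        if rdy >= RDY_WARN:
            reasons.append(f"%RDY={rdy}")
        if cstp >= CSTP_WARN:
            reasons.append(f"%CSTP={cstp}")
        if reasons:
            flagged.append((vm, ", ".join(reasons)))
    return flagged

for vm, why in flag_cpu_contention(SAMPLES):
    print(f"{vm}: {why}")
```

In practice you'd feed this from an esxtop batch-mode export rather than a hard-coded dictionary, but the triage logic is the same.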




Active (%ACTV) - Just as it's a good idea to monitor average CPU usage on both hosts and VMs, the same goes for active memory. Although we cannot necessarily use this metric for right-sizing, due to the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.


Swapping (SWR/s,SWW/s,SWTGT,SWCUR) - Memory swapping is a very important metric to watch. Essentially, if we see this metric anywhere above 0, it means that we are actively swapping memory pages out to the swap file that is created at VM power-on. This means that instead of serving memory from RAM, we are using much slower disk to do so. If we see swapping occurring, we may be in the market for more memory in our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.


Balloon (MCTLSZ, MCTLTGT) - Ballooning isn't necessarily a bad sign in itself, but it can definitely be used as an early warning symptom for swapping. When a value is reported for ballooning, it basically states that the host cannot satisfy the VMs' memory requirements and is reclaiming unused memory back from other virtual machines. Once we are through reclaiming memory via the balloon driver, swapping is the next logical step, which can be very detrimental to performance.
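The escalation described here – ballooning as an early warning, swapping as the truly painful step – can be sketched as a simple classifier. The parameter names only loosely mirror the esxtop counters, and the "anything above zero matters" rule is the same one stated above:

```python
def memory_pressure_state(balloon_mb, swapped_mb, swap_reads_s, swap_writes_s):
    """Classify memory pressure following the balloon-then-swap escalation.
    Parameter names loosely mirror esxtop counters (MCTLSZ, SWCUR, SWR/s, SWW/s)."""
    if swap_reads_s > 0 or swap_writes_s > 0:
        return "swapping"      # active swap I/O: memory is being served from disk
    if swapped_mb > 0:
        return "swapped"       # pages parked in the swap file from earlier pressure
    if balloon_mb > 0:
        return "ballooning"    # balloon driver inflated: early warning sign
    return "ok"

print(memory_pressure_state(512, 0, 0, 0))   # prints "ballooning"
```

The ordering of the checks matters: active swap I/O is always the most urgent signal, regardless of what the balloon driver is doing.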




Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king. Within a virtualized environment, though, there are many different places where latency may occur – leaving the VM, going through the VMkernel, the HBA, and the storage array. To help understand total latency, we can look at the following metrics.

  • KAVG – This is the amount of time that the I/O spends within the VMkernel
  • QAVG – This is the amount of time that the I/O spends queued in the HBA driver after leaving the VMkernel
  • DAVG – This is the amount of time the I/O takes to leave the HBA, get to the storage array, and return
  • GAVG – We can think of GAVG (Guest Average) as the total latency as seen by the applications within the VM – effectively DAVG + KAVG (QAVG is counted within KAVG)


As you might be able to determine, a high QAVG/KAVG can most certainly be the result of too small a queue depth within your HBA – that, or possibly your host is far too busy and VMs need to be migrated to other hosts. A high DAVG (>20ms) normally indicates an issue with the actual storage array: either it is incorrectly configured or it is too busy to handle the load.
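As a rough sketch of that triage logic – using the >20 ms DAVG rule of thumb above, plus an assumed "KAVG/QAVG should sit near zero" threshold – a quick check could look like this. The thresholds and messages are illustrative assumptions, not fixed VMware limits:

```python
DAVG_WARN = 20.0   # ms: sustained device latency above ~20 ms -> look at the array
KAVG_WARN = 2.0    # ms: kernel/queue latency should normally be near zero (assumed)

def diagnose_latency(davg_ms, kavg_ms, qavg_ms):
    """Return (total guest latency, findings) from the esxtop latency metrics."""
    gavg_ms = davg_ms + kavg_ms   # GAVG ~= DAVG + KAVG (QAVG is counted inside KAVG)
    findings = []
    if davg_ms > DAVG_WARN:
        findings.append("array: check storage configuration and load")
    if kavg_ms > KAVG_WARN or qavg_ms > KAVG_WARN:
        findings.append("host: queue depth too small or host oversubscribed")
    return gavg_ms, findings

gavg, findings = diagnose_latency(25.0, 0.5, 0.1)
print(f"GAVG={gavg} ms, findings={findings}")
```

Decomposing GAVG this way tells you at a glance whether the latency is accumulating on the host side or out at the array.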




Dropped packets (DRPTX/DRPRX) - As far as network performance goes, there are only a couple of metrics that we can monitor at the host level. DRPTX and DRPRX track the packets that are dropped on the transmit or receive side, respectively. When we begin to see these metrics go above 1, we may conclude that network utilization is very high and that we need to increase our bandwidth out of the host – or possibly somewhere along the path the packets are taking.


As I mentioned earlier, there are over 300 metrics within esxtop – the above are simply the core ones I use when troubleshooting performance. Certainly, a third-party monitoring solution can help you baseline your environment and use these stats to your advantage by summarizing them in more visually appealing ways. For this week, I'd love to hear about some of your real-life situations: When was there a time when you noticed a metric was "out of whack," and what did you do to fix it? What are some of your favorite performance metrics to watch, and why? Do you use esxtop, or do you have a favorite third-party solution you like to utilize?


Thanks for reading!

Historically, cyber security methods closely mirrored physical security – focused primarily on the perimeter and preventing access from the outside. As threats advanced, both have added layers, requiring access credentials or permission to access rooms and systems, and additional defensive layers continued to be added for further protection.


However, the assumption is that everything is accessible; it’s assumed that no layer is secure and that, at some point, an intruder will get in—or is already in. What does this mean for the federal IT pro? Does it mean traditional security models are insufficient?


On the contrary; it means that as attacks – and attackers – get more sophisticated, traditional security models become one piece of a far greater security strategy made up of processes and tools that provide layers to enhance an agency's security posture.


A layered approach


Agencies must satisfy federal compliance requirements, and the Risk Management Framework (RMF) was created to help. That said, meeting federal compliance does not mean you’re 100 percent secure; it’s simply one—critical—layer.


The next series of layers that federal IT pros should consider are those involved in network operations. Change monitoring, alerting, backups and rollbacks are useful, as are configuration management tools.


A network configuration management tool will help you create a standard, compliant configuration and deploy it across your agency. In fact, a good tool will let you create templates.


Automation is key and a configuration management tool will help you keep up with changes automatically; it will let you change your configuration template based on new NIST NVD recommendations and get those changes out quickly to ensure all devices maintain compliance.
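As an illustration of template-driven compliance checking (a sketch, not tied to any particular configuration management product), the idea is to compare each device's running config against a set of required template lines. The config lines and the syslog host address below are invented examples:

```python
# Required template lines and device configs are invented examples; the
# "logging host" address is a hypothetical syslog collector.
REQUIRED_LINES = {
    "service password-encryption",
    "no ip http server",
    "logging host 10.0.0.5",
}

def compliance_report(device_configs):
    """Return {device: sorted missing lines} for devices drifting from the template."""
    report = {}
    for device, config_text in device_configs.items():
        present = {line.strip() for line in config_text.splitlines()}
        missing = REQUIRED_LINES - present
        if missing:
            report[device] = sorted(missing)
    return report

configs = {
    "rtr-core-1": "service password-encryption\nno ip http server\nlogging host 10.0.0.5\n",
    "rtr-edge-2": "no ip http server\n",
}
print(compliance_report(configs))
```

When a new NIST NVD recommendation lands, you update REQUIRED_LINES once and every device is re-checked against it – which is exactly the automation benefit described above.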


In addition to a network configuration tool, federal IT pros should consider layering in the following tools to enhance security:


Patch management. Patch management is critical to ensuring all software is up to date, and all vulnerabilities covered. Look for a patch management tool that is automated and supports custom applications, as many agencies have unique needs and unique applications.


Traffic analysis. A traffic analyzer will tell you, at any given time, who is talking to whom, who is using which IP address, and who is sending what to whom. This is vital information. Particularly in the case of a threat, where you need to conduct forensics, a traffic analysis tool is your best weapon.


Security information and event management. Log and event management tools bring all the other pieces together, allowing federal IT pros to see the entire environment—the bigger picture—to correlate information and make connections to see threats that may not have been visible before.


The ideal solution is to build on what you already have; use what works and keep adding. Create layers of security within every crevice of your environment. The more you can enhance your visibility, the more you know, the harder it will be for attackers to get through and the greater your chances of dramatically reducing risk will be.


Find the full article on Defense Systems.

Talk about disruptive innovation. Pokemon GO has put Nintendo back on the map, skyrocketing into the stratosphere of top mobile apps in just a week's time. The mobile game is a venture between Nintendo and Niantic, Inc. - an independent entity in the Alphabet set of companies. It launched with much fanfare last week and generated critical mass and out-of-the-stratosphere velocity in terms of user adoption and game play. And herein lies the IT aspect.


There were noticeable hiccups in quality of service (QoS), such that even though it was a Top Charts game, it had a 3.5-star rating in the Google Play Store. Essentially, the launch and overwhelming response created a situation where the elastic supply of cloud powered by Niantic's Alphabet parent company could not meet the demands of the rabidly active user base. An example was the experience of our very own James Honey, who was attempting to create an account for his youngest. He went through the web portal and created an account when it didn't time out. It sent him a URL to verify his information to create his child's account, which also timed out. So he clicked on the customer support button, which sent an email - but mean time to resolution is 48 hours. Interestingly enough, he's still waiting for verification and approval of his child's account, 80+ hours after he first started with the app download.


In closing, it circles back to things that IT pros already know all too well: (1) No matter how much planning and preparation goes into a production launch, it happens and IT pros have to remediate it efficiently and effectively; (2) Hybrid IT is reality as the app lifecycle is now spanning the developer's platform, running across distributed systems in the cloud(s), and is being consumed on someone's local mobile platform; and (3) the rate of change and scale of that change is ever growing over time and yet IT pros still have to deliver the CIO's SLA.


Do you think those three things are apropos? Let me know below in the comment section.

Cisco Live! begins in just a few days. In fact, even as you read this, my colleague, Cara Prystowski, is winging her way northwest from Austin to begin the process of setting up our brand new booth. (While I can't share pictures with you yet, trust me, it is a thing of wonder and beauty and I can't wait to see people's faces when they lay eyes on it for the first time.) Following close behind her is Head Geek™ Patrick Hubbard, who will make sure that the 16 demo systems are all up and running so that we can show the latest and greatest that SolarWinds has to offer.


But Cisco Live (or #CLUS, as it's known on social media) is about more than a bunch of vendors hawking their wares on the trade show floor. Here's what I'll be looking forward to:


First and foremost, YOU!! If you are coming to #CLUS, please stop by booth 1419 and say hello. We'll all be there, and the best part of each day is meeting people who share our passion for all things monitoring (regardless of which tools you use).


For me, personally, that also means connecting with my fellow Ohio-based networking nerds. We even have our own hashtag: #CLUSOH, and I expect to tirelessly track them down like the Pink Panther detective namesake.


#CLUS also is the first time we can introduce a familiar face in a new role. Destiny Bertucci is a veteran at SolarWinds (she's employee #13), a veteran of our convention circuit, and our newest Head Geek!! Destiny is uniquely and eminently qualified to be part of the Head Geek team, and all of SolarWinds is excited to see what comes from this next chapter in her career.


So with 3 Head Geeks, not to mention the rest of our amazing staff in-booth (all technical, not a salesperson in sight!!) I am excited to tell our story, and share all the amazing new features in NPM 12, not to mention NCM, NTA, SRM, and the rest of the lineup.


As mentioned earlier, our new booth is amazing. It features multiple demo stations, two video walls, and a vivid LED-infused design that underscores the SolarWinds style and focus. For those of us in the booth, it's functional and comfortable. For folks visiting us, it's eye-catching and distinctive.


Along with the new design comes new buttons, stickers, and convention swag. This includes SolarWinds branded socks. YES, SOCKS!! There is an underground #SocksOfCLUS conversation on Twitter, and I am proud to say we will be representing with the best of them. Meanwhile, the buttons and stickers that have become a sought-after collectible at these shows feature all new messages.


Any convention would be a waste of time if one didn't hit at least a few talks, seminars, and keynotes. While much of my time is committed to being in the booth, I'm looking forward to attending "Architecture of Network Management Tools,” and "Enterprise IPv6 Deployment,” among other sessions.


Of course, I would be remiss if I didn't mention the SolarWinds session! Patrick and I will be presenting "Hybrid IT: The Technology Nobody Asked For" on Tuesday at 4:30pm in the Think Tank. The response so far has been fantastic, but there is still room available. We hope you will join us if you are in the neighborhood.


Despite our heavy commitment to the booth and our personal growth, all three of the Head Geeks will be carving out time to stop by Lauren Friedman's section of the Cisco area to film some segments for Engineers Unplugged. Because whiteboards and unicorns!


Finally, and I almost hate to mention this because it's already sold out, so it's kind of a tease (but I will anyway): we're hosting our first convention-based (mini) SolarWinds User Group (SWUG) down at the Mandalay Bay Shark Reef. As always, SWUGs are a great way for us to meet our most passionate customers, but more than that, it's a way for customers within a particular area to meet each other and share ideas, brainstorm problems, and build community beyond the electronic boundaries of THWACK.


Obviously, there will be more, including social media scavenger hunts and Kilted Monday. But this should give you a sense of the range and scale of the Cisco Live experience. If you can't make it this year, I suggest you start saving and/or pestering your managers to help you make it out next time.


You know that we'll be there waiting to see you!

Recently, when hearing of the AWS outage due to weather in the Sydney data center, I began thinking about High Availability (HA) and the whole concept of "build for failure." It made me wonder about the true meaning of HA. In the case of AWS, as Ben Kepes correctly stated on a recent Speaking in Tech podcast, a second data center in, for example, Melbourne would have provided failover capacity and alleviated a high degree of consternation.


The following is a multi-level conversation about High Availability, so I thought I'd break it up into sections: server level, storage level, and cloud data center level.


Remember, Fault Tolerance (FT) is not HA. FT means that the application being hosted remains at 100% uptime, regardless of the types of faults experienced. HA means that the application can endure a fault at some level and recover rapidly, with potentially little to no downtime. FT, particularly in networking and virtual environments, involves a mirrored device always sitting in standby mode, actively receiving simultaneous changes to the app, storage, etc., which will take over should the primary device encounter a fault of some sort.


Server level HA, certainly the oldest segment of IT in which we've been struggling with this problem, has been addressed in a number of ways. Initially, when we realized that a single server was never going to resolve the requirement (and typically this referred to a mission-critical app or database), we decided clustering would be our first approach. By building systems where a pair (or a larger number) of servers act as tandem devices, we could enhance uptime and grant a level of stability to the application being serviced; this addressed some of the initial issues on a platform vulnerable to patching and other kinds of downtime.


Issues in a basic cluster had to do with things like high availability in the storage, networking failover from a pair to a single host, etc. For example, what would happen in a scenario where a pair of servers were tied together in a cluster, each with its own internal storage, and one went down? If a single host went down unexpectedly, the storage could become "out of sync," and potential data loss would ensue. This "split brain" is precisely what we're hoping to avoid. If you lose consistency in your transactional database, a rebuild can often fix it, but it takes precious resources away from day-to-day operations; even worse, there could be unrecoverable data loss, which can only be repaired with a restore. Assuming that the restore is flawless, how many transactions, and how much time, were lost during the restore, and from what recovery point were the backups made? So many potential losses here. Microsoft introduced the "Quorum Drive" concept into their clustering software, which offered the ability to avoid "split brain" data and ensured some cache coherency in an x86 SQL cluster. That helped quite a bit, but still didn't really resolve the issue.
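The quorum idea is easy to sketch: give the cluster an odd total number of votes and let a partition keep running only if it holds a strict majority, so two isolated halves can never both stay active. A minimal illustration, using the two-nodes-plus-quorum-disk layout described above:

```python
def has_quorum(votes_present, total_votes):
    """A partition may continue only with a strict majority of the vote count.
    With two nodes plus a quorum disk (3 votes), the side that holds the disk
    keeps running; the isolated node stands down instead of diverging."""
    return votes_present > total_votes // 2

# Two-node cluster + quorum disk = 3 votes total.
assert has_quorum(2, 3)        # node + quorum disk: stays up
assert not has_quorum(1, 3)    # isolated node: must stop (no split brain)
```

Note that with an even vote count (say, two nodes and no quorum disk), a clean split leaves each side with exactly half the votes, neither has a majority, and both stop – which is why the tie-breaking quorum resource matters.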


To me, it's no wonder that so many applications that could easily have been placed onto x86 platforms took so long to get there. Mainframes and robust Unix systems, which cost a great deal to maintain and stand up, had so much more viability in the enterprise, particularly for mission-critical, high-transaction apps. Note that there are, of course, other clustering tools – for example, Veritas Cluster Server, which made clustering within any application a more consistent, and actually quite a bit more robust, process.


Along comes virtualization at the x86 level. Clustering happened in its own way: HA was achieved through features like Distributed Resource Scheduling, and as the data typically sat on shared disk, consistency within the data could be ensured. We were also afforded a far more robust way to stand up much larger and more discrete applications, with tasks like adding processors, disk, and memory requiring no more than a reboot of individual virtual machines within the cluster that made up the application.


This was by no means a panacea, but for the first time we'd been given the ability to address inherent stability issues on x86. The flexibility of vMotion allowed the backing infrastructure to handle the higher availability of the VMs within the app cluster itself, removing the reliance of the internal cluster on hardware in network, compute, and storage. Initially, the quorum drive, which needed to be a raw device mapping in VMware, disappeared, making pure Microsoft SQL clusters more difficult; but as versions of vSphere moved on, these Microsoft clusters became truly viable.


Again, VMware has the ability to support a fault-tolerant environment for truly mission-critical applications. There are specific requirements for FT, along the lines of doubling the storage onto a different storage volume and doubling the CPU/memory and VM count on a different host, as FT involves mirrored devices, whereas HA doesn't actually follow that paradigm.


In my next posting, I plan to address Storage as it relates to HA, storage methodologies, replication, etc.

This edition comes to you from London where I am on vacation. I'm hopeful that #Brexit won't cause any issues getting in or out of the UK, or any issues while we are playing tourist. And I'm certainly not going to let a vacation get in the way of delivering the latest links for you to consume.


So, here is this week's list of things I find amusing from around the Internet. Enjoy!


Ransomware takes it to the next level: Javascript

Another article on ransomware which means another set of backups I must take, and so should you.


When the tickets just keep coming

Yeah, it's a lot like that.


IT must ditch ‘Ministry of No’ image to tackle shadow IT

OMG I never thought about this angle but now I need t-shirts that say "Ministry of NO" on them.


Happy 60th Birthday Interstate Highway System! We Need More Big-Bang Projects Like You

Once upon a time I was a huge fan of everything interstate. A huge 'roadgeek', I couldn't get enough knowledge about things like wrong road numbers, weird exit signs, and ghost roads like the Lincoln Highway and Route 66. Happy Birthday!


Oracle Loses $3 Billion Verdict for Ditching HP Itanium Chip

In related news, Larry Ellison has cut back on the number of islands he is thinking of buying this year.


Microsoft pays woman $10,000 over forced Windows 10 upgrade

Now *everyone* is going to expect money in exchange for not understanding the need for backups of your business critical data.


Apple Granted Patent For Phone Camera Disabling Device

If this technology comes to market it is going to wind up in the courts, which makes me think it was invented by lawyers at Apple.


Looking forward to more sightseeing with Lego Daughter the next two weeks.



The public sector frequently provides services and information via websites, and it’s important that these websites are up and running properly. And that’s not just for citizen-facing websites. Federal IT managers face the same challenge with internal sites such as intranets and back-end resource sites.


So what can federal IT pros do to keep ahead of the challenge, catch critical issues before they impact the user, and keep external and internal sites running at optimal performance?


The answer is three-fold:


  1. Monitor key performance metrics on the back-end infrastructure that supports the website.
  2. Track customer experience and front-end performance from the outside.
  3. Integrate back- and front-end information to get a complete picture.


Performance monitoring


Federal IT pros understand the advantages of standard performance monitoring, but monitoring in real time is just not enough. To truly optimize internal and external site performance, the key is to have performance information in advance.


This advance information is best gained by establishing a baseline, then comparing activity to that standard. With a baseline in place, a system can be configured to provide alerts based on information that strays from the baseline. That way, troubleshooting can start immediately and the root cause can be uncovered before it impacts customers. By anticipating an impending usage spike that will push capacity limits, the IT team can be proactive and avoid a slowdown.
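A baseline-and-deviation check of this kind can be sketched in a few lines. The metric (response time), the sample values, and the three-sigma threshold below are illustrative assumptions, not anything prescribed by the article:

```python
# Minimal sketch of baseline-based alerting: collect historical samples,
# establish a baseline (mean and standard deviation), and flag readings
# that stray too far from it.
from statistics import mean, stdev

def build_baseline(samples):
    """Return (mean, stdev) of historical samples."""
    return mean(samples), stdev(samples)

def is_anomaly(value, baseline_mean, baseline_stdev, n_sigmas=3):
    """Alert when a reading is more than n_sigmas from the baseline."""
    return abs(value - baseline_mean) > n_sigmas * baseline_stdev

# Example: a week of response-time samples in milliseconds
history = [120, 118, 125, 122, 119, 121, 123]
mu, sigma = build_baseline(history)

print(is_anomaly(121, mu, sigma))  # a normal reading
print(is_anomaly(400, mu, sigma))  # strays from the baseline: alert
```

In practice a monitoring platform does this for you; the point is that the alert threshold comes from observed history rather than a guessed static number.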


That historical baseline will also help allocate resources more accurately and enable capacity planning: historical analysis lets IT managers configure the system to send an alert when projected demand approaches capacity limits.


Automation is also a critical piece of performance monitoring. If the site crashes over the weekend, automated tools can restart it and send an alert when it’s back up so the team can start troubleshooting.
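A watchdog of that sort can be sketched roughly as follows. The health-check URL, service name, and restart command are hypothetical placeholders, not taken from the article:

```python
# Hypothetical watchdog sketch: poll a health-check URL and restart the
# site's service when the check fails. Real deployments would add alerting
# and run this from a scheduler.
import subprocess
import urllib.request

HEALTH_URL = "http://localhost:8080/health"       # hypothetical endpoint
RESTART_CMD = ["systemctl", "restart", "mysite"]  # hypothetical service

def site_is_up(url, timeout=5):
    """True when the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def watchdog_once(url=HEALTH_URL, cmd=RESTART_CMD, check=site_is_up):
    """Run one watchdog pass; return True if a restart was issued."""
    if check(url):
        return False
    subprocess.run(cmd, check=False)  # restart; alerting would follow here
    return True

if __name__ == "__main__":
    watchdog_once()
```

The `check` parameter is injected so the decision logic can be exercised without a live site.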


End-user experience monitoring


Understanding the customer experience is a critical piece of ensuring optimal site performance. Let’s say the back-end performance looks good, but calls are coming in from end-users that the site is slow. Ideally, IT staff would be able to mimic a user’s experience, from wherever that user is located, anywhere around the world. This allows the team to isolate the issue to a specific location.
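A synthetic probe of this kind boils down to timing a request the way a user would experience it, then comparing timings across locations. The URL handling is standard library; the locations and numbers below are made-up examples:

```python
# Sketch of synthetic end-user monitoring: time an HTTP GET as a remote
# user would experience it, then compare timings across probe locations
# to isolate a regional issue.
import time
import urllib.request

def probe_ms(url, timeout=10):
    """Response time of one synthetic request, in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

def slowest_location(probe_results):
    """probe_results maps location -> response time in ms."""
    return max(probe_results, key=probe_results.get)

if __name__ == "__main__":
    results = {
        "washington-dc": 180,
        "denver": 210,
        "frankfurt": 950,  # outlier: points at a regional problem
    }
    print(slowest_location(results))  # frankfurt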


It is important to note that federal IT pros face a unique challenge in monitoring the end-user experience. Many monitoring tools are cloud based, and therefore will not work within a firewall. If this is the case, be sure to find something that works inside the firewall that will monitor internal and external sites equally.


Data integration


The ultimate objective is to bring all this information together to provide visibility across the front end and back end alike, so the team knows where to start looking for any anomaly, no matter where it originates.
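One simple way to put front- and back-end data side by side is to correlate them by timestamp: when a front-end latency spike occurs, look at the nearest back-end samples. The metric names and values below are illustrative, not from the article:

```python
# Sketch of correlating a front-end latency spike with back-end metrics
# by timestamp, to decide which tier to investigate first.
def nearest_backend_sample(spike_ts, backend_samples):
    """Return the back-end sample closest in time to a front-end spike."""
    return min(backend_samples, key=lambda s: abs(s["ts"] - spike_ts))

backend = [
    {"ts": 100, "cpu": 35},
    {"ts": 160, "cpu": 97},  # CPU saturation near the spike
    {"ts": 220, "cpu": 40},
]
print(nearest_backend_sample(150, backend))
```

Here a front-end spike at t=150 lines up with a back-end CPU saturation sample, so the back end is the place to start looking.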


The goal is to improve visibility in order to optimize performance. The more data IT pros can muster, the greater their power to optimize performance and provide customers with the optimal experience.


Find the full article on Government Computer News.

With monitoring, we try to achieve end-to-end visibility for our services, so everything that supports business-critical applications needs to be watched. For the usual suspects like switches, servers, and firewalls we have great success with that. But every environment has these black spots on the map that nobody is taking care of. There are two main categories of reasons why something is not monitored: the organisational ("not my department") and the technical.




The "Not my Department" Problem

In IT, the different departments sometimes look only after the devices they are responsible for, and nobody has established a view over the complete infrastructure. That silo mentality ends in a lot of finger pointing and ticket ping-pong. Even more problematic are devices that are under the control of a third-party vendor or non-IT people. For example, the power supply of a building is the responsibility of facility management, and in their mindset monitoring has a completely different meaning to the one we have in IT. We have built fully redundant infrastructures and put a lot of money and effort into making sure that every device has a redundant power supply, only to find that it all ends up in a single power cord running to a single diesel generator that was built in the 1950s. The facility management's idea of monitoring is to walk to the generator twice a day and look at the front panel of the machine.




Then there are the technical problems that can be the reason why something is not monitored. Here are some examples of why it is sometimes hard to implement monitoring from a technical perspective. Ancient devices: like the diesel generator mentioned above, there are old devices from an era without any connectors that can be used for monitoring, or a very old Unix or host machine. I have found all sorts of tech that was still important for a specific task; because it couldn't be decommissioned, it remained a dependency for a needed application or task. If it is still that important, then we have to find a way to monitor it, ideally by connecting the way we usually do, with SNMP or an agent. If the device supports none of these connections, we can try to watch the service that is delivered through the device, or add an extra sensor that can be monitored. For the power generator, for example, maybe we cannot watch the generator directly, but we can insert a device such as a UPS that can be watched over SNMP and shows the current power output. With intelligent PDUs in every rack you can achieve even more granularity on the power consumption of your components. Often all the components of a rack are replaced roughly every two years, while the rack and the power strip have been in use for 10+ years. The same is true for the cooling systems: there are additional sensor bars available that feed your monitoring with data in case the cooling plant cannot deliver that data itself. With good monitoring you can react before something happens.
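Reading a UPS over SNMP can be sketched roughly as below. The OID used is upsOutputPower from the standard UPS-MIB (RFC 1628); many UPSes expose power through a vendor MIB instead, so treat the OID, hostname, and community string as assumptions. The sketch shells out to net-snmp's snmpget:

```python
# Hedged sketch: poll a UPS for its output power over SNMP by shelling
# out to net-snmp's snmpget and parsing the reply.
import subprocess

UPS_HOST = "ups1.example.com"                    # hypothetical device
OID_OUTPUT_POWER = "1.3.6.1.2.1.33.1.4.4.1.4.1"  # UPS-MIB upsOutputPower, line 1 (assumed)

def parse_snmp_integer(reply_line):
    """Parse an 'OID = INTEGER: 1200' style reply from snmpget."""
    return int(reply_line.rsplit(":", 1)[1].strip())

def read_output_power(host=UPS_HOST, community="public"):
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", community, host, OID_OUTPUT_POWER],
        text=True,
    )
    return parse_snmp_integer(out)

if __name__ == "__main__":
    print(read_output_power(), "watts")
```

Once readings like this land in the monitoring system, the same baseline-and-alert treatment applies to power as to any other metric.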




Another case is passive technology like CWDM/DWDM or antennas. These can only be monitored indirectly, through other components that are capable of proper monitoring. With GBICs that have an active measurement/DDM interface you have access to real-time data that can be fed into the monitoring. Once you have this data in your monitoring, you have a baseline and know how the attenuation across your CWDM/DWDM fibres should look. As a final thought, try to take a step back and figure out what is needed for your services to run. Think in all directions and take nothing as a given. Include everything you can think of, from climate to power, and include all dependencies of storage, network, and applications. With that in mind, take a look at the monitoring and check whether you cover everything.
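The baseline check for an optical link is simple arithmetic once the transceiver reports receive power: compare today's reading against the recorded baseline and flag excessive extra attenuation. The dBm values and the 3 dB tolerance below are made-up examples:

```python
# Sketch of a baseline check for optical links: compare the receive power
# reported by a DDM-capable transceiver against a recorded baseline and
# flag excessive extra attenuation.
def extra_loss_db(baseline_dbm, current_dbm):
    """Additional attenuation relative to the baseline, in dB."""
    return baseline_dbm - current_dbm

def link_degraded(baseline_dbm, current_dbm, tolerance_db=3.0):
    return extra_loss_db(baseline_dbm, current_dbm) > tolerance_db

print(link_degraded(-7.0, -8.5))   # 1.5 dB of extra loss: within tolerance
print(link_degraded(-7.0, -15.0))  # 8 dB of extra loss: degraded
```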

In previous posts, I've talked about the importance of having a network of trusted advisors. I've also discussed the importance of honing your DART-SOAR skills. Now I'd like us to explore one of those soft and cloudy topics that every IT professional deals with, but is reluctant to address directly. And that is the business of breaking down personal silos of inefficiency, particularly as it pertains to IT knowledge and expertise.


As an IT professional, I tend to put all the pressure of knowing and doing everything on myself, aka Team Me. I've been rewarded for this behavior, but it has also proven to be ineffective at times. This is because the incentives could influence me to not seek help from anyone outside the sphere of me.


The majority of my struggle was trust-related: I thought that discussing something I knew little or nothing about would be a sign of weakness. Oh, how naïve my green, professional self was. This modus operandi did harm to me, my team, and my organization, because its inefficiencies created friction where there didn’t need to be any.


It wasn’t until I turned Me into We that I started truly owning IT. By believing in its core tenet and putting it into practice, it opened doors to new communities, industry friends, and opportunities. I was breaking down silos by overcoming the restrictions that I placed on myself. I was breaking my mold, learning cool new stuff, and making meaningful connections with colleagues who eventually became friends.


It reminds me of my WoW days. I loved playing a rogue and being able to pick off opponents in PvP Battlegrounds. But I had to pick my battles, because though I could DPS the life out of you, I didn’t have the skills to self-heal over time, or tank for very long. So engagements had to be fast and furious. It wasn't until I started running in a team with two druids (a tank and a healer), that we could really start to own our PvP competition. My PvP teammates also played rogues and shared their tips and tricks, which included Rogue communities with game play strategies. As a result, I really learned how to optimize my DPS and my other unique set of skills toward any given goal.


Do you stand on the IT front and try to win alone? Have you found winning more gratifying when you win as a team? Let me know in the comment section below.
