Great Scott! Round 2 was full of surprise upsets and matches that were almost too close to call! Bracket Battle 2017 continues to be wildly unpredictable; this might come down to a photo finish, folks!


Let’s see how our sidekicks did in this round:



  • Round 2: Groot vs Dr. Emmett Brown  There was a lot of support for Doc Brown in the comment section, but it was Groot who won this round by a splinter! amore1329 explained, “Well, inventing the Flux Capacitor was impressive, but we are talking about an intergalactic space plant, with anger issues and an extremely limited vocabulary...who just happened to save his entire team from certain death and then came back as a cute dancing tree who has great taste in music.”

  • Round 2: Shaggy Rogers vs Garth Algar  This match-up had us all on the edges of our seats! Shaggy had the early lead, then Garth pulled ahead, and then Shaggy again. From downtown with the buzzer shot it was... Garth FTW! It all came down to 5 votes!
  • Round 2: Pinky vs Donkey  Pinky proved to us all that it’s not how big you are, it’s how big you play. Donkey will be okay though, he informed us that he’s going to wake up tomorrow and make waffles. sparda963 totally predicted the outcome of this match, “I think this one will be a nail-biter for sure! I had to give it to Pinky though. If it wasn't for him, Brain would have taken over the world easily! So Pinky saved us all over and over again.”



  • Round 2: Hermione Granger vs Bucky Barnes  tinmann0715 called this one early on, “I predict blowout. Seriously... however close this ends up it shouldn't even have been that close.” familyofcrowes summed up this match in 3 words: “magic beats bionics....”
  • Round 2: Wilson vs Dr. Watson  Wilson really dropped the ball this round and was voted off the island. ecklerwr1 commented, “Whew I can't even believe that the volley ball won the first round. Thank goodness Dr. Watson has the ball literally "on the ropes" lololol!”
  • Round 2: Samwise vs Gromit  Samwise manages to stay in this battle because he knows how to be a loyal team player.
  • Round 2: Chewbacca vs Tonto  danielv said what we were all thinking, “Expected this to be a landslide - not surprised by the results!”
  • Round 2: Ford Prefect vs Agent K  This was a tough loss for all the true geeks out there. Good effort, guys. Ford lost by almost 42%. mcam wrote, “true geeks would be Ford all the way K was cool just because of TLJ and no other reason. Ford was part of the team that got us 42 - I mean, come on now”


Were you surprised by any of the outcomes for this round? Comment below!


It’s time to check out the updated bracket & start voting for the ‘Honorable’ round! We need your help & input as we get one step closer to crowning the ultimate sidekick!


Access the bracket and make your picks HERE>>

Spring is here! Well, for some of us, I guess. It's still cold and rainy here, which is why I'm looking forward to going somewhere warm next week: Telford! I will be at SQLBits next week delivering a full-day training session with datachick as well as our popular Database Design Throwdown. If you are near Telford...ok, *no one* is near Telford...so if you are able to get to Telford, stop and say hello, we'd love to talk data with you.


As always, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!


The Fictiv Open Source Motorcycle

I'm not certain how I feel about building a motorcycle out of 3D printed parts, but I will admit to being intrigued about the possibilities that 3D printing will bring to everyone.


2038: only 21 years away

For those of us who remember the money we made with Y2K "issues", here's a reminder that time runs out in 21 short years on Linux systems that still keep time in a signed 32-bit counter.
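
If you want to see where that date comes from, here's a quick sketch. It assumes the classic case of a signed 32-bit count of seconds since January 1, 1970 (UTC); systems that already use a 64-bit time value don't hit this particular wall.

```python
from datetime import datetime, timedelta, timezone

# Assumption: time is stored as a signed 32-bit count of seconds
# since the Unix epoch (1970-01-01 00:00:00 UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit integer can hold

# The last moment such a counter can represent before it overflows.
last_moment = EPOCH + timedelta(seconds=MAX_INT32)
print(last_moment.isoformat())
```

One second after that moment, a signed 32-bit counter has nowhere left to go.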


Why American Farmers Are Hacking Their Tractors With Ukrainian Firmware

I'm certain this will end well. What's the worst that can happen?


Secret colours of the Commodore 64

I had the Commodore VIC-20, my friends had a 64. None of us thought about stuff like this.


Dishwasher has directory traversal bug

One day I hope we find a way to make manufacturers accountable for such awful security practices.


Senate Puts ISP Profits Over Your Privacy

The crazy things our elected officials do with our privacy, and with our beliefs and expectations that we have any.


Laptops, tablets and other gadgets banned from cabin on some US-bound flights

As someone who travels frequently, this has me concerned. I'm not likely to travel without my laptop, and I don't trust putting my laptop into my checked bag.


Looking forward to another great SQL Bits event next week:


By Joe Kim, SolarWinds Chief Technology Officer


I recently presented at our Federal User Group meeting in Washington, DC and hybrid IT was the hottest topic at the event.  It reminded me of a blog written last year by our former Chief Information Officer, Joel Dolisy, which I think is still very valid and worth a read.



It’s really no surprise that agencies are embracing hybrid clouds given that federal IT is balancing the need for greater agility, flexibility, and innovation, with strict security control. Hybrid clouds provide the perfect alternative, because they allow agencies to become more nimble and efficient while still maintaining control.


In today’s IT, there’s no room for barriers. Things exist in many places; applications, for example, must be stateless, mobile, and easily scalable to accommodate periods of peak demand.


Hybrid clouds offer three specific benefits that on-premises or hosted solutions cannot offer.


Lockdown Security


The hybrid model can help alleviate agencies’ cloud security concerns. Agencies can opt to keep extremely sensitive data on-premises in private clouds, while using public clouds to run applications.


That said, some agencies have begun placing contractor-owned and operated cloud offerings directly onto their networks. The contractors are providing physical security and boundary protection based on agency requirements. This helps solve acquisition challenges and allows agencies to more readily adopt innovative new commercial technologies.


Better Disaster Recovery


Systems may simply go down due to events that are beyond anyone’s control – power outages, hurricanes, and other phenomena. These situations require disaster recovery programs, and hybrid clouds can play a part in their implementation.


Hybrid clouds make disaster recovery far easier to implement and manage, and far more cost-effective. First, there’s no intensive physical installation, because everything is software-defined. Second, the hybrid cloud model can be more financially beneficial to an organization, especially agencies with tight budgets.


Greater Efficiency


With big data, processing time can sometimes take weeks, which may as well be a lifetime. Integrating existing on-premises computers with off-site hosted cloud resources can shave that time down to minutes.


Hybrid architectures make this possible. An organization can spin up hundreds of extra processors as necessary, for a specified period of time. This can help ensure that applications remain fully accessible and functioning, and speed up the dissemination of critical information.


There are other types of savings that can be had, specifically in terms of space. Data center consolidation has been in full swing since 2010, when the Federal Data Center Consolidation Initiative (FDCCI) was first introduced. A hybrid approach can help agencies in this effort by allowing them to save an enormous amount of space currently dedicated to compute resources that can be virtualized.


There are many benefits to hybrid cloud deployments, but they still have to be monitored closely. I wrote about the need for network monitoring, and the same rule of thumb can be applied to the cloud. Administrators must monitor the servers and applications that exist within both their on-premises and hosted environments, preferably with an agentless tool. While hybrid clouds do offer the best of both worlds, agencies will always want to make sure the workloads they’re running within those worlds remain fully optimized. It’s also important to note that hybrid cloud deployment should be the goal, not just a transitional state.


Find the full article on our partner DLT’s blog, Technically Speaking.



Holy round one upset Batman! Our fifth annual bracket battle has already proven to be the most unpredictable we have ever seen. Maybe it’s the wide range of characters who span decades of pop-culture, but the community seemed pretty torn on most match-ups in the “noteworthy” round.


Let’s take a look at how our sidekicks fared in this round:



  • Robin Vs Groot: For many, Robin is the quintessential sidekick; however, he was no match for Groot, who gained fame in Guardians of the Galaxy. ecklerwr1 offered one possible explanation for this huge upset: “I think this is a little age related to be honest. Many younger users probably didn't watch all the batman and robin movies and cartoons.”

  • Goose Vs Chewbacca:  No surprises in this match-up, the Wookiee warrior easily takes down the Top Gun Wingman. ajmalloy “Goose drew the worst possible first round opponent”

  • Samwise Vs Tina Ruth Belcher:  Easily one of the biggest shutouts this round! It appears that loyalty and sacrifice were more valued attributes in a sidekick than an obsession with zombies and working hard at the family restaurant. tallyrich “Samwise - that's a sidekick that will do what it takes.”

  • Shaggy Rogers Vs Morty Smith: KMSigma explains how this was a loyalty vote for him (and everyone else, apparently): “Sorry man - loyalty wins.  I do love me some Morty and can't wait for the next season, but we have, what, 20 episodes of R&M and have had Norville "Shaggy" Rogers since 1969?  And originally voiced by Casey Kasem?  Sorry, but Summer's stuttering little brother doesn't compare for me.”

  • Willow Rosenberg Vs Dr. Emmett Brown: Great Scott! This was no contest. The time traveling mad scientist easily beat out the vampire-slaying sidekick.

  • Rick Jones Vs Hermione Granger: Hermione used her magic to run away with the win. Maybe Rick would have had better luck if he were a more loyal sidekick. tinmann0715 “Points lost from Rick because he was a sidekick to so many different characters. In my eyes a sidekick is to be loyal.”

  • Bob, Agent Of Hydra Vs. Bucky Barnes: Neither sidekick seemed very popular in this match-up, nevertheless Bucky managed to win this round in a landslide.

  • Genie Vs. Dr. Watson: Magic lamps & wishes were surprisingly no match for the mystery solving sidekick, Dr Watson.

  • Pinky Vs Barney Rubble: chippershredder summed up this match-up perfectly “So Brain, What are we going to do today?” “Same thing we do every day, Pinky.  Use SolarWinds to take over the world!”

  • Barney Fife Vs Agent K: Unfortunately, the deputy sheriff of Mayberry was no match for the MIB.

  • Ford Prefect Vs Keenser: Ford Prefect FTW in this intergalactic shutout!



  • Jiminy Cricket Vs. Wilson: This match-up came down to the wire, but in the end it was the Castaway companion who came out ahead. For a match-up between a cricket and a volleyball, it was decidedly heated. jeremymayfield “How can a bloody volleyball be beating the legend, an icon?  Its just not right people.   The ball didn't even make it through the entire movie, he washed away. Jiminy appears in many more cartoons over the years.”

  • Dwight Schrute Vs. Gromit: Choosing between a goofy office sidekick and a life-saving dog was tough for everyone. In the end Gromit came out ahead. silverwolf “Gromit definitely! He saves everyone waaaayyyy tooo many times.”

  • Bullwinkle Vs Donkey: Another battle of the generations, but the voters who grew up with Shrek decided the winner in this match-up!

  • Luigi Vs Garth Algar: Garth is on a winning streak! First, he beats SpongeBob and now Luigi! Mamma Mia!

  • Tonto Vs Short Round: Without a doubt the closest match-up of this round. ecklerwr1 was all of us when he said “Wow can't believe this is close... Tonto FTW!”


Were you surprised by any of the shutouts or nail-biters for this round? Comment below!


It’s time to check out the updated bracket & start voting for the ‘Admirable’ round! We need your help & input as we get one step closer to crowning the ultimate sidekick!


Access the bracket and make your picks HERE>>

Troubleshooting efficiency and effectiveness are core to uncovering the root cause of incidents and bad events in any data center environment. As I noted in my previous post about the troubleshooting radius and the IT seagull, troubleshooting efficacy is the key performance indicator in fixing things fast. But troubleshooting is an avenue that IT pros dare not walk too often for fear of being blamed for being incompetent or incorrect.


We still need to be right a lot more than we are wrong. Our profession gives no quarter when things go wrong. The blame game, anyone? When I joined IT operations many years ago, one of my first mentors gave me some sage advice from his own IT journey. It’s similar to the three envelope CEO story that many IT pros have heard before.

  1. When you run into your first major (if you can’t solve it, you’ll be fired) problem, open the first envelope. The first envelope’s message is easy – blame your predecessor.
  2. When you run into the second major problem, open the second envelope. Its message is simple: reorganize, i.e., change something, whether it’s your role or your team.
  3. When you run into the third major problem, open the third envelope. Its message is to prepare three envelopes for your successor because you’re changing companies, willingly or unwillingly.


A lifetime of troubleshooting comes with its ups and downs. Looking back, it has provided many an opportunity to change my career trajectory. For instance, troubleshooting the lack of performance boost from a technology invented by the number one global software vendor almost cost me my job; but it also re-defined me as a professional. I learned to stand up for myself professionally. As Agent Carter states, "Compromise where you can. And where you can’t, don’t. Even if everyone is telling you that something wrong is something right, even if the whole world is telling you to move. It is your duty to plant yourself like a tree, look them in the eye and say, no. You move." And I was right.


It’s interesting to look back, examine the events and associated time-series data to see how close to the root cause signal I got before being mired in the noise or vice-versa. The root cause of troubleshooting this IT career is one that I’m addicted to, whether it’s the change and the opportunity or all the gains through all the pains.


Share your career stories, and how a troubleshooting mishap or troubleshooting gold brought you shame or fame, in the comment section below.



On Day Zero of being a DBA I inherited a homegrown monitoring system. It didn't do much, but it did what was needed. Over time we modified it to suit our needs. Eventually we got to the point where we integrated with OpsMgr to automate the collection and deployment of monitoring alerts and code to our database servers. It was awesome.


The experience of building and maintaining my own homegrown system, combined with working for software vendors, has taught me that every successful monitoring platform needs to have five essential components: identify, collect, share, integrate, and govern. Let's break down what each of those means.



Identify


A necessary first step is to identify the data and metrics you want to monitor and alert upon. I would start this process by looking at a metric and putting it into one of two classes: informational or actionable. Informational metrics were the ones I wanted to track but didn't need to be alerted upon. Actionable metrics were the ones I needed to be alerted upon because I had to take some action in response. For more details on how to identify what metrics you want to collect, check out the Monitoring 101 guide, and look for the Monitoring 201 guide coming soon.
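
To make the two classes concrete, here is a minimal sketch in Python. The metric names and the `needs_response` flag are hypothetical, made up for illustration; the only idea taken from the text is that actionable metrics generate alerts while informational metrics are simply tracked.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    needs_response: bool  # True if someone must act when this metric misbehaves


def classify(metric: Metric) -> str:
    """Place a metric into one of the two classes."""
    return "actionable" if metric.needs_response else "informational"


# Hypothetical metrics, for illustration only.
metrics = [
    Metric("database_size_gb", needs_response=False),
    Metric("failed_backup_count", needs_response=True),
]

alert_on = [m.name for m in metrics if classify(m) == "actionable"]
track_only = [m.name for m in metrics if classify(m) == "informational"]
print("alert on:", alert_on)
print("track only:", track_only)
```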



Collect


After you identify the metrics you want, you need to decide how you want to collect and store them for later use. This is where flexibility becomes important. Your collection mechanism needs to be able to consume data in varying formats. If you build a system that relies on data being in a perfect state, you will find yourself easily frustrated the first time some imperfect data is loaded. You will also find yourself spending far too much time playing the role of data janitor.
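
Here is a rough sketch of what a forgiving collection step might look like. Everything in it is an assumption for illustration: the `parse_reading` helper, the field names, and the sample rows are all hypothetical. The point is only that bad rows get set aside for review instead of stopping the load.

```python
from typing import Optional


def parse_reading(raw: dict) -> Optional[dict]:
    """Return a cleaned reading, or None if the row cannot be salvaged."""
    name = str(raw.get("metric", "")).strip()
    # Tolerate stray whitespace and thousands separators in the value.
    value_text = str(raw.get("value", "")).strip().replace(",", "")
    if not name:
        return None
    try:
        value = float(value_text)
    except ValueError:
        return None
    return {"metric": name, "value": value}


# Hypothetical incoming rows: some clean, some not.
incoming = [
    {"metric": "cpu_percent", "value": "87.5"},
    {"metric": " page_life_expectancy ", "value": "1,200"},
    {"metric": "cpu_percent", "value": "n/a"},
    {"value": "3"},
]

accepted, rejected = [], []
for row in incoming:
    cleaned = parse_reading(row)
    if cleaned is None:
        rejected.append(row)  # keep the raw row so someone can review it later
    else:
        accepted.append(cleaned)

print(len(accepted), "accepted;", len(rejected), "set aside")
```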



Share


Now that your data is being collected, you will want to share it with others, especially when you want to help provide some details about specific issues and incidents. As much as you might love the look of raw data and decimal points, chances are that other people will want to see something prettier. And there's a good chance they will want to be able to export data in a variety of formats, too. More than 80% of the time your end-users will be fine with the ability to export to CSV format.
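
Since CSV covers most requests, the export piece can be small. The sketch below uses Python's standard `csv` module; the `to_csv` helper, the column names, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical column layout for exported readings.
COLUMNS = ["collected_at", "metric", "value"]


def to_csv(rows):
    """Render a list of reading dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = [
    {"collected_at": "2017-03-22T10:00:00", "metric": "cpu_percent", "value": 87.5},
    {"collected_at": "2017-03-22T10:05:00", "metric": "cpu_percent", "value": 42.0},
]

csv_text = to_csv(sample_rows)
print(csv_text)
```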



Integrate


With your system humming along, collecting data, you are going to find that other groups will want that data. It could also be the case that you need to feed your data into other monitoring systems. Designing a system that can integrate well with other systems requires a lot of flexibility. It's best that you think about this now, before you build anything, as opposed to trying to make round pegs fit in a square hole later. And it doesn't have to be perfect for every possible case; just focus on the major integration method used the world over that I already mentioned: CSV.



Govern


This is the component that is overlooked most often. Once a system is up and running, very few people consider the task of data governance. It's important that you take the time to define what the metrics are and where they came from. Anyone consuming your data will need this information, as well. And if you change a collection, you need to communicate the changes and the possible impacts they may have for anyone downstream.
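
One lightweight way to start is a small catalog that records, for each metric, what it means, where it comes from, and who consumes it. This is only a sketch; the `catalog` structure, the metric, and the consumer names are hypothetical.

```python
# Hypothetical metric catalog: a definition, a source, and downstream consumers.
catalog = {
    "cpu_percent": {
        "definition": "Average CPU utilization over the collection interval",
        "source": "operating system performance counters",
        "consumers": ["ops_dashboard", "capacity_report"],
    },
}


def consumers_to_notify(metric_name):
    """List who should hear about a change to how this metric is collected."""
    entry = catalog.get(metric_name)
    if entry is None:
        raise KeyError(f"{metric_name} is not documented in the catalog")
    return sorted(entry["consumers"])


print(consumers_to_notify("cpu_percent"))
```

An undocumented metric raises an error here on purpose: if nobody wrote down what it is, nobody downstream can be told when it changes.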


When you put those five components together you have the foundation for a solid monitoring application. I'd even surmise that these five components would serve any application well, regardless of purpose.

March madness is upon us! I'm not just talking about the NCAA tournament here in the United States, I'm also talking about the Bracket Battle 2017 here on THWACK®. As a former basketball player and coach, I love March. Not only do we get treated to some of the best basketball games, but spring arrives to help us forget all about the winter months.


Anyway, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!


Netflix is launching a simplified rating system to improve its suggestions

This sounds great until you realize you use your profile to watch movies with your kids, and that's why you keep seeing "Goosebumps" in your recommendations.


GitHub - bcicen/ctop: Top-like interface for container metrics

Not a post, but a link to GitHub where you can find a project called "ctop," which is like the top command, but for containers. Yes, I find this fascinating.


Star Trek Ransomware Boldly Encrypts

It was only a matter of time before the ransomware folks got around to using Star Trek in some cute attempt to make their crimes seem less horrible.


Password Rules Are BS

A nice reminder about the disservice we've done to ourselves with password rules.


The most detailed maps of the world will be for cars, not humans

The amount of data behind autonomous cars is staggering. Here's hoping we don't have to wait for them much longer.


Thieves are pickpocketing wallet apps in China

Yet another reason to not enjoy QR codes: they are now being used for theft. It won't be long before this crime becomes common here, I expect. This is also the time to remind you that QR codes kill kittens.


From a walk in Brussels a few years ago, I think something got lost in translation:


By Joe Kim, SolarWinds Chief Technology Officer


We all know that network monitoring is absolutely essential for government IT pros to help ensure IT operations are running at optimal performance. That said, with so many tools available, it’s tempting to monitor everything. But be careful: monitoring everything can quickly turn into too much of a good thing.


Having an excessive number of monitoring tools and alerts can result in conflicting metrics and data, overly complex systems, and significant management challenges, all working together to undermine an administrator’s ability to accurately identify true network problems.


Understanding why, and for whom, systems are monitored will help IT pros implement the tools that are most useful for enhancing agency IT operations.


The Importance of Monitoring


Remember, monitoring is critical. The cost of downtime alone makes monitoring operational metrics a necessity. In fact, the value of monitoring is sometimes the driver for “over-monitoring.” Some IT pros may think, “The more tools I have, the more insight I get.”


The number and types of monitoring tools available have grown to cover everything from bandwidth, security systems, servers, code management, and implementation metrics all the way to high-level operational metrics.


Unfortunately, most of these tools work independently, and agencies will patch several tools together -- each providing different metrics -- to create a massive monitoring system. With this complex system, monitoring becomes a task in and of itself, taking up IT pros’ valuable time instead of providing a seamless foundation of accurate and actionable monitoring data.


Agencies must make smart decisions to remain nimble and keep pace, and that means avoiding mammoth, costly monitoring systems. Solutions that neatly aggregate an agency’s preferred metrics deliver better availability, security, and performance.


Find an ideal monitoring solution by evaluating the response to two questions:


For whom am I monitoring? Are metrics more important to the operations engineer, the project manager, or agency management? There may be a wide array of monitoring needs, even within the engineering contingent. Determine in advance your monitoring “customer.”


What metrics do I really need? What is required to keep things running smoothly, without drowning in alerts and data? Too many alerts and too much data is a frighteningly common problem. Even worse, investing in a separate tool for each is costly and inefficient.


In a nutshell, agencies should identify the most valuable audience and metrics to avoid the need for multiple tools.


Focus on the Data


Remember, the point of monitoring is to inform operational decisions based on collected data. This should be the point that drives monitoring decisions, and the reason to consider investing in a comprehensive monitoring tool.


With an increasing demand for a more digital government, maintaining insights into the infrastructure and application level of the IT operations within the agency is critical. Focusing on the audience and the agency’s specific needs will help ensure a streamlined monitoring solution that helps drive mission success.


Find the full article on Government Computer News.



They’re usually not the ones in the spotlight, although some steal the show.

They’re not the captains; they don’t decide which way to go.

They’re the unsung heroes, the ones in the shadows.

They’ve got your back when you’re battling your foes.

Sidekick is their title; they don’t need to gloat.

One will be the winner after you cast your vote!

Bracket battle is back and bigger than ever.

By the time we’re done, we’ll crown the best sidekick once and forever!


Starting today, 33 of the most popular sidekicks will battle it out until only one remains and reigns supreme as the ultimate sidekick. We’ve handpicked a wide range of sidekicks from TV, movies, comics, and video games to make this one of the most diverse bracket battles yet. The starting categories are as follows:

  • What are we going to do tonight, Brain?
  • You can be my wingman any time.
  • Holy ____, Batman!
  • Let your conscience be your guide.


We picked the starting point and initial match ups; however, just like in bracket battles past, it will be up to the community to decide who they would want as their partner in crime.



Bracket battle rules:

Match up analysis:

  • For each sidekick, we’ve provided reference links to wiki pages—to access these, just click on their NAME on the bracket
  • A breakdown of each match up is available by clicking on the VOTE link
  • Anyone can view the bracket and match ups, but in order to vote or comment, you must have a THWACK® account and be logged in



Voting:

  • Again, you must be logged in to vote and trash talk
  • You may vote ONCE for each match up
  • Once you vote on a match, click the link to return to the bracket and vote on the next match up in the series
  • Each vote earns you 50 THWACK points! If you vote on every match up in the bracket battle, you can earn up to 1,550 points!



Campaigning:

  • Please feel free to campaign for your favorite sidekicks and debate the match ups via the comment section (also, feel free to post pictures of bracket predictions on social media)
  • To join the conversation on social media, use hashtag #SWBracketBattle
  • There is a PDF printable version of the bracket available, so you can track the progress of your favorite picks



Schedule:

  • Bracket Release is TODAY, March 20th
  • Voting for each round will begin at 10 a.m. CDT
  • Voting for each round will close at 11:59 p.m. CDT on the date listed on the bracket home page
  • Play-in battle opens TODAY, March 20th
  • Round 1 OPENS March 22nd
  • Round 2 OPENS March 27th
  • Round 3 OPENS March 30th
  • Round 4 OPENS April 3rd
  • Round 5 OPENS April 6th
  • Ultimate sidekick announced April 12th


If you have any other questions, please feel free to comment below and we’ll get back to you!

Which one of these sidekicks would you want as your copilot?

We’ll let the votes decide!


Access the bracket overview HERE>>


When I head out to conventions, especially the bigger ones like Cisco Live, I always expect to find some darling technology that has captured imaginations and become the newest entry in every booth denizen's buzzword bingo word list. And most of the time, my expectation is grounded in experience. From SDN to IoT, and on through cloud, container, and Blah Blah as a Service (BaaS), each trend is heralded with great fanfare, touted with much gusto, and explained with significant confusion or equivocation.


Not this year.


Chalk it up to the influence of Berlin's tasty beer and solid work ethic if you want, but this year the crowd was clearly interested in "the work of the work," as I like to call it, or "less hat, more cattle," as my friends in Austin might phrase it.


Don't get me wrong. The sessions were engaging as ever. The vendor floor was packed. The attendees came early and stayed each day to the end. The DevNet area was bigger than ever before. It was, by every measure, a great conference.


More about DevNet: While there were a lot of younger faces, there was no shortage of folks who clearly had put their years in. Patrick was the first to notice it, and it's worth highlighting. Folks with deep skills in a technical area were taking time to begin training on the "new thing," a set of skills that are up-and-coming, which do not match, in any way, the techniques they use right now, and which may not even bear a resemblance to their current job. But they were there, session after session, soaking it in and enjoying it.


But as I commented to patrick.hubbard and ding, I hadn't yet found "it." And they both pointed out that sometimes the "it" is simply thousands of people spending time and money to come together and share knowledge, build connections, and enjoy the company of others who know what they know and do what they do.


Meanwhile, a steady stream of visitors came to the SolarWinds booth asking detailed questions and waiting for answers. Sometimes they had several questions. Often, they wanted to see demos on more than one product. They ooh'ed and ah'ed over our new showstoppers like NetPath and PerfStack (more on those in a minute), but stuck around to dig into IPAM, VNQM, LEM, and the rest.


After speaking to someone for a few minutes, visitors were less apt to say, "Can I have my T-shirt now?" and more likely to say, "I would also like to see how you can do ______." For a company that staffs its trade show booths with an "engineers-only" sensibility, it was deeply rewarding.


But there was no "trendy" thing people came asking about. There simply was no buzz at this show.


Unless - and I'm just throwing it out there - it was US.


You see, about a month before Cisco Live, SolarWinds was identified as the global market share leader for network management software (read about it here: http://www.solarwinds.com/company/press-releases/solarwinds-recognized-as-market-leader-in-network-management-software). Now that's a pretty big deal for us back at the office, but would it matter to in-the-trenches IT pros?


It mattered.


They came to the booth asking about it. To be honest, it was a little weird. Granted, a kind of weird I could get used to, but still weird.


So it turns out that Cisco Live didn't feature a buzz-worthy technology, but instead we found out that we got to be the belle of the ball.


PostScript: Next year, Cisco Live will be in Barcelona, Spain. Espero verte allí y hablar contigo en inglés y español. (I hope to see you there and talk with you in English and Spanish.)


As we come to the end of this series on infrastructure and application data analytics, I thought I'd share my favorite quotes, thoughts, and images from the past few weeks of posts leading up to the PerfStack release.


SomeClown leads the way in The One Where We Abstract a Thing


"Mean time to innocence (MTTI) is a somewhat tongue-in-cheek metric in IT shops these days, referring to the amount of time it takes an engineer to prove that the domain for which they have responsibility is not, in fact, the cause of whatever problem is being investigated. In order to quantify an assessment of innocence you need information, documentation that the problem is not yours, even if you cannot say with any certainty who does own the problem. To do this, you need a tool which can generate impersonal, authoritative proof you can stand on, and which other engineers will respect. This is certainly helped if a system-wide tool, trusted by all parties, is a major contributor to this documentation."


Karen:  Mean Time To Innocence! I'm so stealing that. I wrote a bit about this effect in my post Improving your Diagnostic and Troubleshooting Skills. When there's a major problem, the first thing most of us think is, "PLEASE DON'T LET IT BE ME!"  So I love this thought.


demitassenz wrote in PerfStack for Multi-dimensional Performance Troubleshooting


"My favorite part was adding multiple different performance counters from the different layers of infrastructure to a single screen. This is where I had the Excel flashback, only here the consolidation is done programmatically. No need for me to make sure the time series match up. I loved that the performance graphs were re-drawing in real-time as new counters were added. Even better was that the re-draw was fast enough that counters could be added on the off chance that they were relevant. When they are not relevant, they can simply be removed. The hours I wasted building Excel graphs translate into minutes of building a PerfStack workspace."


Karen:  OMG! I had completely forgotten my days of downloading CSVs or other outputs of tools and trying to correlate them in Excel. As a data professional, I'm happy that we now have a way to quickly and dynamically bring metrics together to make data tell the story it wants to tell.


cobrien wrote in NPM 12.1 Sneak Peek - Using Perfstack for Networks


"I was exploring some of the data the other day. It’s like the scientific method in real-time. Observe some data, come up with a hypothesis, drag on related data to prove or disprove your hypothesis, rinse, and repeat."


Karen:  Data + Science.  What's not to love?


SomeClown mentioned in Perfstack Changes the Game


"PerfStack can now create dashboards on the fly, filled with all of the pertinent pieces of data needed to remediate a problem. More than that, however, they can give another user that same dashboard, who can then add their own bits and bobs. You are effectively building up a grouping of monitoring inputs consisting of cross-platform data points, making troubleshooting across silos seamless in a way that it has never been before."


Karen: In my posts, I focused a lot on the importance of collaboration for troubleshooting. Here, Teren gets right to the point. We can collaboratively build analytics based on our own expertise to get right to the point of what we are trying to resolve.  And we have data to back it up.


aLTeReGo in a post demo-ing how it works, Drag & Drop Answers to Your Toughest IT Questions


"Sharing is caring. The most powerful PerfStack feature of all is the ability to collaborate with others within your IT organization; breaking down the silo walls and allowing teams to triage and troubleshoot problems across functional areas. Anything built in PerfStack is sharable. The only requirement is that the individual you're sharing with has the ability to login to the Orion web interface. Sharing is as simple as copying the URL in your browser and pasting it into email, IM, or even a help desk ticket."


Karen: Yes! I also wrote about how important collaboration is to getting problems solved fast.


demitassenz shared in Passing the Blame Like a Boss


"One thing to keep in mind is that collaborative troubleshooting is more productive than playing help desk ticket ping pong. It definitely helps the process to have experts across the disciplines working together in real time. It helps both with resolving the problem at hand and with future problems. Often each team can learn a little of the other team’s specialization to better understand the overall environment. Another underappreciated aspect is that it helps people to understand that the other teams are not complete idiots. To understand that each specialization has its own issues and complexity."


Karen: Help desk ticket ping pong. If you've ever suffered through this, especially when someone passes the ticket back to you right before the emergency "why haven't we fixed this yet" meeting with the CEO, you'll know the pain of it all.


SomeClown observed in More PerfStack - Screenshot Edition


"In a nutshell, what it allows you to do is to find all sorts of bits of information that you're already monitoring, and view it all in one place for easy consumption. Rather than going from this page to that, one IT discipline-domain to another, or ticket to ticket, PerfStack gives you more freedom to mix and match, to see only the bits pertinent to the problem at hand, whether those are in the VOIP systems, wireless, applications, or network. Who would have thought that would be useful, and why haven't we thought of that before?"


Karen: "Why haven't we thought of that before?" That last bit hit home for me. I remember working on a project for a client to do a data model about IT systems. This was at least 20 years ago. We were going to build an integrated IT management system so that admins could break through the silo-based systems and approaches to solve a major SLA issue for our end-users. We did a lot of work until the project was deferred when a legislative change meant that all resources needed to be redirected to meet those requirements. But I still remember how difficult it was going to be to pull all this data together. With PerfStack, we aren't building a new collection system.  We are applying analytics on top of what we are already collecting with specialized tools.


DataChick's Thoughts


This next part is cheating a bit, because the quotes are from my own posts. But hey, I also like them and want to focus on them again.


datachick in Better Metrics. Better Data. Better Analytics. Better IT.


"As a data professional, I'm biased, but I believe that data is the key to successful collaboration in managing complex systems. We can't manage by "feelings," and we can't manage by looking at silo-ed data. With PerfStack, we have an analytics system, with data visualizations, to help us get to the cause faster, with less pain-and-blame. This makes us all look better to the business. They become more confident in us because, as one CEO told me, "You all look like you know what you are doing." That helped when we went to ask for more resources."


Karen: We should all look good to the CEO, right?


datachick ranted in 5 Anti-Patterns to IT Collaboration: Data Will Save You


"These anti-patterns don't just increase costs, decrease team function, increase risk, and decrease organizational confidence, they also lead to employee dissatisfaction and morale. That leads to higher turnover (see above) and more pressure on good employees. Having the right data, at the right time, in the right format, will allow you to get to the root cause of issues, and better collaborate with others faster, cheaper, and easier.  Also, it will let you enjoy your 3:00 ams better."


I enjoyed sharing my thoughts on these topics and reading other people's posts as well. It seems bloggers here shared the same underlying theme of collaboration and teamwork. That made this Canadian Data Chick happy. Go, everyone. Solve problems together.  Do IT better.  And don't let me catch you trying to do any of that without data to back you up. Be part of #TeamData.

There are many books, websites, and probably self-help videos devoted to teaching or explaining the art of troubleshooting. Most are specific to an industry, and further to a problem domain within that industry. Within each problem domain, within each industry, within each methodology, there are tools of the trade designed to help you solve whatever problem is vexing you at that moment. The specificity of all of this, however, can be abstracted out of these insular, domain-specific modalities to effect a greater understanding of the role of troubleshooting in general.


It goes without saying that you cannot find something that you do not know you are looking for, and yet this is what a lot of neophyte engineers instinctively try. “The phones are down” may seem like the problem you need to fix, but counterintuitively that is only a symptom of the real problem. The real problem, the one causing the phones to be down, lies elsewhere. While you run around trying to figure out what’s up with the phones, what you should be thinking is, “For what reason(s) is/are the phones ‘down’?” and move from there. For example, are all the phones down? Some? Are there other symptoms? And what has changed recently, if anything? Once you’ve worked through some of this, which may only take seconds or minutes for a seasoned engineer, you’re more prepared to move on to the next steps.


Analyzing the problem(s), or problem statements, will help you to form a hypothesis as to where the problem is likely to lie. Now, how can you begin testing your ideas to see if you are on the right track? Well, in the IT world that we all live in (I know, I said abstracted…), you’re going to need information. Information gathering can be a manual process, and in many cases must be, but having good tools at your disposal can certainly help the process along the way, especially when you are shooting in the dark, so to speak. Again, if you don’t know what you don’t know, an automated and impartial tool can help.


Tool impartiality is often overlooked as a step in the discovery phase of troubleshooting any problem. Plumbers have scopes to look inside of pipes that they cannot see; electricians have multi-meters to help them test connectivity, resistance, etc.; and you as an IT professional have tools like PerfStack. A tool like this happily gathers information from all of your systems, jumping to no conclusions, and can call out abnormalities in the steady state of a system. Where many engineers skip straight to the “trying to fix anything they suspect is the problem” phase, PerfStack simply presents what it sees in an impartial and authoritative manner. From its dashboards, an engineer can begin his/her search from a position of knowledge. Combine that with the wisdom that comes from experience, and you have a very strong team.


Mean time to innocence (MTTI) is a somewhat tongue-in-cheek metric in IT shops these days, referring to the amount of time it takes an engineer to prove that the domain for which they have responsibility is not, in fact, the cause of whatever problem is being investigated. In order to quantify an assessment of innocence you need information, documentation that the problem is not yours, even if you cannot say with any certainty who does own the problem. To do this, you need a tool that can generate impersonal, authoritative proof you can stand on, and which other engineers will respect. This is certainly helped if a system-wide tool, trusted by all parties, is a major contributor to this documentation.


A tool like PerfStack will certainly help in getting buy-off from the pointy-haired bosses as to what needs to happen to fix whatever needs fixing. Most organizations have a change control process (though likely an amended one during any kind of outage), and documentation is always a part of that. And all of this stuff, this paper trail from beginning to end, flows together nicely right into the final package that many organizations require for a post-mortem. Engineers and management can get through an after-the-fact incident meeting much more quickly, and with likely consensus, with a clean and robust set of documents.


At the end of the day, troubleshooting is an art no matter what you do, where you do it, or in what industry you live. The methodologies are largely the same at a macro level, as is the need for quality tools. Can a great engineer find the root cause of a problem without a comprehensive tool like PerfStack? Sure. A cobbled-together band of point tools has always been a part of the engineer’s toolkit and likely always will be, at least until our new sentient robotic overlords obviate the need for that. But a full-scale, system-wide solution like PerfStack should also be a part of any well-stocked engineering team’s process. After all, it can help find those things you do not yet know you are looking for.

The Ides of March are upon us. And with the Ides comes one of my least favorite things: Daylight Saving Time. I'm one of those "UTC forever" fanboys, because I've suffered having to work with systems that fail to consider how to track, or properly convert, datetimes. On the other hand, I do recognize the trouble with trying to convert everyone to UTC. The link at the end is a nice thought experiment for anyone who has to work with datetimes.
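
For anyone who hasn't been bitten by this yet, here is a minimal Python sketch of the kind of trouble I mean. It assumes US Eastern time around the autumn 2017 clock change and uses two hand-written fixed offsets (`EDT` and `EST`, my own names) rather than a real time zone database. The `to_utc` helper is also just an illustration, not anything from a particular system.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for US Eastern time (an assumption for this sketch).
# Real code would normally take these from a time zone database.
EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")

# The wall-clock reading 01:30 on 5 November 2017 happens twice:
# once before the clocks go back, once after.
first = datetime(2017, 11, 5, 1, 30, tzinfo=EDT)
second = datetime(2017, 11, 5, 1, 30, tzinfo=EST)


def to_utc(moment: datetime) -> datetime:
    """Normalize an offset-aware datetime to UTC before storing it."""
    return moment.astimezone(timezone.utc)


first_utc = to_utc(first)
second_utc = to_utc(second)

# The local readings look identical, but the UTC values are distinct.
same_wall_clock = first.replace(tzinfo=None) == second.replace(tzinfo=None)
gap = second_utc - first_utc

print(same_wall_clock, first_utc.isoformat(), second_utc.isoformat(), gap)
```

A system that stores only the local reading cannot tell those two events apart afterward, which is why I keep asking for UTC at the storage layer and conversion only at display time.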


As always, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!


Spammergate: The Fall of an Empire

One dozen people, 1.4 billion emails a day. Nice summary of how a handful of groups came together to take this spam empire offline.


The WikiLeaks CIA hacking documents include spy tools literally from sci-fi

Most of the articles about the CIA spy hacks last week were like this one: clickbait. The details in the WikiLeaks documents appear to be quite dated, leading Apple, Google, and Microsoft to all declare that the vulnerabilities were patched long ago.


Artificial intelligence: Cooperation vs. aggression

A nice reminder that computers will do what we tell them to do. If we tell them to shoot people, we shouldn't be surprised when they shoot people.


DevOps has reached critical mass, CIOs need to get on board

I dislike the marketing term DevOps, but I *love* how it helps describe a modern software development lifecycle. It's hard to believe that there is any CIO out there who isn't subscribing to such methods.


Serverless is the new Multitenancy

A quick summary of the future of SaaS. Server-less architecture is going to allow cloud providers the ability to scale further than they do now. I suspect that for the end-user we won't have to worry about creating such functions, we will just click buttons and the plumbing will be handled for us.


30 Questions to Ask a Serverless Fanboy (or Fangirl)

Because if server-less gets brought into a discussion you are having, you should be prepared to ask a few basic questions.


So You Want Continuous Time Zones

Someone had a bit of time on their hands, pun intended. This thought experiment was worth the time, and has a wonderful conclusion: "The sad, ultimate truth of modern timekeeping is this: it's not perfect, but it doesn't honestly get a whole lot better."


With the Perfstack™ launch this week, I am reminded how much I love working here at SolarWinds, and this image best describes why:


One of the common challenges in troubleshooting performance issues is that the multiple dimensions belong to different teams. Co-ordinating the troubleshooting across the teams can bring its own challenges. I really like the feature of PerfStack where the dashboard URL contains all of the information required to recreate the dashboard. The net result is that I can paste that one URL into my help desk ticket to include the evidence to hand off an issue to another team. Equally, when another team sends me a ticket it can already have a dashboard to jump start my troubleshooting.


I’ve seen help desk tickets bounce around from team to team within large organizations. As any network engineer will tell you, the network is always blamed first. To prove the issue isn’t the network, you put together some graphs showing that all the latency is in a VM. Then you paste a screenshot of the graphs into the help desk system and reassign the ticket to the virtualization team. Shortly afterward, the virtualization team says they are unable to see the issue and asks whether you can provide more details. This poor handoff between departments slows the whole process. The handoff makes it difficult to resolve the problem for the application end-users. It also makes every team feel like the other teams are idiots because they cannot see the obvious problems.


With PerfStack, you are able to hand the virtualization team a live graph showing the performance issue as being a VM problem. The virtualization team can take that URL and make changes to the dashboard. They might add VM-specific counters and also information from inside the operating system. The VM team may identify that the issue is happening within SQL Server. They hand it off to the DBAs, with the URL for an updated dashboard. The DBAs rebuild the indices (or something) and all the performance problems go away. The important thing is that the handoff between teams has far more actionable information. Each team can take the information from the previous team and adapt it to their own view of the world. The context of each team's information remains through the URLs in the ticket. This encapsulation into a URL was one of my favorite little features of the PerfStack demonstration.
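
To make the idea concrete, here is a rough Python sketch of how dashboard state can live entirely inside a URL. This is not PerfStack's actual URL format, which I haven't inspected. The base address, the parameter names (`metrics`, `start`, `end`), and the metric names are all made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base address and parameter names, for illustration only.
BASE_URL = "https://orion.example.com/perfstack"


def build_dashboard_url(metrics, start, end):
    """Pack the chosen metrics and time window into a shareable URL."""
    query = urlencode({"metrics": ",".join(metrics), "start": start, "end": end})
    return f"{BASE_URL}?{query}"


def read_dashboard_url(url):
    """Recover the metrics and time window from a shared URL."""
    params = parse_qs(urlsplit(url).query)
    return {
        "metrics": params["metrics"][0].split(","),
        "start": params["start"][0],
        "end": params["end"][0],
    }


# The network team builds a view and pastes the URL into the ticket.
network_view = build_dashboard_url(
    ["switch01.latency", "vm42.cpu_ready"],
    "2017-03-14T09:00Z",
    "2017-03-14T10:00Z",
)

# The virtualization team opens it, keeps the context, and adds a counter.
state = read_dashboard_url(network_view)
vm_view = build_dashboard_url(
    state["metrics"] + ["vm42.disk_queue"], state["start"], state["end"]
)
print(vm_view)
```

Because nothing is stored server-side in this sketch, each URL in the ticket history is a complete snapshot of what that team was looking at.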


One thing to keep in mind is that collaborative troubleshooting is more productive than playing help desk ticket ping pong. It definitely helps the process to have experts across the disciplines working together in real-time. It helps both with resolving the problem at hand and with future problems. Often each team can learn a little of the other team’s specialization to better understand the overall environment. Another under-appreciated aspect is that it helps people to understand that the other teams are not complete idiots, that each specialization has its own issues and complexity.

By Joe Kim, SolarWinds Chief Technology Officer


Analysts and other industry experts have defined application performance monitoring as software that incorporates analytics, end-user experience monitoring, and a few other components, but this is just a small and basic part of the APM story. True APM is rich and nuanced, incorporating different approaches and tools for one common goal: keeping applications healthy and running smoothly.


Two Approaches to APM


How you use APM will depend on your agency’s environment. For example, you may prefer an APM approach that allows you to go inside underperforming applications and make changes directly to the code. In other cases, you may simply need to assess the overall viability of applications to help ensure their continued functionality. There are two very different methodologies that address both of these needs.


To solve a slow application problem, you may wish to dig down into the code itself to discover how long it takes for each portion of that code to process a transaction. From this, you’ll be able to determine, in aggregate, the total amount of transaction processing time for that application.


For this, you can use application bytecode instrumentation monitoring (ABIM) and Distributed Tracing. ABIM allows you to insert instrumentation into specific parts of the code. Monitoring processing times gives you information to accurately pinpoint where the problem exists and rectify the issue. For more complex application infrastructures that are distributed in nature, you can use Distributed Tracing to tag and track processes that go across multiple stacks and platforms. It’s a very specific and focused approach to APM, almost akin to a surgeon extracting a tumor.
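
As a rough illustration of the principle, the Python sketch below times individual portions of a transaction and adds them up. It is a hand-written, source-level stand-in: real ABIM products typically inject their probes automatically at the bytecode level. The `timed` decorator, the `timings` table, and the `validate` and `save` steps are hypothetical names invented for this example.

```python
import time
from collections import defaultdict
from functools import wraps

# Accumulated seconds per instrumented section, keyed by a label we choose.
timings = defaultdict(float)


def timed(label):
    """Record how long the wrapped function runs under the given label."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                timings[label] += time.perf_counter() - started
        return wrapper
    return decorator


# Hypothetical steps of one transaction; the sleeps stand in for real work.
@timed("validate")
def validate(order):
    time.sleep(0.01)
    return order


@timed("save")
def save(order):
    time.sleep(0.03)
    return True


def process(order):
    return save(validate(order))


process({"id": 1})
total = sum(timings.values())          # aggregate transaction time
slowest = max(timings, key=timings.get)  # the portion to look at first
print(f"total {total:.3f}s, slowest section: {slowest}")
```

The per-section numbers point to where the time goes, and their sum gives the aggregate processing time for the transaction.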


Another, more general – though no less effective – approach is application interface performance management (AIPM). If ABIM is the surgeon’s tool, AIPM is something that a general practitioner might use.


AIPM allows you to monitor response times, wait times, and queue length, and provides near real-time visibility into application performance. You can receive instant alerts and detailed analytics regarding the root cause of problems. Once issues are identified, you can respond to them quickly and help your agency avoid unnecessary and costly application downtime.
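
A minimal Python sketch of that kind of check might look like the following. The thresholds, the function name `check_samples`, and the sample numbers are all assumptions chosen for illustration. A real monitoring product would supply its own polling, baselines, and alert delivery.

```python
from statistics import mean

# Hypothetical thresholds; sensible values depend on the application.
MAX_AVG_RESPONSE_MS = 500
MAX_QUEUE_LENGTH = 20


def check_samples(response_times_ms, queue_lengths):
    """Return a list of alert messages for one polling window."""
    alerts = []
    avg_response = mean(response_times_ms)
    if avg_response > MAX_AVG_RESPONSE_MS:
        alerts.append(
            f"average response time {avg_response:.0f} ms "
            f"exceeds {MAX_AVG_RESPONSE_MS} ms"
        )
    peak_queue = max(queue_lengths)
    if peak_queue > MAX_QUEUE_LENGTH:
        alerts.append(
            f"queue length peaked at {peak_queue}, above {MAX_QUEUE_LENGTH}"
        )
    return alerts


# Made-up samples for a healthy window and a degraded one.
healthy = check_samples([120, 180, 150], [2, 3, 1])
degraded = check_samples([480, 650, 700], [5, 25, 30])
print(healthy)
print(degraded)
```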


Tools and Their Features


There are a number of different monitoring solutions on the market, and it can be hard to determine which technologies will best fit your agency’s unique needs. Most of them will do the basics – alerts, performance metrics, etc. – but there are certain specialized features you’ll also want to look out for:


Insight into all of your applications. Applications are the lifeblood of an agency, and you’ll need solutions that provide you with insight into all of them, preferably from a single dashboard or control point.


A glimpse into the health of your hardware. Hardware failure can cause application performance issues. You’ll need to be able to monitor server hardware components and track things like high CPU load and other issues to gain insight into how they may be impacting application performance.


The ability to customize for different types of applications. Different types of applications (for example, custom or home-grown apps) may have various monitoring requirements, so you’ll need tools that are adaptable depending on the applications in your stack.


As you can see, APM is far more intricate than some may have you believe, and that’s a good thing. You have far more resources at your fingertips than you may have thought. With the right combination of approaches and tools, you’ll be able to tackle even the trickiest application performance issues.


Find the full article on our partner DLT’s blog, Technically Speaking.

Most of the time, IT pros gain troubleshooting experience via operational pains. In other words, something bad happens and we, as IT professionals, have to clean it up. Therefore, it is important for you to have a troubleshooting protocol in place that is specific to dependent services, applications, and a given environment. Within those parameters, the basic troubleshooting flow should look like this:


      1. Define the problem.
      2. Gather and analyze relevant information.
      3. Construct a hypothesis on the probable cause for the failure or incident.
      4. Devise a plan to resolve the problem based on that hypothesis.
      5. Implement the plan.
      6. Observe the results of the implementation.
      7. Repeat steps 2-6 until the problem is resolved.
      8. Document the solution.
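
The flow above can be sketched as a simple loop in Python. Everything here is a hypothetical stand-in: the `troubleshoot` function, the four callables it takes, and the toy "phones are down" scenario exist only to show how steps 2-6 repeat until the observation in step 6 comes back clean.

```python
def troubleshoot(problem, gather, propose, apply_fix, is_resolved, max_rounds=5):
    """Walk the eight-step flow and return the notes taken along the way."""
    notes = [f"Problem: {problem}"]                      # step 1: define
    for round_number in range(1, max_rounds + 1):
        evidence = gather()                              # step 2: gather and analyze
        cause = propose(evidence)                        # step 3: hypothesis
        if cause is None:
            notes.append("No hypotheses left; escalate.")
            break
        notes.append(f"Round {round_number}: suspect {cause}")
        apply_fix(cause)                                 # steps 4-5: plan and implement
        if is_resolved():                                # step 6: observe
            notes.append(f"Resolved by fixing {cause}")  # step 8: document
            break
        # step 7: not resolved, so go around again from step 2
    return notes


# Toy scenario: the real cause is the DHCP scope, not the switch port.
state = {"fixed": set()}
suspects = ["switch port", "DHCP scope"]


def gather():
    return [s for s in suspects if s not in state["fixed"]]


def propose(evidence):
    return evidence[0] if evidence else None


def apply_fix(cause):
    state["fixed"].add(cause)


def is_resolved():
    return "DHCP scope" in state["fixed"]


notes = troubleshoot("phones are down", gather, propose, apply_fix, is_resolved)
print("\n".join(notes))
```

The `max_rounds` cap is a design choice of this sketch: it forces an escalation instead of letting hasty fixes loop forever.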


Steps 1 and 2 usually lead to a world of pain. First of all, you have to define the troubleshooting radius, the surface area of systems in the stack that you have to analyze to find the cause of the issue. Then, you must narrow that scope as quickly as possible to remediate the issue. Unfortunately, remediating in haste may not actually lead to uncovering the actual root cause of the issue. And if it doesn’t, you are going to wind up back at square one.


You want to get to the single point of truth with respect to the root cause as quickly as possible. To do so, it is helpful to combine a troubleshooting workflow with insights gleaned from tools that allow you to focus on a granular level. For example, start with the construct that touches everything, the network, since it connects all the subsystems. In other words, blame the network. Next, factor in the application stack metrics to further shrink the troubleshooting area. This includes infrastructure services, storage, virtualization, cloud service providers, web, etc. Finally, leverage a collaboration of time-series data and subject matter expertise to reduce the troubleshooting radius to zero and root cause the issue.


If you think of the troubleshooting area as a circle, as the troubleshooting radius approaches zero, one gets closer to the root cause of the issue. If the radius is exactly zero, you’ll be left with a single point. And that point should be the single point of truth about the root cause of the incident.


Share examples of your troubleshooting experiences across stacks in the comments below.

Late last month, shockwaves were sent through the SAP customer base as a UK court ruled in favor of SAP and against the mega spirits supplier Diageo in an indirect licensing case. The court determined that Diageo was violating SAP’s licensing T&Cs when they were connecting a 3rd-party app to their SAP ERP for a myriad of business process life cycles. In their claim, SAP is asking for £60m in unpaid fees. Yes, £60m! Pending appeal, the court will make a decision on the actual amount to be paid within the month. As a fellow SAP customer, my company is now in a hurry to audit all the systems that are connecting to our SAP ERP to verify compliance, regardless of the fact that we conduct a license “True Up” with SAP every year.


This case reminds me of a licensing change that Microsoft made for SQL Server back in 2011, aka “The Money Grab.” Microsoft decided to change enterprise agreement licensing in late 2011 for SQL Server from per-processor to per-core. This left many companies, mine included, scrambling to reduce, consolidate, or eliminate SQL servers ahead of their enterprise agreement renewal with Microsoft, usually with only a couple of months’ notice.
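
To see why the change caused such a scramble, here is a small Python sketch of the license counts under each model. The server specification is hypothetical, and the four-core minimum per processor is my understanding of the rule at the time, so treat it as an assumption and check the vendor's licensing guide. No prices are included, so this compares counts only, not cost.

```python
def per_processor_licenses(sockets):
    """Old model: one license per physical processor, regardless of cores."""
    return sockets


def per_core_licenses(sockets, cores_per_socket, minimum_per_socket=4):
    """Core model: one license per core, with an assumed per-processor minimum."""
    return sockets * max(cores_per_socket, minimum_per_socket)


# Hypothetical server: two sockets with eight cores each.
old_count = per_processor_licenses(2)
new_count = per_core_licenses(2, 8)

# Hypothetical older server: two sockets with two cores each.
small_count = per_core_licenses(2, 2)

print(old_count, new_count, small_count)
```

The license count now tracks core density rather than socket count, which is what made consolidating onto fewer, bigger servers a much less obvious win.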


A common, and humorous, comparison that I often come across is that Lincoln’s historic Gettysburg Address clocks in at a shade over two minutes, yet the standard EULA for any software these days is more than three pages. Who has the time or patience to read that? Now ask yourself, how many software packages and applications do you have running across your enterprise? Do you, or someone else at your company, know the terms and conditions of the licensing for these software packages? Better yet, are they being regularly audited for compliance and/or usage reviewed to minimize spend? Don’t fear. There are many firms out there ready to provide their services when it comes to software license audits, but for a hefty sum.


It's difficult to predict the next “Money Grab” and who it will come from. I predict that as more companies go all in with the cloud, it will come from there. Think about it: IaaS equals cheap space and cheap processing for hungry consumers.


How do you react when it is too late and the vendor is knocking on your door? How do you remain proactive, stay organized, and prevent sprawl? Do you have all your T&Cs on file?

Well, here we are in our final post in the series. We’ve discussed several topics related to entering the network security job force. And with today’s market there’s more potential than ever to secure a job as an entry-level security analyst. The question we will address in this post is this: “How do I make the transition into a cybersecurity role, and then where do I go?”


Securing a job

First, you’ll need to polish up your resume if you plan on targeting a cybersecurity role. You’ll want to include your training and certifications, but what about experience? You could gain some experience by participating in open hackathons, which will allow you to demonstrate some security skills. Aside from that, you could volunteer or intern part time to gain some valuable experience. I have a friend who requested to be the network liaison for any security projects his company had. Being on the team that deployed FirePOWER helped him immensely.


Job boards

Once your resume is polished, you’ll want to head to the job boards. Today, I find that LinkedIn provides a pretty active environment filled with recruiters who scour the vast pool of online profiles. If you’re looking for some temp-to-hire work, this might be a good place to begin. Aside from that, the standard job sites exist, but more often than not it’s best to know someone who’s already in a role and can help you out. Have you found success using LinkedIn? If so, I’d love to hear your comments about the process, as well as any recommendations. Share them in the comments.


Let’s keep this a secret

I was talking to a colleague some time back about a new position he took with the federal government. He was already on a networking team that managed an unclassified network, and his day-to-day was pretty mundane. After his transition into a security team, he was having a hard time with the secrecy about his work. It wasn’t so much that he couldn’t talk about anything, it was more that he had to be very careful about what he said. Assume he’s out having drinks with some co-workers. In casual conversation he mentions that he is dealing with a widespread breach inside the government network that has caused certain data to be leaked. Unknown to him, there is a guy next to him at the bar who works for the press. The next morning there’s a front page story about data loss at the Pentagon. You see how bad this could be, right? In actuality, he doesn’t work at the Pentagon, the data that was leaked was unclassified reports about tidal flows, and the government agency he works for is NOAA. I should mention here that this scenario is completely fabricated to simply make my point. When you transition into a security role, you’re going to have to learn to keep a tight lip on what you’re doing, more so than when you worked on the network team.



I’ll keep this section brief. Are there politics to play in the cybersecurity job force? Yep. But I don’t play them, or even attempt to comment on them. Just do your job to the best of your ability.



You’ll need to beef up your education a bit more than before if you’re transitioning from a networking role. The world of security changes more rapidly, and threats morph and take on new forms much more aggressively than ever before. InfoSec World is a trade show that you may be interested in following. There are others you may want to attend at least once a year, for the purpose of networking with peers and receiving updates on the latest threats, and products that can help mitigate them. You may not have much of a say in your organization's purchasing decisions, but if you can add intelligent dialogue to those conversations, you are much more valuable as an employee.


Where to go from there?

From there, I’d recommend working your way up through the ranks. Decide what niche you want to focus on and become a specialist in that area. Keep current in your certifications in the event you need to look to another organization for employment. It’s good to be a loyal employee, but your loyalty should be first and foremost to you and your family. If you are being taken advantage of in your current position, quietly find work elsewhere and do it the right way. Give your notice and don’t burn bridges. This world is small and odds are you may cross paths with former supervisors in the future.


There’s so much to do in the world of cybersecurity. Really, the sky’s the limit. If you’re on the verge of a transition to a security role, I wish you the best and urge you to keep on learning. Maybe you can even give back some of what you glean from the community by contributing yourself.



I've worked in IT for a long time (I stopped counting at twenty years.  Quite a while ago.)  This experience means that I generally do well in troubleshooting in data-related areas.  In other areas, like networking, I'm pretty much done at "do I have an IP address" and "is it plugged in?"


This is why team collaboration on IT issues, as I posted before, is so important.


What Can Go Wrong?


One of the things I've noticed is that while people can be experts in deploying solutions, this doesn't mean they are great at diagnosing issues. You've worked with that guy.  He's great at getting things installed and working.  But when things go wrong, he just starts pulling out cables and grumbling about other people's incompetence.  He keeps making changes and does several at the same time.  He's a nightmare.  And when you try to step in to help him get back on a path, he starts laying blame before he starts diagnosing the issue. You don't have to be that guy, though, to have challenges in troubleshooting.


Some of the effects that can contribute to troubleshooting challenges:


Availability Heuristic


If you have recently solved a series of NIC issues, the next time someone reports slow response times, you're naturally going to first consider a NIC issue.  And many times, this will work out just fine.  But if it constrains your thinking, you may be slow to get to the actual cause.  The best way to fight this cognitive issue is to gather data first, then assess the situation based on your entire troubleshooting experience.


Confirmation Bias


Confirmation Bias goes hand in hand with the availability heuristic. Once you have narrowed down what you think is causing the slow response times, your brain will want you to go look for evidence that the problem is indeed the network cards.   The best way to fight this is to recognize when you are looking for proof instead of looking for data.  Another way to overcome confirmation bias is to collaborate with others on what they are seeing.  While groupthink can be an issue, it's less likely for a group to share the same confirmation bias equally.


Anchoring Heuristic


So to get here, you have limited your guesses to recent issues, you have searched out data to prove the correctness of your diagnosis and now you are anchored there.  You want to believe.  You may start rejecting and ignoring data that contradicts your assumptions. In a team environment, this can be one of the most frustrating group troubleshooting challenges. You definitely don't want to be that gal.  The one who won't look at all the data. Trust me on this.




I use intuition a lot when I diagnose issues.  It's a good thing, in general.  Intuition helps professionals take a huge amount of data and narrow it down to a manageable set of causes. It's usually based on having dealt with similar issues hundreds or thousands of times over the course of your career.  But intuition without follow-up data analysis can be a huge issue.  This often happens due to ego or lack of experience.  The Dunning-Kruger effect (not knowing what you don't know) can also be a factor here.


There are other challenges in diagnosing causes and effects of IT issues. I highly recommend reading up on them so you can spot these behaviours in others and yourself.


Improving Troubleshooting Skills


  1. Be Aware.
    The first thing you can do to improve the speed and accuracy of your troubleshooting is to recognize these behaviours when you are doing them.  Being self-aware, especially when you are under pressure to bring systems back online or have a boss pacing behind your desk asking "when will this be fixed?" will help you focus on the right things.  In a truly collaborative, high trust environment, team members can help others check whether they are having challenges in diagnosing based on the biases above.
  2. Get feedback.
    We are generally lucky in IT that we, unlike other professions, can almost always immediately see the impact of our fixes to see if they actually fixed the problem.  We have tools that report metrics and users who will let us know if we were wrong.  But even so, post-event analyses that document what we got right and what we got wrong can help us improve our methods.
  3. Practice.
    Yes, every day we troubleshoot issues.  That counts as practice.  But we don't always test ourselves like other professions do.  Disaster Recovery exercises are a great way to do this, but I've always thought we needed troubleshooting code camps/hackathons to help us hone our skills. 
  4. Bring Data.
    Data is imperative to punching through the cognitive challenges listed above.  Imagine diagnosing a data-center wide outage and having to start by polling each resource to see how it's doing.  We must have data for both intuitive and analytical responses.
  5. Analyze.
    I love my data.  But it's only an input into a diagnostic process.  Considering metrics in a holistic, cross-platform, cross-team view is the next step.  A shared analysis platform for combining and overlaying data to get to the real answers makes all this smoother and faster.
  6. Log What Happened. 
    This sounds like a lot of overhead when you are under pressure (is your boss still there?), but keeping a quick list of what was done, what your thought process was, and what others did can be an important part of professional practice.  Teams can even share the load of writing stuff down.  This sort of knowledgebase is also important for when you run into the rare things that have a simple solution but you can't remember exactly what to do (or even what not to do).

A person with experience can be an experienced non-expert. But with data, analysis and awareness of our biases and challenges in troubleshooting, we can get problems solved faster and with better accuracy. The future of IT troubleshooting will be based more and more on analytical approaches.


Do you have other tips for improving your troubleshooting and diagnostic skills?  Do you think we should get formal training in troubleshooting?

One of the hot topics in the software engineering world right now is the idea of what is being called the “full stack developer.” In this ecosystem, a full stack developer is someone who, broadly speaking, is able to write applications that encompass the front-end interface, the core logic, and the database backend. In other words, where it is common for engineers to specialize in one area or another, becoming very proficient in one or two languages, a full stack engineer is proficient in many languages, systems, and modalities so that they can, given the time, implement an entire non-trivial application from soup to nuts.


In the network engineering world, this might be analogous to an engineer who has worked in many roles over a lifetime, and has developed skills across the board in storage, virtualization, compute, and networking. Such an engineer has likely worked in mid-level organizations where siloes are not as prevalent or necessary as in larger organizations. Many times these are the engineers who become senior level architects in organizations, or eventually move into some type of consulting role, working for many organizations as a strategist. This is what I’ll call the full stack IT engineer.


While the skills and background needed to get to this place in your career likely put you into the upper echelon of your cohort, there can be some pitfalls. The first of which is the risk of ending up with a skillset that is very broad, but not very deep. Being able to talk about a very wide scope and scale of the IT industry, based on honest, on the ground experience is great, but it also becomes difficult to maintain that deep level of skill in every area of IT. In the IT industry, however, I do not see this as a weakness per se. If you’ve gotten to this level of skill and experience, you are hopefully in a more consultative role, and aren’t being called on to put hands on keyboard daily. The value you bring at this level is that of experience, and the ability to see the whole chess board without getting bogged down in any one piece.


The other pitfall along the road to becoming a full stack engineer is the often overlooked aspect of training, whether on the job or on your own. If you are not absolutely dedicated to your craft, you will never, quite frankly, get to this level in your career. You’re going to be doing a daily job, ostensibly focusing on less than a broad spectrum of technologies. While you may move into those areas later, how do you learn today? And when you do move into other technologies, how do you keep the skills of today fresh for tomorrow? Honestly, the only way you’ll get there is to study, study, study pretty much all of the time. You have to become a full time student, and develop a true passion for learning. Basically, the whole stack has to become your focus, and if you’re only seeing part of it at work, you have to find the rest at home.


What does all of this have to do with SolarWinds and PerfStack? Simple: troubleshooting using a, wait for it, full stack solution is going to expose you to how the other half lives. Since PerfStack allows, and encourages, dashboards (PerfStacks) to be passed along as troubleshooting ensues, you should have some great visibility into the challenges and remedies that other teams see. If you’re a virtualization engineer and get handed a problem to work, presumably the network team, datacenter team, facilities, and probably storage have all had a hand in ascertaining where the problem does or does not lie. Pay attention to that detail, study it, ask questions as you get the opportunity. Make time to ask what certain metrics mean, and why one metric is more important than another. Ask the voice guys whether jitter or latency is worse for their systems, or the storage guys why IOPS matter. Ask the VM team why virtual machine mobility needs 10ms or less (generally) in link latency, or why they want stretched layer-2 between data centers.


It may seem banal to equate the full stack IT engineer with a troubleshooting product (even as great as PerfStack is) but the reality is that you have to take advantage of everything that is put in front of you if you want to advance your career. You’re going to be using these tools on a regular basis anyhow, why not take advantage of what you have? Sure, learn the tool for what it’s designed for, and learn your job to the best of your ability, but also look for opportunities like these to advance your career and become more valuable to both your team and the next team you’re on, whether at your current company or a new one down the road.

It was a busy week for service disruptions and security breaches. We had Amazon S3 showing us that, yes, the cloud can go offline at times. And we found out that our teddy bears may be staging an uprising. And we also found that Uber has decided to use technology and data to continue operating illegally in cities and towns worldwide. Not a good week for those of us that enjoy having data safe, secure, and available.


So, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!


Data from connected CloudPets teddy bears leaked and ransomed, exposing kids' voice messages

The ignorance (or hubris) of the CloudPets CEO is on full display here. I am somewhat surprised that anyone could be this naive with regard to security issues these days.


Yahoo CEO Loses Bonus Over Security Lapses

Speaking of security breaches, Yahoo is in the news again. You might think that losing $2 million USD would sting a bit, but considering the $40 million she gets for running Yahoo into the ground I think she will be okay for the next few years, even with living in the Valley.


Hackers Drawn To Energy Sector's Lack Of Sensors, Controls

I'd like to think that someone, somewhere in our government, is actively working to keep our grid safe. Otherwise, it won't be long before we start to see blackouts as a result of some bored teenager on a lonely summer night.


Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

Thanks to a typo, the Amazon S3 service was brought to a halt for a few hours last week. In the biggest piece of irony, the popular website Is It Down Right Now? Website Down or Not? was, itself, down as a result. There's a lot to digest with this outage, and it deserves its own post at some point.


How Uber Deceives the Authorities Worldwide

I didn't wake up looking for more reasons to dislike how Uber is currently run as a business, but it seems that each week they reach a new low.


Thirteen thousand, four hundred, fifty-five minutes of talking to get one job

A bit long, but worth the read as it helps expose the job hiring process and all the flaws in the current system used by almost every company. I've written about bad job postings before, as well as how interviews should not be a trivia contest, so I enjoyed how this post took a deeper look.


If the Moon Were Only 1 Pixel - A tediously accurate map of the solar system

Because I love things like this and I think you should, too.


Just a reminder that the cloud can, and does, go offline from time to time:




Last week Amazon Web Services S3 storage in the East region went offline for a few hours. Since then, AWS has published a summary review of what happened. I applaud AWS for their transparency, and I know that they will use this incident as a learning lesson to make things better going forward. Take a few minutes to read the review and then come back here. I'll wait.


Okay, so it's been a few days since the outage. We've all had some time to reflect on what happened. And, some of us, have decided that now is the time to put on our Hindsight Glasses and run down a list of lingering questions and comments regarding the outage.


Let's break this down!


"...we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."

This, to me, is the most inexcusable part of the outage. Anyone that does business continuity planning will tell you that annual checks are needed on such play books. You cannot just wave that away with, "Hey, we've grown a lot in the past four years and so the play book is out of date." Nope. Not acceptable.


"The servers that were inadvertently removed supported two other S3 subsystems."

The engineers were working on a billing system, and they had no idea that those billing servers would impact a couple of key S3 servers. Which brings about the question, "Why are those systems related?" Great question! This reminds me of the age-old debate regarding dedicated versus shared application servers. Shared servers sound great until one person needs a reboot, right? No wonder everyone is clamoring for containers these days. Another few years and mainframes will be under our desks.


"Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended."

But the command was accepted as valid input, which means the code doesn't have any check to make certain that the command was indeed valid. This is the EXACT scenario that resulted in Jeffrey Snover adding the -WhatIf and -Confirm parameters into PowerShell. I'm a coding hack, and even I know the value in sanitizing your inputs. This isn't just something to prevent SQL injection. It's also to make certain that as a cloud provider you don't delete a large number, or percentage, of servers by accident.
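To make the point concrete, here is a minimal Python sketch of the kind of guard I mean. This is not how the AWS tooling works. The function name, the server names, the 10% limit, and the dry_run flag (in the spirit of -WhatIf) are all made up for illustration.

```python
def remove_servers(fleet, requested, max_fraction=0.10, dry_run=True):
    """Return the fleet after removing `requested`, with sanity checks.

    Hypothetical helper: max_fraction and dry_run are illustrative
    assumptions, not part of any real tool.
    """
    requested = set(requested)

    # Reject names that are not in the fleet (catches typos).
    unknown = requested - set(fleet)
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    # Refuse to remove more than a fixed fraction of the fleet at once.
    limit = int(len(fleet) * max_fraction)
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} servers; limit is {limit}"
        )

    if dry_run:
        # "What if" mode: report the plan, change nothing.
        print(f"WHAT IF: would remove {sorted(requested)}")
        return list(fleet)

    return [server for server in fleet if server not in requested]


fleet = [f"server-{n:02d}" for n in range(20)]
remove_servers(fleet, ["server-01"])                      # dry run only
smaller = remove_servers(fleet, ["server-01"], dry_run=False)
```

Making the dry run the default means the destructive path has to be asked for explicitly, and the fraction limit catches the fat-fingered input before it does any damage.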


"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly."

So, they don't ever ask themselves, "What if?" along with the question, "Why?" These are my favorite questions to ask when designing/building/modifying systems. The 5-Whys is a great tool to find the root cause, and the use of "what if" helps you build better systems that help avoid the need for root cause reviews.


"We will also make changes to improve the recovery time of key S3 subsystems."

Why wasn't this a thing already? I cannot understand how AWS would get to the point that it would not have high availability already built into their systems. My only guess here is that building such systems costs more, and AWS isn't interested in things costing more. In the race to the bottom, corners are cut, and you get an outage every now and then.


"...we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3."

The AWS dashboard for the East Region was dependent upon the East Region being online. Just let that sink in for a bit. Hey, AWS, let me know if you need help with monitoring and alerting. We'd be happy to help you get the job done.


"Other AWS services in the US-EAST-1 Region that rely on S3 for storage...were also impacted while the S3 APIs were unavailable."

Many companies that rely on AWS to be up and running were offline. My favorite example is the popular website Is It Down Right Now?, which was itself down as a result of the outage. If you migrate your apps to the cloud, you need to take responsibility for availability. Otherwise, you run the risk of being down with no way to get back up.


Look, things happen. Stuff breaks all the time. The reason this was such a major event is because AWS has done amazing work in becoming the largest cloud provider on the planet. I'm not here to bury AWS, I'm here to highlight the key points and takeaways from the incident to help you make things better in your shop. Because if AWS, with all of its brainpower and resources, can still have these flaws, chances are your shop might have a few, too. 

I have been talking about the complexity of resolving performance issues in modern data centers. I’ve particularly been talking about how it is a multi-dimensional problem. Also, that virtualization significantly increases the number of dimensions for performance troubleshooting. My report of having been forced to use Excel to coordinate brought some interesting responses. It is, indeed, a very poor tool for consolidating performance data.


I have also written in other places about management tools that are focused on the data they collect, rather than helping to resolve issues. What I really like about PerfStack is the ability to use the vast amount of data in the various SolarWinds tools to identify the source of performance problems.


The central idea in PerfStack is to gain insights across all of the data that is gathered by various SolarWinds products. Importantly, PerfStack allows the creation of ad hoc collections of performance data. Performance graphs for multiple objects and multiple resource types can be stacked together to identify correlation. My favorite part was adding multiple different performance counters from the different layers of infrastructure to a single screen. This is where I had the Excel flashback, only here the consolidation is done programmatically. No need for me to make sure the time series match up. I loved that the performance graphs were re-drawing in real time as new counters were added. Even better was that the re-draw was fast enough that counters could be added on the off chance that they were relevant. When they are not relevant, they can simply be removed.  The hours I wasted building Excel graphs translate into minutes of building a PerfStack workspace.
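For anyone who has not suffered through the Excel version, the tedious part is lining up samples taken at slightly different times. The Python sketch below shows the general idea of bucketing two series onto a common time grid. It says nothing about how PerfStack does this internally; the function name, the 60-second bucket, and the sample values are assumptions for illustration.

```python
def align(series_a, series_b, bucket_seconds=60):
    """Align two lists of (timestamp_seconds, value) on shared time buckets."""

    def bucketed(series):
        out = {}
        for ts, value in series:
            # Round each timestamp down to the start of its bucket.
            # If several samples land in one bucket, the last one wins.
            out[ts - ts % bucket_seconds] = value
        return out

    a, b = bucketed(series_a), bucketed(series_b)
    # Keep only the buckets where both series have a sample.
    return [(t, a[t], b[t]) for t in sorted(a.keys() & b.keys())]


cpu_percent = [(0, 20), (61, 50), (122, 90)]
latency_ms = [(5, 10), (65, 30), (185, 70)]
aligned = align(cpu_percent, latency_ms)
```

Once the rows share a timestamp, stacking or correlating them is the easy part.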


I have written elsewhere about systems management tools that get too caught up in the cool data they gather. These tools typically have fixed dashboards that give pretty overviews. They often cram as much data as possible into one screen. What I tend to find is that these tools are inflexible about the way the data is combined. The result is a dashboard that is good at showing that everything is, or is not, healthy but does not help a lot with resolving problems. The dynamic nature of the PerfStack workspace lends itself to getting insight out of the data, and helping identify the root cause of problems. Being able to quickly assemble the data on the load on a hypervisor and the VM operating system, as well as the application statistics, speeds troubleshooting. The ability to quickly add performance counters for the other application dependencies lets you pinpoint the cause of the issue quickly. It may be that the root cause is a domain controller that is overloading its CPU, while the symptom is a SharePoint server that is unresponsive.


PerfStack allows very rapid discovery of issue causes. The value of PerfStack will vastly increase as it is rolled out across the entire SolarWinds product suite.


You can see the demonstrations of PerfStack that I saw at Tech Field Day on Vimeo: NetPath here and SAM here.

As IT professionals, we have a wide variety of tools at our disposal for any given task. The same can be said for the attackers behind the increasing strength and number of DDoS attacks. The latest trend of hijacked IoT devices, like the Mirai botnet, deserves a lot of attention because of its prevalence and ability to scale, mostly due to a lack of security and basic protections. This is the fault of both manufacturers and consumers. However, DDoS attacks at scale are not really a new thing, because malware-infected zombie botnets have been around for a while. Some fairly old ones are still out there, and attackers don’t forget their favorites.


One of the largest attacks in 2016 came in October, and measured in at 517 Gbps. This attack was not a complex, application-layer hack, or a massive DNS reflection, but a massive attack from malware that has been around for more than two years, called Spike. Spike is commonly associated with x86 Linux-based devices (often routers with unpatched vulnerabilities), and is able to generate large amounts of application-layer HTTP traffic. While Mirai and other IoT botnets remained top sources of DDoS traffic in 2016, they were not alone.




The complexity of these attacks continues to evolve. What used to be simple volumetric flooding of UDP traffic has moved up the stack over time. Akamai reports that between Q4 2015 and Q4 2016 there was a 6% increase in infrastructure layer attacks (layer 3 & 4), and a 22% increase in reflection-based attacks. At the same time, while overall web application attacks decreased, there was a 33% increase in SQLi attacks.


The application layer attacks become increasingly difficult to mitigate due to their ability to mimic real user behavior. They are more difficult to identify, and often have larger payloads. They are often combined with other lower-level attacks for variety and larger attack surface. This requires vigilance on the part of those responsible for the infrastructure we rely on, to protect against all possible attack vectors.




Not surprising is the fact that China and the United States are the primary sources of DDoS attacks, with China dominating Q1, Q2, and Q3 of 2016. The United States “beat” China in Q4 spiking to 24% of global DDoS traffic for that quarter. The increase in the number of source IP addresses here is dramatic, with the U.S. numbers leaping from about 60K in Q3 to 180K in Q4. This is largely suspected to be due to a massive increase in IoT (Mirai) botnet sources. Black Friday sales, perhaps?


While attacks evolve, become larger and more complex, some simple tried-and-true methods of disrupting the internet can still be useful. Old tools can become new again. Reports from major threat centers consistently show that Conficker is still one of the most prevalent malware variants in the wild, and it has been around since 2008.


Malware is often modeled after real biological viruses, like the common cold, and they are not easily eliminated. A handful of infected machines can re-populate and re-infect thousands of others in short order, and this is what makes total elimination a near impossibility.


There is no vaccine for malware, but what about treating the symptoms?


A concerted effort is required to combat the looming and real threat these DDoS attacks pose. Manufacturers of infrastructure products, consumer IoT devices, mobile phones, service providers, enterprise IT organizations, and even the government are on the case. Each must actively do their part to reinforce against, protect from, and identify sources of malware to slow the pace of this growing problem.


The internet is not entirely broken, but it is vulnerable to the exponential scale of the DDoS threat.

By Joe Kim, SolarWinds Chief Technology Officer


It’s time to stop treating data as a commodity and create a secure and reliable data recovery plan by following a few core strategies.


1. Establish objectives


Establish a Recovery Point Objective (RPO) that determines how much data loss is acceptable. Understanding acceptable risk levels can help establish a baseline understanding of where DBAs should focus their recovery efforts.


Then, work on a Recovery Time Objective (RTO) that shows how long the agency can afford to be without its data.
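The arithmetic behind these two objectives is simple enough to sketch in Python. The figures and function names below are assumptions chosen for illustration, not recommendations: the worst-case data loss equals the gap between backups, and the worst-case downtime equals the time a restore takes.

```python
from datetime import timedelta


def meets_rpo(backup_interval, rpo):
    # Worst case: the failure hits just before the next backup runs,
    # so everything written since the last backup is lost.
    return backup_interval <= rpo


def meets_rto(restore_duration, rto):
    # The data is unavailable for at least as long as the restore takes.
    return restore_duration <= rto


rpo = timedelta(hours=1)   # assumed: at most one hour of data loss
rto = timedelta(hours=4)   # assumed: at most four hours without the data

nightly_only = meets_rpo(timedelta(hours=24), rpo)
with_log_backups = meets_rpo(timedelta(minutes=15), rpo)
restore_fits = meets_rto(timedelta(hours=3), rto)
```

With these example numbers, a nightly backup alone misses the one-hour RPO, while 15-minute log backups meet it.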


2. Understand differences between backups and snapshots


There’s a surprising amount of confusion about the differences between database backups, server tape backups, and snapshots. For instance, many people have a misperception that a storage area network (SAN) snapshot is a backup, when it’s really only a set of data reference markers. Remember that a true backup, either on- or off-site, is one in which data is securely stored in the event that it needs to be recovered.


3. Make sure those backups are working


Although many DBAs will undoubtedly insist that their backups are working, the only way to know for sure is to test the backups by doing a restore. This will provide assurance that backups are running — not failing — and highly available.
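As a rough illustration of what "test by doing a restore" means, the Python sketch below backs up a small SQLite database with the standard library's sqlite3 module, restores the copy into a scratch database, and checks it. The table name, file names, and helper functions are hypothetical, and a production database platform will have its own backup and restore tooling.

```python
import os
import sqlite3
import tempfile


def backup_database(source_path, backup_path):
    """Copy a live SQLite database to a backup file."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    src.backup(dst)
    dst.close()
    src.close()


def restore_and_verify(backup_path, table, expected_rows):
    """Restore the backup into a scratch database and sanity-check it."""
    backup = sqlite3.connect(backup_path)
    scratch = sqlite3.connect(":memory:")
    backup.backup(scratch)
    backup.close()

    intact = scratch.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    rows = scratch.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    scratch.close()
    return intact and rows == expected_rows


workdir = tempfile.mkdtemp()
live_path = os.path.join(workdir, "live.db")
backup_path = os.path.join(workdir, "backup.db")

conn = sqlite3.connect(live_path)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()
conn.close()

backup_database(live_path, backup_path)
restore_ok = restore_and_verify(backup_path, "orders", expected_rows=3)
```

The point is that the check runs against the restored copy, not against the backup job's exit status.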


4. Practice data encryption


DBAs can either encrypt the database backup file itself, or encrypt the entire database. That way, if someone takes a backup, they won’t be able to access the information without a key. DBAs must also ensure that if a device is lost or stolen, the data stored on the device remains inaccessible to users without proper keys.


5. Monitor and collect data


Combined with network performance monitoring and other analysis software, real-time monitoring and real-time data collection can improve performance, reduce outages, and maintain network and data availability.


Real-time collection of information can be used to do proper data forensics. This will make it easier to track down the cause of an intrusion, which can be detected through monitoring.


Monitoring, database analysis, and log and event management can help DBAs understand if something is failing. They’ll be able to identify potential threats through things like unusual queries or suspected anomalies. They can compare the queries to their historical information to gauge whether or not the requests represent potential intrusions.


6. Test, test, test


If you’re managing a large database, there’s simply not enough space or time to restore and test it every night. DBAs should test a random sampling taken from their databases. From this information, DBAs can gain confidence that they will be able to recover any database they administer, even if that database is in a large pool. If you’re interested in learning more, check out this post, which gets into further detail on database sampling.
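Picking the sample can be as simple as the Python sketch below. It draws a plain random sample using the standard library. The database names, the sample size of 10, and the function name are assumptions for illustration, and it does not attempt the statistical sizing that a fuller treatment of sampling would cover.

```python
import random


def pick_restore_tests(databases, sample_size, seed=None):
    """Choose which databases get a test restore in this cycle."""
    rng = random.Random(seed)
    # Never ask for more databases than exist.
    sample_size = min(sample_size, len(databases))
    return rng.sample(databases, sample_size)


databases = [f"db_{n:03d}" for n in range(200)]
tonight = pick_restore_tests(databases, sample_size=10)
```

Because the selection changes every run, each database in the pool is eventually exercised without restoring all of them every night.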


Data is quickly becoming a truly precious asset to government agencies, so it is critical to develop a sound data recovery plan.


Find the full article on our partner DLT’s blog, Technically Speaking.

I’ve long held the belief that for any task there are correct approaches and incorrect ones. When I was small, I remember being so impressed by the huge variety of parts my father had in his tool chest. Once, I watched him repair a television remote control, one that had shaped and tapered plastic buttons. The replacement from RCA/Zenith, I believe at the time, cost upwards of $150. He opened the broken device, determined that the problem was that the tongue on the existing button had broken, and rather than epoxy the old one back together, he carved and buffed an old bakelite knob into the proper shape, attached it in place of the original one, and ultimately, the final product looked and performed as if it were the original. It didn’t even look different than it had. This, to me, was the ultimate accomplishment. Almost as the Hippocratic Oath dictates, above all, do no harm. It was magic.


When all you have is a hammer, everything is a nail, right? But that sure is the wrong approach.


Today, my favorite outside work activity is building and maintaining guitars. When I began doing this, I didn’t own some critical tools. For example, an entire series of “Needle Files” and crown files are appropriate for the shaping and repair of frets on the neck. While not a very expensive purchase, all other tools would fail in the task at hand. The correct Allen wrench is necessary for fixing the torsion rod on the neck. And the ideal soldering iron is critical for proper wiring of pickups, potentiometers, and the jack. Of course, when sanding, a variety of grades are also necessary. Not to mention, a selection of paints, brushes, stains, and lacquers.


The same can be said of DevOps. Programming languages are designed for specific purposes, and there have been many advances in the past few years pointing to what a scripting task may require. Many might use Bash, batch, or PowerShell to do their tasks. Others may choose PHP or Ruby on Rails, while still others choose Python as their scripting tools. Today, it is my belief that no one tool can accommodate every action that's necessary to perform these tasks. There are nuances to each language, but one thing is certain: many tasks require the collaborative conversation between these tools. To accomplish these tasks, the ideal tools will likely call functions back and forth from other scripting languages. And while some bits of code are required here and there, currently it's the best way to approach the situation, given that many tools don't yet exist in packaged form. The DevOps engineer, then, needs to write and maintain these bits of code to help ensure that they are accurate each time they are called upon. 


As correctly stated in comments on my previous posting, I need to stress that there must be testing prior to utilizing these custom pieces of code to help ensure that other changes that may have taken place within the infrastructure are accounted for each time these scripts are set to task.


I recommend that anyone who is in DevOps get comfortable with these and other languages, and learn which do the job best so that DevOps engineers become more adept at facing challenges.


At some point, there will be automation tools, with slick GUI interfaces that may address many or even all of these needs when they arise. But for the moment,  I advise learning, utilizing, and customizing scripting tools. In the future, when these tools do become available, the question is, will they surmount the automation tools you’ve already created via your DevOps? I cannot predict.

As you spend more time in security, you start to understand that keeping up with the latest trends is not easy. Security is a moving target, and many organizations simply can’t keep up. Fortunately for us, Cisco releases an annual security report that can help us out in this regard. You can find this year's report, as well as past reports, here. In this post, I wanted to share a few highlights that illustrate why I believe security professionals should be aware of these reports.


Major findings

A nice feature of the Cisco 2017 Annual Cyber Security Report is the quick list of major findings. This year, Cisco notes that the three leading exploit kits -- Angler, Nuclear, and Neutrino --  are vanishing from the landscape. This is good to know, because we might be spending time and effort looking for these popular attacks while other lesser-known exploit kits start working their way into the network. And based on Cisco’s findings, most companies are using several security vendors with more than five security products in their environment, and only about half of the security events received in a given day are reviewed. Of that number, 28% are deemed legitimate, and less than half that number are remediated. We’re having a hard time keeping up, and our time needs to be spent on a live target, not something that’s no longer prevalent.


Gaining a view to adversary activity

In the report's introduction, Cisco covers the strategies that adversaries use today. These include taking advantage of poor patching practices, social engineering, and malware delivery through legitimate online content, such as advertising. I personally feel that you can't defend your network properly unless you know how you’re being attacked. I suppose you could look at it this way. Here in the United States, football is one of the most popular sports. It’s common practice for a team to study films of their opponents before playing them. This allows them to adjust their offensive and defensive game plan ahead of time. The same should be true for security professionals. We should be prepared to adjust to threats, and reviewing Cisco’s security report is similar to watching those game films.


In the security report, Cisco breaks down the most commonly observed malware by the numbers. It also discusses how attackers pair remote access malware with exploits in deliverable payloads. Some of what I gleaned from the report shows that the methods being used are the same as what was brought out in previous reports, with some slight modifications.


My take

From my point of view, the attacks are sophisticated, but not in a way that’s earth shattering. What I get from the report is that the real issue is that there are too many alerts from too many security devices, and security people can't sort through them efficiently. Automation is going to play a key role in security products. Until our security devices are smart enough to distinguish noise from legitimate attacks, we’re not going to be able to keep up. However, reading reports like this can better position our security teams to look in the right place at the right time, cutting down on some of the breaches we see. So, to make a long story short, be sure to read up on the Cisco Annual Security report. It’s written well, loaded with useful data, and helps security professionals stay on top of the security landscape.

In our pursuit of Better IT, I bring you a post on how important data is to functional teams and groups. Last week we talked about anti-patterns in collaboration, covering things like data mine-ing and other organizational dysfunctions. In this post we will be talking about the role shared data, information, visualizations, and analytics play in helping ensure your teams can avoid all those missteps from last week.


Data! Data! Data!

These days we have data. Lots and lots of data. Even Big Data, data so important we capitalize it! As much as I love my data, we can't solve problems with just raw data, even if we enjoy browsing through pages of JSON or log data. That's why we have products like NPM (Network Performance Monitor), SAM (Server & Application Monitor), and DPA to help us collect and parse all that data. Each of those products has specialized metrics it collects, meaning it applies to them, and visualizations to help specialized sysadmins leverage that data. These administrators probably don't think of themselves as data professionals, but they are. They choose which data to collect, which levels to be alerted on, and which to report upon. They are experts in this data and they have learned to love it all.

Shared Data about App and Infrastructure Resources

Within the SolarWinds product solutions, data about the infrastructure and application graph is collected and displayed on the Orion Platform. This means that cross-team admins share the same set of resources and components and the data about their metrics. Now we have PerfStack, with features to do cross-team collaboration via data. We can see entities we want to analyze, then see all the other entities related to them. This is what I call the Infrastructure and Application Graph, which I'll be writing about later. After choosing Entities, we can discover the metrics available for each of the entities and choose the ones that make the most sense to analyze based on the troubleshooting we are doing now.




Metrics Over Time


Another data feature that's critical to analyzing infrastructure issues is the ability to see data *over time*. It's not enough to know how CPU is doing right now; we need to know what it was doing earlier today, yesterday, last week, and maybe even last month, on the same day of the month. By having a view into the status of resources over time, we can intelligently make sense of the data we are seeing today. End-of-month processing going on? Now we know why there might be a slight spike in CPU pressure.
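To make the "over time" idea concrete, here is a minimal Python sketch of comparing a current CPU reading against readings from the same time on earlier days. Everything in it is hypothetical and for illustration only: the function names, the lookback choices, the 20-point margin, and the sample numbers are assumptions, not anything taken from PerfStack or the Orion Platform.

```python
from datetime import datetime, timedelta
from statistics import mean

def historical_context(samples, now, lookbacks):
    """Collect readings taken at `now - lookback` for each named lookback.

    `samples` maps datetime -> CPU percent; missing timestamps are skipped.
    """
    return {
        label: samples[now - delta]
        for label, delta in lookbacks.items()
        if (now - delta) in samples
    }

def looks_unusual(current, context, margin=20.0):
    """True if `current` exceeds the historical average by more than `margin` points."""
    if not context:
        return False  # nothing to compare against
    return current - mean(list(context.values())) > margin

# Hypothetical lookbacks: same time yesterday, last week, and four weeks ago.
LOOKBACKS = {
    "yesterday": timedelta(days=1),
    "last week": timedelta(days=7),
    "four weeks ago": timedelta(days=28),
}

now = datetime(2017, 3, 31, 9, 0)
samples = {  # made-up CPU percentages for illustration
    now - timedelta(days=1): 35.0,
    now - timedelta(days=7): 30.0,
    now - timedelta(days=28): 40.0,
}

context = historical_context(samples, now, LOOKBACKS)
print(looks_unusual(45.0, context))  # a modest bump over the historical average
print(looks_unusual(90.0, context))  # far above anything seen at this time before
```

The point of the sketch is only that a reading means little in isolation; the same 45% can be routine or alarming depending on what the history says.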


Visualizations and Analyses


The beauty of PerfStack is that by choosing these Entities and metrics we can easily build data visualizations of the metrics and overlay them to discover correlations and causes. We can then interact with the information we now have by working with the data or the visualizations. By overlaying the data, we can see how statuses of resources are impacting each other. This collaboration of data means we are performing "team troubleshooting" instead of silo-based "whodunits." We can find the issue, which until now might have been hiding in data in separate products.
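Overlaying two metrics is, in effect, eyeballing a correlation. As a rough sketch of the same idea in code, the Python below computes a Pearson correlation coefficient between two time-aligned series. The metric names and numbers are invented for illustration, and this is not how PerfStack itself works; it only shows the kind of relationship an overlay helps you spot.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length, time-aligned series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no measurable relationship
    return cov / sqrt(var_x * var_y)

# Hypothetical samples taken at the same five timestamps.
app_server_cpu = [20, 25, 60, 85, 30]   # percent
db_query_wait = [5, 6, 20, 31, 8]       # milliseconds

r = pearson(app_server_cpu, db_query_wait)
print(f"correlation: {r:.2f}")
```

A value near 1 says the two metrics rise and fall together, which tells the team where to look next. It does not by itself say which one is the cause.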




So we've gone from data to information to analysis in just minutes. Another beautiful feature of PerfStack is that once we've built the analyses that show our troubleshooting results, we can copy the URL, send it off to team members, and they can see the exact same analysis -- complete with visualizations -- that we saw. If we've done similar troubleshooting before and saved projects, we might be doing this in seconds.


This is often hours, if not days, faster than how we did troubleshooting in our previous silo-ed, data mine-ing approach to application and infrastructure support. We accomplished this by having quick and easy access to shared information that united differing views of our infrastructure and application graph.


Data -> Information -> Visualization -> Analysis -> Action


It all starts with the data, but we have to love the data into becoming actions. I'm excited about this data-driven workflow in keeping applications and infrastructure happy.

Needles and haystacks have a storied past, though never in a positive sense. Troubleshooting network problems comes as close as any to the process to which that pair alludes. Sometimes we just don't know what we don't know, and that leaves us with a problem: how do we find the information we're looking for when we don't know what we're looking for?



The geeks over at SolarWinds obviously thought the same thing and decided to do something to make life easier for those hapless souls frequently finding themselves tail over teakettle in the proverbial haystack; that product is PerfStack.


PerfStack is a really cool component piece of the Orion Platform as of the new 12.1 release. In a nutshell, what it allows you to do is to find all sorts of bits of information that you're already monitoring, and view it all in one place for easy consumption. Rather than going from this page to that, one IT discipline-domain to another, or ticket to ticket, PerfStack gives you more freedom to mix and match, to see only the bits pertinent to the problem at hand, whether those are in the VoIP systems, wireless, applications, or network. Who would have thought that would be useful, and why haven't we thought of that before?




In and of themselves, those features would be a welcome addition to the Orion suite--or any monitoring suite, for that matter--but SolarWinds took it one step further and designed PerfStack in such a way that you can create your own "PerfStacks" on the fly, as well as pass them around for other people to use. Let's face it, having a monitoring solution with a lot of canned reporting, stuff that just works right out of the box, is a great thing, but having the flexibility to create your own reports at a highly granular level is infinitely better. Presumably you know your environment better than the team at SolarWinds, or me, or anyone else. You shouldn't be forced into a modality that doesn't fit your needs.


Passing dashboards ("PerfStacks") around to your fellow team members, or whomever, is really a key feature here. Often we have a great view into the domain we operate within, whether that's virtualization, applications, networking, or storage, but we don't have the ability to share that with other people. That's certainly the case with point products, but even when we are all sharing the same tools it's not historically been as smooth a process as it could be. That's unfortunate, but PerfStack goes a long way toward breaking through that barrier.




There are additional features to PerfStack that bear mentioning: real-time updates to dashboards without redrawing the entire screen, saving of the dashboards, real-time import of new polling targets/events, etc. I will cover those details next time, but what we've talked about so far should be enough to show the value of the product. SolarWinds doesn't seem to believe in tiny rollouts. They've come out of the gates fast and strong with this update, and with good reason. It really is a great and useful product that will change the way you look at monitoring and troubleshooting.

Back in the office this week and excited for the launch of PerfStack. If you haven't heard about PerfStack yet, you should check out the webcast tomorrow: PerfStack Livecast


As usual, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!


Cloudflare Coding Error Spills Sensitive Data

A nice reminder about how you are responsible for securing your data, not someone else. Although Cloudflare® was leaking data, a company such as 1Password was not affected because they were encrypting their data in more than one way. In short, 1Password assumed that SSL/TLS *can* fail, and took responsibility to secure their data, rather than relying on someone else to do that for them. We should all be mindful about how we treat our data.


Microsoft Invests in Real-time Maps for Drones, and Someday, Flying Cars

Can we skip autonomous cars and go right to flying cars? Because that would be cool with me. And Microsoft® is doing their part to make sure we won't need to use Apple® Maps with our flying cars.


Expanding Fact Checking at Google

Nice to see this effort underway. I'm not a fan of crowdsourced entities such as Wikipedia, as they have inherent issues with veracity. Would be good for everyone if we could start verifying data posted online as fact (versus opinion, or just fake).


Wikipedia Bots Spent Years Fighting Silent, Tiny Battles With Each Other

Did I mention I wasn't a fan of Wikipedia? As if humans arguing over facts isn't bad enough, someone thought it was a good idea to create bots to do the job instead.



Besides the issue with fact checking, the internet is also a cesspool of misery. Perspective is an attempt to use Machine Learning to help foster better (or nicer) conversations online. I'm curious to see how this project unfolds.


Microsoft Surface: NSA Approves Windows 10 Tablets for Classified Work

Interesting to note here that this is only for devices manufactured by Microsoft, and not other vendors such as HP® or Dell®. What's more interesting to note is how Microsoft continues to make progress in areas of data security for both their devices and the hosted services (Azure®).


Alphabet's Waymo Alleges Uber Stole Self-Driving Secrets

I am simply amazed at how many mistakes Uber® can make as a company and still be in business.


The weather has been warm and Spring-like, so I decided to put the deck furniture out. So now, if it snows two feet next week, you know who to blame:

