
Geek Speak


By Joe Kim, SolarWinds Chief Technology Officer


Analysts and other industry experts have defined application performance monitoring as software that incorporates analytics, end-user experience monitoring, and a few other components, but this is just a small and basic part of the APM story. True APM is rich and nuanced, incorporating different approaches and tools for one common goal: keeping applications healthy and running smoothly.


Two Approaches to APM


How you use APM will depend on your agency’s environment. For example, you may prefer an APM approach that allows you to go inside underperforming applications and make changes directly to the code. In other cases, you may simply need to assess the overall viability of applications to help ensure their continued functionality. There are two very different methodologies that address both of these needs.


To solve a slow application problem, you may wish to dig down into the code itself to discover how long it takes for each portion of that code to process a transaction. From this, you’ll be able to determine, in aggregate, the total amount of transaction processing time for that application.


For this, you can use application bytecode instrumentation monitoring (ABIM) and Distributed Tracing. ABIM allows you to insert instrumentation into specific parts of the code. Monitoring processing times gives you the information you need to pinpoint where the problem exists and rectify the issue. For more complex application infrastructures that are distributed in nature, you can use Distributed Tracing to tag and track processes that go across multiple stacks and platforms. It’s a very specific and focused approach to APM, almost akin to a surgeon extracting a tumor.
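To make the idea of per-portion timing concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not how any ABIM product works: real bytecode instrumentation injects probes automatically, whereas this sketch wraps sections by hand, and the section names and `time.sleep` calls are hypothetical stand-ins for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds per named code section (names are illustrative).
section_totals = defaultdict(float)

@contextmanager
def timed(section):
    """Add the elapsed time of the wrapped block to the section's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        section_totals[section] += time.perf_counter() - start

def handle_transaction():
    """Stand-in for a real transaction with three instrumented portions."""
    with timed("parse_request"):
        time.sleep(0.01)
    with timed("query_database"):
        time.sleep(0.03)
    with timed("render_response"):
        time.sleep(0.01)

handle_transaction()

# Aggregate the per-portion timings into a transaction total.
total = sum(section_totals.values())
slowest = max(section_totals, key=section_totals.get)
for name, seconds in section_totals.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
print(f"total: {total * 1000:.1f} ms, slowest portion: {slowest}")
```

The per-section totals show where the time goes, and their sum gives the aggregate processing time for the transaction.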


Another, more general – though no less effective – approach is application interface performance management (AIPM). If ABIM is the surgeon’s tool, AIPM is something that a general practitioner might use.


AIPM allows you to monitor response times, wait times, and queue length, and provides near real-time visibility into application performance. You can receive instant alerts and detailed analytics regarding the root cause of problems. Once issues are identified, you can respond to them quickly and help your agency avoid unnecessary and costly application downtime.
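As a rough illustration of this kind of monitoring, the sketch below averages a few samples of response time, wait time, and queue length and raises an alert when an average crosses a threshold. The `Sample` type, the threshold values, and the alert format are all assumptions made for the example; they do not come from any particular product.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Sample:
    response_ms: float
    wait_ms: float
    queue_length: int

# Hypothetical alert thresholds; tune these for your own environment.
THRESHOLDS = {"response_ms": 500.0, "wait_ms": 200.0, "queue_length": 50}

def check(samples):
    """Return one alert message per metric whose average exceeds its threshold."""
    averages = {
        "response_ms": mean(s.response_ms for s in samples),
        "wait_ms": mean(s.wait_ms for s in samples),
        "queue_length": mean(s.queue_length for s in samples),
    }
    return [
        f"ALERT {name}: average {value:.1f} exceeds {THRESHOLDS[name]}"
        for name, value in averages.items()
        if value > THRESHOLDS[name]
    ]

recent = [Sample(400.0, 100.0, 10), Sample(800.0, 150.0, 20)]
for alert in check(recent):
    print(alert)
```

Here only the response time average (600 ms) crosses its threshold, so a single alert is produced.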


Tools and Their Features


There are a number of different monitoring solutions on the market, and it can be hard to determine which technologies will best fit your agency’s unique needs. Most of them will do the basics (alerts, performance metrics, etc.), but there are certain specialized features you’ll also want to look out for:


Insight into all of your applications. Applications are the lifeblood of an agency, and you’ll need solutions that provide you with insight into all of them, preferably from a single dashboard or control point.


A glimpse into the health of your hardware. Hardware failure can cause application performance issues. You’ll need to be able to monitor server hardware components and track things like high CPU load and other issues to gain insight into how they may be impacting application performance.


The ability to customize for different types of applications. Different types of applications (for example, custom or home-grown apps) may have different monitoring requirements, so you’ll need tools that can adapt to the applications in your stack.


As you can see, APM is far more intricate than some may have you believe, and that’s a good thing. You have far more resources at your fingertips than you may have thought. With the right combination of approaches and tools, you’ll be able to tackle even the trickiest application performance issues.


Find the full article on our partner DLT’s blog, Technically Speaking.


The Troubleshooting Radius

Posted by kong.yang Employee Mar 10, 2017

Most of the time, IT pros gain troubleshooting experience via operational pains. In other words, something bad happens and we, as IT professionals, have to clean it up. Therefore, it is important for you to have a troubleshooting protocol in place that is specific to dependent services, applications, and a given environment. Within those parameters, the basic troubleshooting flow should look like this:


      1. Define the problem.
      2. Gather and analyze relevant information.
      3. Construct a hypothesis on the probable cause for the failure or incident.
      4. Devise a plan to resolve the problem based on that hypothesis.
      5. Implement the plan.
      6. Observe the results of the implementation.
      7. Repeat steps 2-6 until the problem is resolved.
      8. Document the solution.


Steps 1 and 2 usually lead to a world of pain. First of all, you have to define the troubleshooting radius, the surface area of systems in the stack that you have to analyze to find the cause of the issue. Then, you must narrow that scope as quickly as possible to remediate the issue. Unfortunately, remediating in haste may not actually lead to uncovering the actual root cause of the issue. And if it doesn’t, you are going to wind up back at square one.


You want to get to the single point of truth with respect to the root cause as quickly as possible. To do so, it is helpful to combine a troubleshooting workflow with insights gleaned from tools that allow you to focus on a granular level. For example, start with the construct that touches everything, the network, since it connects all the subsystems. In other words, blame the network. Next, factor in the application stack metrics to further shrink the troubleshooting area. This includes infrastructure services, storage, virtualization, cloud service providers, web, etc. Finally, leverage a collaboration of time-series data and subject matter expertise to reduce the troubleshooting radius to zero and root cause the issue.


If you think of the troubleshooting area as a circle, as the troubleshooting radius approaches zero, one gets closer to the root cause of the issue. If the radius is exactly zero, you’ll be left with a single point. And that point should be the single point of truth about the root cause of the incident.


Share examples of your troubleshooting experiences across stacks in the comments below.

Late last month, shockwaves were sent through the SAP customer base as a UK court ruled in favor of SAP and against the mega spirits supplier Diageo in an indirect licensing case. The court determined that Diageo was violating SAP’s licensing T&Cs when they were connecting a 3rd-party app to their SAP ERP for a myriad of business process life cycles. In their claim, SAP is asking for £60m in unpaid fees. Yes, £60m! Pending appeal, the court will make a decision on the actual amount to be paid within the month. As a fellow SAP customer, my company is now in a hurry to audit all the systems that are connecting to our SAP ERP to verify compliance, regardless of the fact that we conduct a license “True Up” with SAP every year.


This case reminds me of a licensing change that Microsoft made for SQL Server back in 2011, aka “The Money Grab.” Microsoft decided to change enterprise agreement licensing in late 2011 for SQL Server from per-processor to per-core. This left many companies, mine included, scrambling to reduce, consolidate, or eliminate SQL servers ahead of their enterprise agreement renewal with Microsoft, usually with only a couple of months’ notice.


A common, and humorous, comparison that I often come across is that Lincoln’s historic Gettysburg Address clocks in at a shade over two minutes, yet the standard EULA for any software these days is more than three pages. Who has the time or patience to read that? Now ask yourself, how many software packages and applications do you have running across your enterprise? Do you, or someone else at your company, know the terms and conditions of the licensing for these software packages? Better yet, are they being regularly audited for compliance and/or usage reviewed to minimize spend? Don’t fear. There are many firms out there ready to provide their services when it comes to software license audits, but for a hefty sum.


It's difficult to predict the next “Money Grab” and who it will come from. I predict that as more companies go all in with the cloud, it will come from there. Think about it: IaaS equals cheap space and cheap processing for hungry consumers.


How do you react when it is too late and the vendor is knocking on your door? How do you remain proactive, stay organized, and prevent sprawl? Do you have all your T&Cs on file?

Well, here we are in our final post in the series. We’ve discussed several topics related to entering the network security job force. And with today’s market there’s more potential than ever to secure a job as an entry-level security analyst. The question we will address in this post is this: “How do I make the transition into a cybersecurity role, and then where do I go?”


Securing a job

First, you’ll need to polish up your resume if you plan on targeting a cybersecurity role. You’ll want to include your training and certifications, but what about experience? You could gain some experience by participating in open hackathons, which will allow you to demonstrate some security skills. Aside from that, you could volunteer or intern part time to gain some valuable experience. I have a friend who requested to be the network liaison for any security projects his company had. Being on the team that deployed FirePOWER helped him immensely.


Job boards

Once your resume is polished, you’ll want to head to the job boards. Today, I find that LinkedIn provides a pretty active environment filled with recruiters who scour the vast pool of online profiles. If you’re looking for some temp-to-hire work, this might be a good place to begin. Aside from that, the standard job sites exist, but more often than not it’s best to know someone already in a role who can help you out. Have you found success using LinkedIn? If so, I’d love to hear about the process, as well as any recommendations. Share them in the comments.


Let’s keep this a secret

I was talking to a colleague some time back about a new position he took with the federal government. He was already on a networking team that managed an unclassified network, and his day-to-day was pretty mundane. After his transition into a security team, he was having a hard time with the secrecy about his work. It wasn’t so much that he couldn’t talk about anything; it was more that he had to be very careful about what he said. Assume he’s out having drinks with some co-workers. In casual conversation he mentions that he is dealing with a widespread breach inside the government network that has caused certain data to be leaked. Unbeknownst to him, there is a guy next to him at the bar who works for the press. The next morning there’s a front page story about data loss at the Pentagon. You see how bad this could be, right? In actuality, he doesn’t work at the Pentagon, the data that was leaked was unclassified reports about tidal flows, and the government agency he works for is NOAA. I should mention here that this scenario is completely fabricated, simply to make my point. When you transition into a security role, you’re going to have to learn to keep a tight lip on what you’re doing, more so than when you worked on the network team.



I’ll keep this section brief. Are there politics to play in the cybersecurity job force? Yep. But I don’t play them, or even attempt to comment on them. Just do your job to the best of your ability.



You’ll need to beef up your education a bit more than before if you’re transitioning from a networking role. The world of security changes more rapidly, and threats morph and take on new forms much more aggressively than ever before. InfoSec World is a trade show that you may be interested in following. There are others you may want to attend at least once a year, for the purpose of networking with peers and receiving updates on the latest threats and the products that can help mitigate them. You may not have much of a say in your organization's purchasing decisions, but if you can add intelligent dialogue to those conversations, you are much more valuable as an employee.


Where to go from there?

From there, I’d recommend working your way up through the ranks. Decide what niche you want to focus on and become a specialist in that area. Keep current in your certifications in the event you need to look to another organization for employment. It’s good to be a loyal employee, but your loyalty should be first and foremost to you and your family. If you are being taken advantage of in your current position, quietly find work elsewhere and do it the right way. Give your notice and don’t burn bridges. This world is small and odds are you may cross paths with former supervisors in the future.


There’s so much to do in the world of cybersecurity. Really, the sky’s the limit. If you’re on the verge of a transition to a security role, I wish you the best and urge you to keep on learning. Maybe you can even give back some of what you glean from the community by contributing yourself.



I've worked in IT for a long time (I stopped counting at twenty years, quite a while ago).  This experience means that I generally do well in troubleshooting in data-related areas.  In other areas, like networking, I'm pretty much done at "do I have an IP address?" and "is it plugged in?"


This is why team collaboration on IT issues, as I posted before, is so important.


What Can Go Wrong?


One of the things I've noticed is that while people can be experts in deploying solutions, this doesn't mean they are great at diagnosing issues. You've worked with that guy.  He's great at getting things installed and working.  But when things go wrong, he just starts pulling out cables and grumbling about other people's incompetence.  He keeps making changes and does several at the same time.  He's a nightmare.  And when you try to step in to help him get back on a path, he starts laying blame before he starts diagnosing the issue. You don't have to be that guy, though, to have challenges in troubleshooting.


Some of the effects that can contribute to troubleshooting challenges:


Availability Heuristic


If you have recently solved a series of NIC issues, the next time someone reports slow response times, you're naturally going to first consider a NIC issue.  And many times, this will work out just fine.  But if it constrains your thinking, you may be slow to get to the actual cause.  The best way to fight this cognitive issue is to gather data first, then assess the situation based on your entire troubleshooting experience.


Confirmation Bias


Confirmation bias goes hand in hand with the availability heuristic. Once you have narrowed down what you think is causing the slow response times, your brain will want you to go look for evidence that the problem is indeed the network cards.  The best way to fight this is to recognize when you are looking for proof instead of looking for data.  Another way to overcome confirmation bias is to collaborate with others on what they are seeing.  While groupthink can be an issue, it's less likely for a group to share the same confirmation bias equally.


Anchoring Heuristic


So to get here, you have limited your guesses to recent issues, you have searched out data to prove the correctness of your diagnosis and now you are anchored there.  You want to believe.  You may start rejecting and ignoring data that contradicts your assumptions. In a team environment, this can be one of the most frustrating group troubleshooting challenges. You definitely don't want to be that gal.  The one who won't look at all the data. Trust me on this.




I use intuition a lot when I diagnose issues.  It's a good thing, in general.  Intuition helps professionals take a huge amount of data and narrow it down to a manageable set of causes. It's usually based on having dealt with similar issues hundreds or thousands of times over the course of your career.  But intuition without follow-up data analysis can be a huge issue.  This often happens due to ego or lack of experience.  The Dunning-Kruger effect (not knowing what you don't know) can also be a factor here.


There are other challenges in diagnosing causes and effects of IT issues. I highly recommend reading up on them so you can spot these behaviours in others and yourself.


Improving Troubleshooting Skills


  1. Be Aware.
    The first thing you can do to improve the speed and accuracy of your troubleshooting is to recognize these behaviours when you are doing them.  Being self-aware, especially when you are under pressure to bring systems back online or have a boss pacing behind your desk asking "when will this be fixed?" will help you focus on the right things.  In a truly collaborative, high trust environment, team members can help others check whether they are having challenges in diagnosing based on the biases above.
  2. Get feedback.
    We are generally lucky in IT that we, unlike other professions, can almost always immediately see the impact of our fixes to see if they actually fixed the problem.  We have tools that report metrics and users who will let us know if we were wrong.  But even so, post-event analyses that document what we got right and what we got wrong can help us improve our methods.
  3. Practice.
    Yes, every day we troubleshoot issues.  That counts as practice.  But we don't always test ourselves like other professions do.  Disaster Recovery exercises are a great way to do this, but I've always thought we needed troubleshooting code camps/hackathons to help us hone our skills. 
  4. Bring Data.
    Data is imperative to punching through the cognitive challenges listed above.  Imagine diagnosing a data-center wide outage and having to start by polling each resource to see how it's doing.  We must have data for both intuitive and analytical responses.
  5. Analyze.
    I love my data.  But it's only an input into a diagnostic process.  Considering metrics in a holistic, cross-platform, cross-team view is the next step.  A shared analysis platform for combining and overlaying data to get to the real answers makes all this smoother and faster.
  6. Log What Happened. 
    This sounds like a lot of overhead when you are under pressure (is your boss still there?), but keeping a quick list of what was done, what your thought process was, and what others did can be an important part of professional practice.  Teams can even share the load of writing stuff down.  This sort of knowledgebase is also important for when you run into the rare things that have a simple solution but you can't remember exactly what to do (or even what not to do).

A person with experience can be an experienced non-expert. But with data, analysis and awareness of our biases and challenges in troubleshooting, we can get problems solved faster and with better accuracy. The future of IT troubleshooting will be based more and more on analytical approaches.


Do you have other tips for improving your troubleshooting and diagnostic skills?  Do you think we should get formal training in troubleshooting?


The Full Stack Engineer

Posted by SomeClown Mar 9, 2017

One of the hot topics in the software engineering world right now is the idea of what is being called the “full stack developer.” In this ecosystem, a full stack developer is someone who, broadly speaking, is able to write applications that encompass the front-end interface, the core logic, and the database backend. In other words, where it is common for engineers to specialize in one area or another, becoming very proficient in one or two languages, a full stack engineer is proficient in many languages, systems, and modalities so that they can, given the time, implement an entire non-trivial application from soup to nuts.


In the network engineering world, this might be analogous to an engineer who has worked in many roles over a lifetime, and has developed skills across the board in storage, virtualization, compute, and networking. Such an engineer has likely worked in mid-level organizations where siloes are not as prevalent or necessary as in larger organizations. Many times these are the engineers who become senior level architects in organizations, or eventually move into some type of consulting role, working for many organizations as a strategist. This is what I’ll call the full stack IT engineer.


While the skills and background needed to get to this place in your career likely put you into the upper echelon of your cohort, there can be some pitfalls. The first is the risk of ending up with a skillset that is very broad, but not very deep. Being able to talk about a very wide scope and scale of the IT industry, based on honest, on-the-ground experience, is great, but it also becomes difficult to maintain that deep level of skill in every area of IT. In the IT industry, however, I do not see this as a weakness per se. If you’ve gotten to this level of skill and experience, you are hopefully in a more consultative role, and aren’t being called on to put hands on keyboard daily. The value you bring at this level is that of experience, and the ability to see the whole chess board without getting bogged down in any one piece.


The other pitfall along the road to becoming a full stack engineer is the often overlooked aspect of training, whether on the job or on your own. If you are not absolutely dedicated to your craft, you will never, quite frankly, get to this level in your career. You’re going to be doing a daily job, ostensibly focusing on less than a broad spectrum of technologies. While you may move into those areas later, how do you learn today? And when you do move into other technologies, how do you keep the skills of today fresh for tomorrow? Honestly, the only way you’ll get there is to study, study, study pretty much all of the time. You have to become a full-time student, and develop a true passion for learning. Basically, the whole stack has to become your focus, and if you’re only seeing part of it at work, you have to find the rest at home.


What does all of this have to do with SolarWinds and PerfStack? Simple: troubleshooting using a (wait for it) full stack solution is going to expose you to how the other half lives. Since PerfStack allows, and encourages, dashboards (PerfStacks) to be passed along as troubleshooting ensues, you should have some great visibility into the challenges and remedies that other teams see. If you’re a virtualization engineer and get handed a problem to work, presumably the network team, datacenter team, facilities, and probably storage have all had a hand in ascertaining where the problem does or does not lie. Pay attention to that detail, study it, and ask questions as you get the opportunity. Make time to ask what certain metrics mean, and why one metric is more important than another. Ask the voice guys whether jitter or latency is worse for their systems, or the storage guys why IOPS matter. Ask the VM team why virtual machine mobility needs 10ms or less (generally) in link latency, or why they want stretched layer-2 between data centers.


It may seem banal to equate the full stack IT engineer with a troubleshooting product (even as great as PerfStack is), but the reality is that you have to take advantage of everything that is put in front of you if you want to advance your career. You’re going to be using these tools on a regular basis anyhow, so why not take advantage of what you have? Sure, learn the tool for what it’s designed for, and learn your job to the best of your ability, but also look for opportunities like these to advance your career and become more valuable to both your team and the next team you’re on, whether at your current company or a new one down the road.

It was a busy week for service disruptions and security breaches. We had Amazon S3 showing us that, yes, the cloud can go offline at times. And we found out that our teddy bears may be staging an uprising. And we also found that Uber has decided to use technology and data to continue operating illegally in cities and towns worldwide. Not a good week for those of us who enjoy having data safe, secure, and available.


So, here's a handful of links from the intertubz I thought you might find interesting. Enjoy!


Data from connected CloudPets teddy bears leaked and ransomed, exposing kids' voice messages

The ignorance (or hubris) of the CloudPets CEO is on full display here. I am somewhat surprised that anyone could be this naive with regard to security issues these days.


Yahoo CEO Loses Bonus Over Security Lapses

Speaking of security breaches, Yahoo is in the news again. You might think that losing $2 million USD would sting a bit, but considering the $40 million she gets for running Yahoo into the ground I think she will be okay for the next few years, even with living in the Valley.


Hackers Drawn To Energy Sector's Lack Of Sensors, Controls

I'd like to think that someone, somewhere in our government, is actively working to keep our grid safe. Otherwise, it won't be long before we start to see blackouts as a result of some bored teenager on a lonely summer night.


Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

Thanks to a typo, the Amazon S3 service was brought to a halt for a few hours last week. In the biggest piece of irony, the popular website Is It Down Right Now? Website Down or Not? was, itself, down as a result. There's a lot to digest with this outage, and it deserves its own post at some point.


How Uber Deceives the Authorities Worldwide

I didn't wake up looking for more reasons to dislike how Uber is currently run as a business, but it seems that each week they reach a new low.


Thirteen thousand, four hundred, fifty-five minutes of talking to get one job

A bit long, but worth the read as it helps expose the job hiring process and all the flaws in the current system used by almost every company. I've written about bad job postings before, as well as how interviews should not be a trivia contest, so I enjoyed how this post took a deeper look.


If the Moon Were Only 1 Pixel - A tediously accurate map of the solar system

Because I love things like this and I think you should, too.


Just a reminder that the cloud can, and does, go offline from time to time:




Last week Amazon Web Services S3 storage in the East region went offline for a few hours. Since then, AWS has published a summary review of what happened. I applaud AWS for their transparency, and I know that they will use this incident as a learning lesson to make things better going forward. Take a few minutes to read the review and then come back here. I'll wait.


Okay, so it's been a few days since the outage. We've all had some time to reflect on what happened. And some of us have decided that now is the time to put on our Hindsight Glasses and run down a list of lingering questions and comments regarding the outage.


Let's break this down!


"...we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."

This, to me, is the most inexcusable part of the outage. Anyone that does business continuity planning will tell you that annual checks are needed on such play books. You cannot just wave that away with, "Hey, we've grown a lot in the past four years and so the play book is out of date." Nope. Not acceptable.


"The servers that were inadvertently removed supported two other S3 subsystems."

The engineers were working on a billing system, and they had no idea that those billing servers would impact a couple of key S3 servers. Which brings about the question, "Why are those systems related?" Great question! This reminds me of the age-old debate regarding dedicated versus shared application servers. Shared servers sound great until one person needs a reboot, right? No wonder everyone is clamoring for containers these days. Another few years and mainframes will be under our desks.


"Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended."

But the command was accepted as valid input, which means the code doesn't have any check to make certain that the command was indeed valid. This is the EXACT scenario that resulted in Jeffrey Snover adding the -WhatIf and -Confirm parameters into PowerShell. I'm a coding hack, and even I know the value in sanitizing your inputs. This isn't just something to prevent SQL injection. It's also to make certain that as a cloud provider you don't delete a large number, or percentage, of servers by accident.
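To show what that kind of guard could look like, here is a small Python sketch of a capacity removal function that validates its input, caps how much can be removed at once, and defaults to a dry run in the spirit of -WhatIf. Everything here is hypothetical: the function name, the `max_fraction` and `what_if` parameters, and the server names are invented for illustration and say nothing about how the AWS tooling actually works.

```python
def remove_capacity(servers, to_remove, max_fraction=0.05, what_if=True):
    """Remove servers from a fleet, with input checks and a dry-run default.

    max_fraction and what_if are hypothetical safeguards for illustration.
    """
    requested = set(to_remove)

    # Reject names that are not part of the fleet (catches typos).
    unknown = requested - set(servers)
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    # Refuse to remove more than a fixed fraction of the fleet in one call.
    limit = int(len(servers) * max_fraction)
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} servers; limit is {limit}"
        )

    # Dry run: report what would happen and leave the fleet untouched.
    if what_if:
        print(f"What if: would remove {sorted(requested)}")
        return list(servers)

    return [s for s in servers if s not in requested]

fleet = [f"srv{i:03d}" for i in range(100)]
fleet_after_dry_run = remove_capacity(fleet, ["srv001"])
fleet_after_removal = remove_capacity(fleet, ["srv001"], what_if=False)
```

With a 100-server fleet and the default 5% cap, a mistyped request that matched ten servers would raise an error instead of quietly removing them.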


"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly."

So, they don't ever ask themselves, "What if?" along with the question, "Why?" These are my favorite questions to ask when designing/building/modifying systems. The 5-Whys is a great tool to find the root cause, and the use of "what if" helps you build better systems that help avoid the need for root cause reviews.


"We will also make changes to improve the recovery time of key S3 subsystems."

Why wasn't this a thing already? I cannot understand how AWS would get to the point that it would not have high availability already built into their systems. My only guess here is that building such systems costs more, and AWS isn't interested in things costing more. In the race to the bottom, corners are cut, and you get an outage every now and then.


"...we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3."

The AWS dashboard for the East Region was dependent upon the East Region being online. Just let that sink in for a bit. Hey, AWS, let me know if you need help with monitoring and alerting. We'd be happy to help you get the job done.
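One way to catch this class of problem ahead of time is to walk your dependency map and ask whether the monitoring path leans on the thing it is supposed to report on. The Python sketch below does that with a simple graph search. The component names and the dependency map are made up for the example and are only loosely modeled on the quote above.

```python
def depends_on(graph, component, target):
    """Return True if component depends on target, directly or transitively."""
    seen = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, ()):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False

# Hypothetical dependency map: each key depends on the services in its list.
dependencies = {
    "status_dashboard": ["dashboard_admin_console"],
    "dashboard_admin_console": ["object_storage"],
    "object_storage": ["index_subsystem", "placement_subsystem"],
}

monitored = "object_storage"
if depends_on(dependencies, "status_dashboard", monitored):
    print(f"status_dashboard cannot report on {monitored} if {monitored} is down")
```

In this made-up map the check fires, because the dashboard reaches the storage service through its admin console.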


"Other AWS services in the US-EAST-1 Region that rely on S3 for storage...were also impacted while the S3 APIs were unavailable."

Many companies that rely on AWS to be up and running were offline. My favorite example is that the popular website Is It Down Right Now? Website Down or Not? was, itself, down as a result of the outage. If you migrate your apps to the cloud, you need to take responsibility for availability. Otherwise, you run the risk of being down with no way to get back up.


Look, things happen. Stuff breaks all the time. The reason this was such a major event is because AWS has done amazing work in becoming the largest cloud provider on the planet. I'm not here to bury AWS, I'm here to highlight the key points and takeaways from the incident to help you make things better in your shop. Because if AWS, with all of its brainpower and resources, can still have these flaws, chances are your shop might have a few, too. 

I have been talking about the complexity of resolving performance issues in modern data centers. I’ve particularly been talking about how it is a multi-dimensional problem. Also, that virtualization significantly increases the number of dimensions for performance troubleshooting. My report of having been forced to use Excel to coordinate brought some interesting responses. It is, indeed, a very poor tool for consolidating performance data.


I have also written in other places about management tools that are focused on the data they collect, rather than helping to resolve issues. What I really like about PerfStack is the ability to use the vast amount of data in the various SolarWinds tools to identify the source of performance problems.


The central idea in PerfStack is to gain insights across all of the data that is gathered by various SolarWinds products. Importantly, PerfStack allows the creation of ad hoc collections of performance data. Performance graphs for multiple objects and multiple resource types can be stacked together to identify correlation. My favorite part was adding multiple different performance counters from the different layers of infrastructure to a single screen. This is where I had the Excel flashback, only here the consolidation is done programmatically. No need for me to make sure the time series match up. I loved that the performance graphs were re-drawing in real time as new counters were added. Even better was that the re-draw was fast enough that counters could be added on the off chance that they were relevant. When they are not relevant, they can simply be removed.  The hours I wasted building Excel graphs translate into minutes of building a PerfStack workspace.
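For a sense of the manual work this replaces, here is a small Python sketch of lining up two time series on their shared timestamps and checking how strongly they move together. It is my own illustration of the general idea, not a description of how PerfStack does it, and the counter names and sample values are invented.

```python
from math import sqrt

def align(series_a, series_b):
    """Keep only timestamps present in both series, in time order."""
    shared = sorted(series_a.keys() & series_b.keys())
    return [series_a[t] for t in shared], [series_b[t] for t in shared]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical samples keyed by seconds since the start of the window.
hypervisor_cpu_pct = {0: 20.0, 60: 35.0, 120: 60.0, 180: 90.0, 240: 85.0}
app_response_ms = {60: 110.0, 120: 180.0, 180: 400.0, 240: 390.0, 300: 120.0}

cpu, response = align(hypervisor_cpu_pct, app_response_ms)
r = pearson(cpu, response)
print(f"{len(cpu)} shared samples, correlation {r:.2f}")
```

A correlation close to 1 suggests the two counters rise and fall together, which is a hint about where to look next rather than proof of a cause.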


I have written elsewhere about systems management tools that get too caught up in the cool data they gather. These tools typically have fixed dashboards that give pretty overviews. They often cram as much data as possible into one screen. What I tend to find is that these tools are inflexible about the way the data is combined. The result is a dashboard that is good at showing that everything is, or is not, healthy but does not help a lot with resolving problems. The dynamic nature of the PerfStack workspace lends itself to getting insight out of the data, and helping identify the root cause of problems. Being able to quickly assemble the data on the load on a hypervisor and the VM operating system, as well as the application statistics, speeds troubleshooting. The ability to quickly add performance counters for the other application dependencies lets you pinpoint the cause of the issue quickly. It may be that the root cause is a domain controller that is overloading its CPU, while the symptom is a SharePoint server that is unresponsive.


PerfStack allows very rapid discovery of issue causes. The value of PerfStack will vastly increase as it is rolled out across the entire SolarWinds product suite.


You can see the demonstrations of PerfStack that I saw at Tech Field Day on Vimeo: NetPath here and SAM here.

As IT professionals, we have a wide variety of tools at our disposal for any given task. The same can be said for the attackers behind the increasing strength and number of DDoS attacks. The latest trend of hijacked IoT devices, like the Mirai Botnet, deserves a lot of attention because of their prevalence and ability to scale, mostly due to a lack of security and basic protections. This is the fault of both manufacturers and consumers. However, DDoS attacks at scale are not really a new thing, because malware-infected zombie botnets have been around for a while. Some fairly old ones are still out there, and attackers don’t forget their favorites.


One of the largest attacks in 2016 came in October, and measured in at 517 Gbps. This attack was not a complex, application-layer hack, or a massive DNS reflection, but a massive attack from malware that has been around for more than two years, called Spike. Spike is commonly associated with x86 Linux-based devices (often routers with unpatched vulnerabilities), and is able to generate large amounts of application-layer HTTP traffic. While Mirai and other IoT botnets remained top sources of DDoS traffic in 2016, they were not alone.




The complexity of these attacks continues to evolve. What used to be simple volumetric flooding of UDP traffic has moved up the stack over time. Akamai reports that between Q4 2015 and Q4 2016 there was a 6% increase in infrastructure layer attacks (layer 3 & 4), and a 22% increase in reflection-based attacks. At the same time, while overall web application attacks decreased, there was a 33% increase in SQLi attacks.


The application layer attacks become increasingly difficult to mitigate due to their ability to mimic real user behavior. They are more difficult to identify, and often have larger payloads. They are often combined with other lower-level attacks for variety and larger attack surface. This requires vigilance on the part of those responsible for the infrastructure we rely on, to protect against all possible attack vectors.




Not surprising is the fact that China and the United States are the primary sources of DDoS attacks, with China dominating Q1, Q2, and Q3 of 2016. The United States “beat” China in Q4, spiking to 24% of global DDoS traffic for that quarter. The increase in the number of source IP addresses here is dramatic, with the U.S. numbers tripling from about 60K in Q3 to 180K in Q4. This is largely suspected to be due to a massive increase in IoT (Mirai) botnet sources. Black Friday sales, perhaps?


While attacks evolve, become larger and more complex, some simple tried-and-true methods of disrupting the internet can still be useful. Old tools can become new again. Reports from major threat centers consistently show that Conficker is still one of the most prevalent malware variants in the wild, and it has been around since 2008.


Malware is often modeled after real biological viruses, like the common cold, and it is not easily eliminated. A handful of infected machines can re-populate and re-infect thousands of others in short order, and this is what makes total elimination a near impossibility.


There is no vaccine for malware, but what about treating the symptoms?


A concerted effort is required to combat the looming and real threat these DDoS attacks pose. Manufacturers of infrastructure products, consumer IoT devices, mobile phones, service providers, enterprise IT organizations, and even the government are on the case. Each must actively do their part to reinforce against, protect from, and identify sources of malware to slow the pace of this growing problem.


The internet is not entirely broken, but it is vulnerable to the exponential scale of the DDoS threat.

By Joe Kim, SolarWinds Chief Technology Officer


It’s time to stop treating data as a commodity and create a secure and reliable data recovery plan by following a few core strategies.


1. Establish objectives


Establish a Recovery Point Objective (RPO) that determines how much data loss is acceptable. Understanding acceptable risk levels can help establish a baseline understanding of where DBAs should focus their recovery efforts.


Then, work on a Recovery Time Objective (RTO) that shows how long the agency can afford to be without its data.
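
As a rough illustration of how the two objectives can be checked in practice, the sketch below compares the age of the newest backup against an RPO and a measured restore time against an RTO. The function names, the four-hour RPO, the one-hour RTO, and the timestamps are all hypothetical values chosen for the example, not recommendations.

```python
from datetime import datetime, timedelta

def rpo_met(last_backup, now, rpo):
    """True if the newest backup is recent enough to satisfy the RPO."""
    return now - last_backup <= rpo

def rto_met(restore_duration, rto):
    """True if a measured restore finished within the RTO."""
    return restore_duration <= rto

# Hypothetical figures for one database.
now = datetime(2017, 3, 1, 12, 0)
last_backup = datetime(2017, 3, 1, 8, 30)   # 3.5 hours old
measured_restore = timedelta(minutes=90)    # from the last restore test

rpo_ok = rpo_met(last_backup, now, rpo=timedelta(hours=4))
rto_ok = rto_met(measured_restore, rto=timedelta(hours=1))
print("RPO met:", rpo_ok)
print("RTO met:", rto_ok)
```

In this made-up case the backup schedule satisfies the RPO, but the restore takes longer than the agency can afford, which is exactly the kind of gap that only shows up when both numbers are written down and measured.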


2. Understand differences between backups and snapshots


There’s a surprising amount of confusion about the differences between database backups, server tape backups, and snapshots. For instance, many people have a misperception that a storage area network (SAN) snapshot is a backup, when it’s really only a set of data reference markers. Remember that a true backup, either on- or off-site, is one in which data is securely stored in the event that it needs to be recovered.


3. Make sure those backups are working


Although many DBAs will undoubtedly insist that their backups are working, the only way to know for sure is to test the backups by doing a restore. This will provide assurance that backups are running — not failing — and highly available.


4. Practice data encryption


DBAs can either encrypt the database backup file itself, or encrypt the entire database. That way, if someone takes a backup, they won’t be able to access the information without a key. DBAs must also ensure that if a device is lost or stolen, the data stored on the device remains inaccessible to users without proper keys.


5. Monitor and collect data


Combined with network performance monitoring and other analysis software, real-time monitoring and real-time data collection can improve performance, reduce outages, and maintain network and data availability.


Real-time collection of information can be used to do proper data forensics. This will make it easier to track down the cause of an intrusion, which can be detected through monitoring.


Monitoring, database analysis, and log and event management can help DBAs understand if something is failing. They’ll be able to identify potential threats through things like unusual queries or suspected anomalies. They can compare the queries to their historical information to gauge whether or not the requests represent potential intrusions.
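
One simple way to compare current activity with historical information is a standard-deviation check against a baseline. The sketch below is only an illustration of that idea; the function name, the threshold of three standard deviations, and the hourly query counts are assumptions, and real monitoring products use more sophisticated baselining.

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations
    away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical count of queries against one table, per hour, over past days.
hourly_query_counts = [120, 130, 125, 118, 127, 122, 131]

print(is_anomalous(hourly_query_counts, 124))  # a typical hour
print(is_anomalous(hourly_query_counts, 900))  # a sudden burst of queries
```

A flagged value is a prompt to investigate, not proof of an intrusion; a month-end batch job can look just as unusual as an attacker.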


6. Test, test, test


If you’re managing a large database, there’s simply not enough space or time to restore and test it every night. DBAs should test a random sampling taken from their databases. From this information, DBAs can gain confidence that they will be able to recover any database they administer, even if that database is in a large pool. If you’re interested in learning more, check out this post, which gets into further detail on database sampling.
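
The sampling idea can be sketched in a few lines. This is a minimal example under assumed names: `pick_restore_sample`, the 10% fraction, and the database names are all made up for illustration, and the restore test itself is left to whatever tooling you already use.

```python
import random

def pick_restore_sample(databases, fraction=0.1, seed=None):
    """Pick a random subset of databases to restore-test in this cycle."""
    if not databases:
        return []
    rng = random.Random(seed)
    size = min(len(databases), max(1, round(len(databases) * fraction)))
    return rng.sample(databases, size)

# Hypothetical inventory of 50 databases.
inventory = [f"db{n:02d}" for n in range(50)]
tonight = pick_restore_sample(inventory, fraction=0.1)
print("Restore-test tonight:", tonight)
```

Because a fresh sample is drawn each cycle, every database has a chance of being picked over time, without the cost of restoring all of them every night.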


Data is quickly becoming a truly precious asset to government agencies, so it is critical to develop a sound data recovery plan.


Find the full article on our partner DLT’s blog, Technically Speaking.

I’ve long held the belief that for any task there are correct approaches and incorrect ones. When I was small, I remember being so impressed by the huge variety of parts my father had in his tool chest. Once, I watched him repair a television remote control, one that had shaped and tapered plastic buttons. The replacement from RCA/Zenith, I believe at the time, cost upwards of $150. He opened the broken device, determined that the problem was that the tongue on the existing button had broken, and rather than epoxy the old one back together, he carved and buffed an old bakelite knob into the proper shape, attached it in place of the original one, and ultimately, the final product looked and performed as if it were the original. It didn’t even look different than it had. This, to me, was the ultimate accomplishment. Almost as the Hippocratic Oath dictates, above all, do no harm. It was magic.


When all you have is a hammer, everything is a nail, right? But that sure is the wrong approach.


Today, my favorite outside work activity is building and maintaining guitars. When I began doing this, I didn’t own some critical tools. For example, an entire series of “Needle Files” and crown files are appropriate for the shaping and repair of frets on the neck. While not a very expensive purchase, all other tools would fail in the task at hand. The correct Allen wrench is necessary for adjusting the truss rod in the neck. And the ideal soldering iron is critical for proper wiring of pickups, potentiometers, and the jack. Of course, when sanding, a variety of grades are also necessary. Not to mention, a selection of paints, brushes, stains, and lacquers.


The same can be said of DevOps. Programming languages are designed for specific purposes, and there have been many advances in the past few years pointing to what a scripting task may require. Many might use Bash, batch, or PowerShell to do their tasks. Others may choose PHP or Ruby on Rails, while still others choose Python as their scripting tools. Today, it is my belief that no one tool can accommodate every action that's necessary to perform these tasks. There are nuances to each language, but one thing is certain: many tasks require the collaborative conversation between these tools. To accomplish these tasks, the ideal tools will likely call functions back and forth from other scripting languages. And while some bits of code are required here and there, currently it's the best way to approach the situation, given that many tools don't yet exist in packaged form. The DevOps engineer, then, needs to write and maintain these bits of code to help ensure that they are accurate each time they are called upon. 
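
As a small illustration of tools calling one another, the sketch below shows one script invoking an external program and parsing its output. The wrapper name `run_tool` and the JSON payload are hypothetical; to keep the example self-contained, the "other tool" is just a second Python process, where in practice it might be a Bash or PowerShell script.

```python
import json
import subprocess
import sys

def run_tool(command, timeout=30):
    """Run an external command, raise on a non-zero exit code,
    and parse the JSON it prints to stdout."""
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raises CalledProcessError if the tool fails
    )
    return json.loads(result.stdout)

# Stand-in for another script: a child process that prints a JSON report.
child = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'ok', 'items': 3}))",
]

report = run_tool(child)
print(report["status"], report["items"])
```

Passing structured data such as JSON between tools, and failing loudly when the called tool fails, keeps these glue scripts easier to test and maintain.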


As correctly stated in comments on my previous posting, I need to stress that there must be testing prior to utilizing these custom pieces of code to help ensure that other changes that may have taken place within the infrastructure are accounted for each time these scripts are set to task.


I recommend that anyone who is in DevOps get comfortable with these and other languages, and learn which do the job best so that DevOps engineers become more adept at facing challenges.


At some point, there will be automation tools with slick GUIs that may address many or even all of these needs when they arise. But for the moment, I advise learning, utilizing, and customizing scripting tools. In the future, when these tools do become available, the question is, will they surpass the automation tools you’ve already created via your DevOps? I cannot predict.

As you spend more time in security, you start to understand that keeping up with the latest trends is not easy. Security is a moving target, and many organizations simply can’t keep up. Fortunately for us, Cisco releases an annual security report that can help us out in this regard. You can find this year's report, as well as past reports, here. In this post, I wanted to share a few highlights that illustrate why I believe security professionals should be aware of these reports.


Major findings

A nice feature of the Cisco 2017 Annual Cyber Security Report is the quick list of major findings. This year, Cisco notes that the three leading exploit kits -- Angler, Nuclear, and Neutrino -- are vanishing from the landscape. This is good to know, because we might be spending time and effort looking for these popular attacks while other lesser-known exploit kits start working their way into the network. And based on Cisco’s findings, most companies are using several security vendors with more than five security products in their environment, and only about half of the security events received in a given day are reviewed. Of that number, 28% are deemed legitimate, and less than half that number are remediated. We’re having a hard time keeping up, and our time needs to be spent on live targets, not on something that’s no longer prevalent.
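
To see what those proportions mean in practice, here is a quick worked example. The 1,000 events per day is a made-up figure, "about half" is taken as exactly 50%, and "less than half" is treated as an upper bound of 50%; only the 28% comes directly from the findings above.

```python
# Hypothetical daily volume; only the 28% figure comes from the text above.
events_per_day = 1000
reviewed_pct = 50        # "about half" are reviewed (assumed exact)
legitimate_pct = 28      # of reviewed events, deemed legitimate
remediated_pct_max = 50  # "less than half" remediated (upper bound)

reviewed = events_per_day * reviewed_pct // 100
legitimate = reviewed * legitimate_pct // 100
remediated_max = legitimate * remediated_pct_max // 100

print("Reviewed:", reviewed)
print("Legitimate:", legitimate)
print("Remediated (at most):", remediated_max)
print("Never reviewed:", events_per_day - reviewed)
```

Under these assumptions, 500 events are never looked at, and of the 140 confirmed as legitimate, at least 70 are left unremediated.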


Gaining a view to adversary activity

In the report's introduction, Cisco covers the strategies that adversaries use today. These include taking advantage of poor patching practices, social engineering, and malware delivery through legitimate online content, such as advertising. I personally feel that you can't defend your network properly unless you know how you’re being attacked. I suppose you could look at it this way. Here in the United States, football is one of the most popular sports. It’s common practice for a team to study films of their opponents before playing them. This allows them to adjust their offensive and defensive game plan ahead of time. The same should be true for security professionals. We should be prepared to adjust to threats, and reviewing Cisco’s security report is similar to watching those game films.


In the security report, Cisco breaks down the most commonly observed malware by the numbers. It also discusses how attackers pair remote access malware with exploits in deliverable payloads. Some of what I gleaned from the report shows that the methods being used are the same as what was brought out in previous reports, with some slight modifications.


My take

From my point of view, the attacks are sophisticated, but not in a way that’s earth shattering. What I get from the report is that the real issue is that there are too many alerts from too many security devices, and security people can't sort through them efficiently. Automation is going to play a key role in security products. Until our security devices are smart enough to distinguish noise from legitimate attacks, we’re not going to be able to keep up. However, reading reports like this can better position our security teams to look in the right place at the right time, cutting down on some of the breaches we see. So, to make a long story short, be sure to read up on the Cisco Annual Security report. It’s written well, loaded with useful data, and helps security professionals stay on top of the security landscape.

In our pursuit of Better IT, I bring you a post on how important data is to functional teams and groups. Last week we talked about anti-patterns in collaboration, covering things like data mine-ing and other organizational dysfunctions. In this post we will be talking about the role shared data, information, visualizations, and analytics play in helping ensure your teams can avoid all those missteps from last week.


Data! Data! Data!

These days we have data. Lots and lots of data. Even Big Data, data so important we capitalize it! As much as I love my data, we can't solve problems with just raw data, even if we enjoy browsing through pages of JSON or log data. That's why we have products like Network Performance Monitor (NPM), Server & Application Monitor (SAM), and Database Performance Analyzer (DPA) to help us collect and parse all that data. Each of those products has specialized metrics it collects, meaning it applies to them, and visualizations that help specialized sysadmins leverage that data. These administrators probably don't think of themselves as data professionals, but they are. They choose which data to collect, which levels to be alerted on, and which to report upon. They are experts in this data and they have learned to love it all.

Shared Data about App and Infrastructure Resources

Within the SolarWinds product solutions, data about the infrastructure and application graph is collected and displayed on the Orion Platform. This means that cross-team admins share the same set of resources and components and the data about their metrics. Now we have PerfStack, with features to do cross-team collaboration via data. We can see entities we want to analyze, then see all the other entities related to them. This is what I call the Infrastructure and Application Graph, which I'll be writing about later. After choosing Entities, we can discover the metrics available for each of the entities and choose the ones that make the most sense to analyze based on the troubleshooting we are doing now.




Metrics Over Time


Another data feature that's critical to analyzing infrastructure issues is the ability to see data "over time." It's not enough to know how CPU is doing right now; we need to know what it was doing earlier today, yesterday, last week, and maybe even last month, on the same day of the month. By having a view into the status of resources over time, we can intelligently make sense of the data we are seeing today. End-of-month processing going on? Now we know why there might be a slight spike in CPU pressure.
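
Here is a minimal sketch of that "compared to when?" question, using made-up CPU readings. The `compare_to_history` function, the two lookback periods, and the sample values are assumptions for illustration only; a same-day-last-month comparison would need extra handling for months of different lengths.

```python
from datetime import datetime, timedelta

# Assumed lookback periods for the comparison.
LOOKBACKS = {
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
}

def compare_to_history(readings, when):
    """Return current-minus-past deltas for each lookback that has data."""
    current = readings[when]
    deltas = {}
    for label, offset in LOOKBACKS.items():
        past = readings.get(when - offset)
        if past is not None:
            deltas[label] = current - past
    return deltas

# Hypothetical CPU utilization (%) sampled at 14:00.
cpu = {
    datetime(2017, 3, 31, 14): 85.0,  # today, end of month
    datetime(2017, 3, 30, 14): 40.0,  # yesterday
    datetime(2017, 3, 24, 14): 42.0,  # one week ago
}

deltas = compare_to_history(cpu, datetime(2017, 3, 31, 14))
for label, delta in deltas.items():
    print(f"CPU is {delta:+.1f} points versus {label}")
```

A reading of 85% means little by itself; seeing that it is 45 points above yesterday is what tells us something changed.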


Visualizations and Analyses


The beauty of PerfStack is that by choosing these Entities and metrics we can easily build data visualizations of the metrics and overlay them to discover correlations and causes. We can then interact with the information we now have by working with the data or the visualizations. By overlaying the data, we can see how statuses of resources are impacting each other. This collaboration of data means we are performing "team troubleshooting" instead of silo-based "whodunits." We can find the issue, which until now might have been hiding in data in separate products.




So we've gone from data to information to analysis in just minutes. Another beautiful feature of PerfStack is that once we've built the analyses that show our troubleshooting results, we can copy the URL, send it off to team members, and they can see the exact same analysis -- complete with visualizations -- that we saw. If we've done similar troubleshooting before and saved projects, we might be doing this in seconds.


This is often hours, if not days, faster than how we did troubleshooting in our previous silo-ed, data mine-ing approach to application and infrastructure support. We accomplished this by having quick and easy access to shared information that united differing views of our infrastructure and application graph.


Data -> Information -> Visualization -> Analysis -> Action


It all starts with the data, but we have to love the data into becoming actions. I'm excited about this data-driven workflow in keeping applications and infrastructure happy.

Needles and haystacks have a storied past, though never in a positive sense. Troubleshooting network problems comes as close as any to the process to which that pair alludes. Sometimes we just don't know what we don't know, and that leaves us with a problem: how do we find the information we're looking for when we don't know what we're looking for?



The geeks over at SolarWinds obviously thought the same thing and decided to do something to make life easier for those hapless souls frequently finding themselves tail over teakettle in the proverbial haystack; that product is PerfStack.


PerfStack is a really cool component piece of the Orion Platform as of the new 12.1 release. In a nutshell, what it allows you to do is to find all sorts of bits of information that you're already monitoring, and view it all in one place for easy consumption. Rather than going from this page to that, one IT discipline-domain to another, or ticket to ticket, PerfStack gives you more freedom to mix and match, to see only the bits pertinent to the problem at hand, whether those are in the VOIP systems, wireless, applications, or network. Who would have thought that would be useful, and why haven't we thought of that before?




In and of themselves, those features would be a welcome addition to the Orion suite--or any monitoring suite, for that matter--but SolarWinds took it one step further and designed PerfStack in such a way that you can create your own "PerfStacks" on the fly, as well as pass them around for other people to use. Let's face it, having a monitoring solution with a lot of canned reporting, stuff that just works right out of the box, is a great thing, but having the flexibility to create your own reports at a highly granular level is infinitely better. Presumably you know your environment better than the team at SolarWinds, or me, or anyone else. You shouldn't be forced into a modality that doesn't fit your needs.


Passing dashboards ("PerfStacks") around to your fellow team members, or whomever, is really a key feature here. Often we have a great view to the domain we operate within, whether that's virtualization, applications, networking, storage; but we don't have the ability to share that with other people. That's certainly the case with point products, but even when we are all sharing the same tools it's not historically been as smooth a process as it could be. That's unfortunate, but PerfStack goes a long way toward breaking through that barrier.




There are additional features to PerfStack that bear mentioning: real-time updates to dashboards without redrawing the entire screen, saving of the dashboards, importing in real time of new polling targets/events, etc. I will cover those details next time, but what we've talked about so far should be enough to show the value of the product. SolarWinds doesn't seem to believe in tiny rollouts. They've come out of the gates fast and strong with this update, and with good reason. It really is a great and useful product that will change the way you look at monitoring and troubleshooting.
