Success. It marks the subtle difference between being productive and being busy. WordStream and MobileMonkey founder Larry Kim eloquently wrote about the 11 differences between busy people and productive people in a recent Inc. article. It is a great read that offers an interesting take on productivity. For instance, one of the eleven differences that Larry calls out is that productive people have a mission for their lives, while busy people simply look like they have a mission. The key is to correctly identify your purpose and the corresponding work that will fulfill your life's mission. There is no template for your mission because only you can define those core policies. Otherwise, it's someone else's mission. In the latter instance, you have less understanding of what "good" should look like, so you will be less efficient and effective in your work.


So how do the productive versus busy insights play out in IT environments? Let's take the example of automation, one of the DART-SOAR skills. Many pundits believe that automation's objective is to save time--to do more stuff. This is what it means to be busy. In actuality, automation's true aim is not to save time, but rather to improve consistency of delivery, reliability of delivered services, and a normalized experience at scale. This exemplifies what it means to translate automation into productivity and deliver meaningful value.


Translating your skills, experience, and expertise into business value is how you make your career as a professional. Without business value, you won't have value.


Let me know what you think below in the comment section.

Logic and objective thinking are hallmarks of any engineering field. IT design and troubleshooting are no exceptions. Computers and networks are systems of logic, so we, as humans, have to think in such terms to effectively design and manage these systems. The problem is that the human brain isn’t exactly the most efficient logical processing engine out there. Our logic is often skewed by what are called cognitive biases. These biases take many potential forms, but ultimately they skew our interpretation of information in one way or another. This leaves us believing we are approaching a problem logically, when in reality we are operating on a distorted view of it.


What am I talking about? Below are some common examples of cognitive biases that I see all the time as a consultant in enterprise environments. This is by no means a comprehensive list. If you want to dig in further, Wikipedia has a great landing page with brief descriptions and links to more comprehensive entries on each.


Anchoring: Anchoring is when we value the information we learn first as the most important, with subsequent learned information having less weight or value. This is common in troubleshooting, where we often see a subset of symptoms before understanding the whole problem. Unless you can evaluate the value of your initial information against subsequent evidence, you’re likely to spin your wheels when trying to figure out why something is not working as intended.


Backfire effect: The backfire effect is what happens when someone invests further in an original idea or hypothesis, even after new evidence disproves the initial belief. Some might call this pride, but ultimately no one wants to be wrong, even when being wrong is justifiable because not all the evidence was available when the original opinion was formed. I’ve seen this clearly demonstrated in organizations that have a blame-first culture. Nobody wants to be left holding the bag, so there is more incentive to be right than to solve the problem.


Outcome bias: This bias is our predisposition to judge a decision based on the outcome, rather than on how logical the decision was at the time it was made. I see this regularly from insecure managers who are looking for reasons why things went wrong. It plays a big part in blame culture. This can lead to decision paralysis when we are judged by outcomes we can’t control, rather than by how methodically we worked through an unknown root cause.


Confirmation bias: With confirmation bias, we search for, and ultimately give more weight to, evidence that supports our original hypothesis or belief about the way things should be. This is incredibly common in all areas of life, including IT decision making. It reflects more on our emotional need to be right than any intentional negative trait.


Reactive devaluation: This bias is when you devalue or dismiss an opinion not on merit, but because it came from an adversary or someone you don’t like. I’m sure you’ve seen this one, too. It’s hard to admit when someone you don’t respect is right, but by not doing so, you may be dismissing relevant information in your decision-making process.


Triviality/Bike shedding: This occurs when extraordinary attention is applied to an insignificant detail to avoid having to deal with the larger, more complex, or more challenging issue. By deeply engaging in a triviality, we feel like we provide real value to the conversation. The reality is that we expend cycles of energy on things that ultimately don’t need that level of detail applied.


Normalcy bias: This is a refusal to plan for or acknowledge the possibility of outcomes that haven’t happened before. This is common when thinking about DR/BC because we often can’t imagine or process things that have never occurred. Our brains immediately work to fill in gaps based on our past experiences, leaving us blind to potential outcomes.


I point out the above examples just to demonstrate some of the many cognitive biases that exist in our collective way of processing information. I’m confident that you’ve seen many of them demonstrated yourself, but ultimately, they continue to persist because of the most challenging bias of them all:


Bias blind spot: This is our tendency to see others as more biased than ourselves, and to miss the cognitive biases in our own actions and decision making. It’s the main reason many of these persist even after we learn about them. Biases are often easy to identify when others demonstrate them, but we often can’t see our own when our thinking is being affected by a bias like those above. The only way to identify our own biases is through an honest and self-reflective post mortem of decision making, looking specifically for areas where our bias affected our view of reality.


Final Thoughts


Even in a world dominated by objectivity and logical thinking, cognitive biases can be found everywhere. It’s just one of the oddities of the human condition. And bias affects everyone, regardless of intent. If you’ve read the list above and have identified a bias that you’ve fallen for, there’s nothing to be ashamed of. The best minds in the world have the same flaws. The only way to overcome these biases is to inform yourself of them, identify which ones you typically fall prey to, and actively work against those biases when trying to approach a subject objectively. It’s not the mere presence of bias that is a problem. Rather, it’s the lack of awareness of bias that leads people to incorrect decision making.

The sweet 16 was bitter for some. It looks like we aren’t going to see any Cinderella stories in this year’s bracket battle.  The last of the little guys literally had to face off with a Dragon, and all I can say is don’t play with fire if you don’t want to get burned #amirite.


It’s going to be an uphill battle for our elite 8, no easy matchups here!


Let’s take a look at who came out on top in round 2:



  • Cryptids round 2: Thunderbird vs Yeti - Despite all the cheering coming from the comment section, this was not an easy win for the Thunderbird. tinmann0715 thinks Thunderbird has what it takes to.go.all.the.way! “My prediction is ringing true for the underdog Thunderbird. It is becoming my "Dark Horse" favorite for the rest of the tourney.”
  • Cryptids round 2: Chupacabra vs Loch Ness Monster - Nessie is still in it after a tough match with Chupacabra. It looks like it came down to a battle of “Which one is scarier.” rschroeder: “Nessie vs. Chupacabra.  What's scarier--a fish-eating dinosaur or a man/lizard that interacts harmlessly with goats?  OK, a man/lizard probably will generate more nightmares. Which would defeat the other in battle?  I'm pretty certain Nessie could out-chomp and crush Chupie.  Loch Ness for the win.”
  • Half & Halfs round 2: Griffin vs Pegasus - This one was almost too close to call! Judging by your comments, no one was certain who to root for since these two were so evenly matched.
  • Half & Halfs round 2: Minotaur vs Centaur - Again, another half & halfs match that could have gone either way. asheppard970 said it better than I could: “I think this is coming down to a "brains vs. brawn" battle, and brawn appears to be winning.”
  • Gruesomes round 2: Medusa vs Werewolf - The #GRLPWR was strong with this one. Medusa and her stony stare advance to the next round!
  • Gruesomes round 2: Vampire vs Kraken - Based on the results of this match, the Vampire needs to hire a new PR agent that doesn’t suck. cdow2011: “The kraken and Perseus...still a better love story than Twilight.”
  • Fairy tales round 2: Leprechaun vs Dragon - A true David and Goliath match, but our pint-sized friend from the Emerald Isle was no match for the brute strength and power of the Dragon. caleyjay7: “Dragon: Teeth, Claws, tail-whip, potentially fire breathing... Leprechaun: general tomfoolery and lucky charms...I know who my money's on.”
  • Fairy tales round 2: Banshee vs Phoenix - There was some serious debate about how outside influence from pop-culture affected the results of this match. Phoenix manages to fly away with this one in the face of conflict.


Were you surprised by any of the shutouts or nail-biters for this round? Comment below!


It’s time to check out the updated bracket & start voting in the ‘Fantastical’ round! We need your help & input as we get one step closer to crowning the ultimate legend!


Access the bracket and make your picks HERE>>

It's the last week of March, which means the year is about 25% complete. Time to check in on your New Year's goals and see how you have progressed. There is still time to follow through on those promises you made to yourself.


As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!


12 Things Everyone Should Understand About Tech

Understanding these twelve things is crucial if we want to work together to make tech better for everyone.


Expedia's Orbitz Suspects 880,000 Payment Cards Stolen

“Orbitz says the breached system is not part of its current website.” In other words, they weren’t hacked through the website, they let their data get stolen because they lacked proper internal security measures. But here’s the #hardtruth: They are not unique, many companies fail in this area, they just don’t know it yet.


AVA: The Art and Science of Image Discovery at Netflix

Ever wonder how Netflix decides what images to use? Meet AVA, the brains behind the machine.


Ex-Googler Wants to Upend Pigs and Hotels With the Blockchain

Finally, a practical use case for Blockchain! I can’t tell you how many times I’ve had issues getting quality bacon delivered to my hotel room.


Silicon Valley Has Failed to Protect Our Data. Here’s How to Fix It

I love this idea except for one detail, and that is I don’t want the government to have any part in this effort. They move too slowly, and are often at the behest of lobbyists. Seems like something Bill Gates could put a billion dollars behind and create something more useful than anything Congress would do.


Facebook denies it collects call and SMS data from phones without permission

I want to believe Facebook here, but, well, they haven’t exactly demonstrated that they can be trusted with our data. It’s quite possible that such data was collected, but not in an official capacity. So they can deny they are doing it, which is not the same as saying it never happened.


Ford vending machine begins dispensing cars in China

I love this idea, but I’d love it more if it were full of Jeeps.


Nothing makes a meeting more fun than showing up wearing a Luchador mask:

By Paul Parker, SolarWinds Federal & National Government Chief Technologist


As more agencies adopt the cloud, here's an applicable article from my colleague Joe Kim, in which he suggests achieving a balance between security and efficiency.


Federal agencies are using the cloud more than ever before, but they’re also not ready to abandon the safety and security of their on-premises infrastructures. That’s the message sent by survey respondents who participated in SolarWinds’ 2017 IT Trends Report.


The annual federal IT pulse check—which is based on feedback from public sector IT practitioners, managers, and directors—indicated a marked increase in cloud adoption over the past year. Ninety-six percent of survey respondents stated that they have migrated critical applications and IT infrastructure to the cloud over the past year. The migration was driven by the potential of increased return on investment, cost efficiency, availability and reliability. Fifty-eight percent of survey respondents believe they have received most, if not all, of the benefits they expected from their cloud migrations.


But no one ever said this cloud thing was going to be easy or clear cut. A substantial number of respondents—29 percent—stated they have actually brought applications and infrastructure back on-premises after having initially moved them to the cloud. Their reasons included concerns over security and compliance (45 percent of respondents), poor performance (14 percent), and technical challenges with their migrations (14 percent).


As a result, hybrid IT infrastructures are thriving. Many agencies continue to be uncomfortable with migrating all of their infrastructure or applications to the cloud. The facets of their environments that are security-sensitive, for example, are for the most part remaining on-premises.


There’s no indication that these agencies will be embracing an all-cloud IT infrastructure anytime soon. According to the survey, a large percentage of organizations (37 percent) report hosting one to nine percent of their infrastructures entirely in the cloud, while just one percent of respondents said all of their infrastructures are hosted in the cloud.


Some other interesting points of note:


  • 40 percent of respondents said their organizations spend 70 percent or more of their annual IT budgets on on-premises (traditional) applications and infrastructure
  • 62 percent indicated that the existence of the cloud and hybrid IT have had at least somewhat of an impact on their careers, while 11 percent said they have had a tremendous impact
  • 65 percent said their organizations use up to three cloud provider environments


All of these findings point to some clear recommendations. Managers must implement pervasive monitoring strategies that provide complete visibility into their entire network and all applications, both on-premises and off. Security, compliance, and performance should be just as important as cost efficiencies when considering cloud migration. IT professionals must continue to hone their cloud skills and be open and agile in adopting best-of-breed cloud and hybrid IT elements. And agencies should elect to work with trusted cloud vendors that are willing to provide federal IT professionals with control and visibility over their hosted workloads.


This is just a snapshot of where things stand in 2017. The full report contains a more complete picture of a federal IT world that continues to move to the cloud, but isn’t quite ready to fully commit.


Find the full article on Nextgov.

Anyone else already #BracketBusted?

If you came to root for the underdogs in this year’s bracket battle, you are going to be sorely disappointed.

Across the board the titans of the bracket stomped out the little guys.


Play-In Round: Cerberus vs. Anansi

Dog v spider, sounds like the title of a viral YouTube video, no? In the end, the arachnid didn’t stand a chance against the hound of Hades.


Let’s take a look at how our legends fared in round 1, shall we?





  • Cryptids Round 1: Yeti vs. Bigfoot - This one really could have gone either way as these two opponents were the most equally matched pair of round 1. The abominable snowman left our forested friend with frostbite and manages to roll into round 2.
  • Half & Halfs Round 1: Hippogriff vs. Pegasus - This half & halfs match-up was nearly split 50/50! It came down to a photo finish in favor of Pegasus!
  • Half & Halfs Round 1: Manticore vs. Minotaur - rschroeder's commentary explains where it all went downhill for the Manticore: “Manticore is the bigger threat.  But the Minotaur has more terror and less horror, given it's half man / half male bovine, lives in a maze that you'd never find your way out of before it got you, and the depictions I've seen have all imagined the Minotaur's Labyrinth to be all in the dark.  There's nothing quite like knowing there's a scary thing in the dark hunting you to increase your terror . . .”
  • Fairy Tales Round 1: Troll vs. Banshee - I really can’t argue with zennifer’s statement on this one “Yeah.. you need to be afraid of anything with a SHE in its name!” Though this was a close call, the Banshee earned her victory shrieks this round.


Were you surprised by any of the shutouts or nail-biters for this round? Comment below!


It’s time to check out the updated bracket & start voting in the ‘Unbelievable’ round! We need your help & input as we get one step closer to crowning the ultimate legend!


Access the bracket and make your picks HERE>>

One of the biggest complaints you'll likely hear after moving to Office 365 will be about its speed, or lack thereof. Now that data isn't sitting on your LAN, there is lots of room for latency to hit your connection. There’s no doubt that your users will alert you to the problem in short order. So what can you do about it?




If speed is the primary concern, one of the first things you should do is get a baseline. If someone is complaining that performing a task is slow, how long is it taking? Minutes? Seconds? When it comes to making improvements, you need a way to ensure that changes are having a positive impact. In the case of Skype for Business, Microsoft actually has a tool to help assess your network.
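As a rough illustration of what a baseline can look like, here is a minimal Python sketch that times a task several times and boils the results down to a few numbers you can record. The function names and the stand-in task are hypothetical; you would swap in whatever operation your users say is slow.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_task(task: Callable[[], None], runs: int = 5) -> List[float]:
    """Run the task several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce raw timings to a few numbers worth recording as a baseline."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Hypothetical stand-in for a slow user action (for example, opening a mailbox).
baseline = summarize(time_task(lambda: time.sleep(0.01)))
print(baseline)
```

Recording the same summary before and after a change gives you something concrete to compare, rather than relying on "it feels faster."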


Along with speed, you'll want to be able to figure out where the problem lies. Now that large amounts of your data are in the cloud, you'll have a lot more WAN traffic. Be sure to check your perimeter devices. With the increased volumes, these could easily be your bottlenecks. If the congestion lies past your perimeter, you can take a look at Azure ExpressRoute. Using this, you can create a private connection to Azure data centers, for a price.




Although speed will likely be one of the first and loudest complaints, you'll also want to monitor availability. Microsoft offers service dashboards when you log into the portal, but you should also consider third-party monitoring solutions. Some of these solutions can regularly check SMTP to make sure it is accepting mail, or routinely make sure that your DNS is properly configured.


Routine checks like these can help keep the environment healthy. The benefit of going with a service for these sorts of checks is that they can alert you fairly quickly. Also, you won't need to remember to actually do it yourself. Be sure to know what your SLA terms are as well—depending on what sort of downtime you are seeing, you may qualify for credits.
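To make the idea concrete, here is a small Python sketch of the two checks mentioned above: does a hostname resolve, and does the mail server greet new connections. It uses only the standard library. The helper names are my own, and a real monitoring service does far more (alerting, history, retries), so treat this as an assumption-laden sketch rather than a replacement for one.

```python
import socket


def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def banner_is_ready(banner: str) -> bool:
    """An SMTP server that is ready for mail greets clients with a 220 reply."""
    return banner.startswith("220")


def smtp_accepting(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Connect to the SMTP port and check the greeting line."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(1024).decode("ascii", errors="replace")
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
    return banner_is_ready(banner)


# Example usage (mail.example.com is a placeholder, not a real endpoint):
#   dns_resolves("mail.example.com")
#   smtp_accepting("mail.example.com")
```

Run on a schedule, checks like these only tell you something is wrong; you still need a plan for who gets notified.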




Office 365 is a ripe target for hackers, plain and simple. Phishing attempts are the perfect attack vector because users might be used to logging in with their credentials on a regular basis. The point I’m getting at is that you’ll want to make sure you consider security when putting together a monitoring plan.


Office 365 has a Security and Compliance Center, which is a great place to start securing your environment. You can define known IP ranges, or audit user mailbox logins and the IP addresses they came from. Once again, there are plenty of third-party services that can yield additional reporting that isn't available "out of the box" (or should that be "out of the cloud").




In smaller environments, a lot of folks wear multiple hats. Reporting tools can quickly get folks the information they need. In larger environments, there are usually multiple teams involved. Similar to a point made in an earlier post, knowing who should be aware of problems is key. This also applies to users. If your monitoring tells you that a large portion of your users' mailboxes are offline, what's your plan to alert them?


Being able to monitor your environment's health is one thing, but taking actions is another. This doesn't just apply to Office 365. Hopefully, these past few posts and the fantastic comments from the community have helped with planning out a smooth migration. But don’t forget to also plan for disaster.

It’s March Madness time here in the USA. I love this time of year. Not just because I have a former life as a player and coach, but because here in the northeast it is that time when winter finally gives way to spring.


As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!


Project Nimble: Region Evacuation Reimagined

This one provides insight into how Netflix was able to cut their failover time from 50 minutes down to 8. Next time someone tells me that they have too much data to fail over quickly, I’m just going to point them to this post.


Microsoft’s Adding new Data Centers in Europe and the Middle East

Slowly adding more capacity and offering local data storage in regions that have strict laws. Microsoft is buying entire data centers from companies and converting them to Azure in the biggest "lift and shift" imaginable.


The Reason Software Remains Insecure

Short and direct to the point. This is why we can’t have nice things.


Facebook confirms Cambridge Analytica stole its data; it’s a plot, claims former director

This story is going to get ugly, and fast. It’s also a nice reminder that the most valuable asset anyone can have is not money, it’s data. Facebook owns a lot of data, offered up for free from users. The same users that annoy you with Farmville invites.


To find suspects, police quietly turn to Google

You know who else has a lot of data? Google, that’s who. Your phone will track you even when you think it is not, and the police can get data from Google to find out who was near the scene of a crime. This isn’t Minority Report stuff yet, but it is getting us closer.


Your Data is Being Manipulated

And since I’m clearly on a data-centric theme this week, let me share with you this article and the following quote: “In short, I think we need to reconsider what security looks like in a data-driven world.” Yes, yes we do.


Google plans to boost Amazon competitors in search shopping ads

Competition is good, right? The best part about this is the fact that Google is being open about what they are doing, which is different than what they have done in the past.


My high school made it to their first-ever state championship basketball game this past Saturday. There was no question I would attend wearing my old varsity jacket:

So, I wanted to at least touch base with everyone on the “scandal” of the week. Is it fake news? New ways for stock gouging? New ransom type embankments? Corporate espionage?


I waited until at least some of the dust had settled to write this post. I wanted to be able to make accurate judgment calls and present a level-headed offering of thoughts and ideas. Here they are:


  1. Yes, there are security flaws (over a dozen) within these processors.
  2. No, at this time they are not mission critical, because exploiting them requires physical access AND administrator/root credentials.
  3. The lab that disclosed these security flaws had stock associated with its findings.
  4. They only gave AMD 24 hours to resolve the issue before they made the flaws public.


People are still discussing the processor story, so consider this an up-to-date discussion. Let it also be a friendly reminder that we have to check the general “sky is falling” mentality, especially in security. Key takeaway? Focus on best practices.



We should strive to have due diligence on the risk, determine appropriate measures to respond, and showcase the balance between risk and business as usual.


Since I believe you can benefit from them, here are my top three security practices:


Infrastructure monitoring

Determining baselines winds up bringing incredible value to any organization, department, and technology as a whole. The importance and power of baselines sometimes get overlooked, and that saddens me. It is all too common for folks to wait until after they experience an incident to set up monitoring. That is simply a reaction, not a proactive approach.


Once you begin monitoring, you can start comparing solutions to risk. This is how you can test solutions to risks and vulnerabilities before you go full on “PLAID” mode (Spaceballs reference. #sorrynotsorry), only to find that you have created a larger issue than the risk itself. Comparative reporting is an excellent way to prove that you have done your due diligence in understanding the impact of the threat and the solution as a whole.
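Here is a minimal Python sketch of that kind of comparative reporting: take a metric you were already monitoring, capture it again with the proposed fix applied in a test environment, and see how far it moved. The sample numbers and the 10 percent tolerance are made up for illustration.

```python
import statistics
from typing import List


def percent_change(baseline: List[float], candidate: List[float]) -> float:
    """Percent change of the candidate median relative to the baseline median."""
    base = statistics.median(baseline)
    return (statistics.median(candidate) - base) / base * 100.0


def within_tolerance(
    baseline: List[float], candidate: List[float], tolerance_pct: float = 10.0
) -> bool:
    """True if the candidate is no more than tolerance_pct slower than baseline."""
    return percent_change(baseline, candidate) <= tolerance_pct


# Hypothetical response times in milliseconds, before and after a mitigation.
before_fix = [100.0, 110.0, 105.0]
after_fix = [130.0, 126.0, 140.0]

print(round(percent_change(before_fix, after_fix), 1))
print(within_tolerance(before_fix, after_fix))
```

If the fix costs more than the risk it removes, a report like this is how you show that to the business before you roll it out everywhere.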


Threat management policies

You should determine a policy that addresses ways to deal with threats, vulnerabilities, and concerns immediately and openly.  It should live where everyone can access it, and be clearly outlined so everyone knows what is happening even before you have the solution. This helps to stop or at least slow down management fire alarms, universally expressed as, “What are we going to do NOW?”


The policy should include a timeline of events that everyone can understand. For example, let everyone know that there will be an email update outlining next steps within 48 hours of the incident. In other words, you are telling everyone, “Hey, I’m working on the issue and I’ll make sure I update you. In the meantime, I’m doing my due diligence to make sure the outcome is beneficial for our company.”


Asset Management

You can't quickly assess your infrastructure if you are not aware of everything you manage, period.


There is power in knowing what you are managing in many realms, but my first go-to is asset reports. I need to know quickly what could, and more importantly what could not, be associated with any new threats, concerns, or vulnerabilities.


The types of tools that allow me to monitor and update my assets give me much needed insight into where my focus should be, which is why I go there first. Doing so ensures that I won’t be distracted or overwhelmed by data points that aren’t relevant.
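As a simple sketch of what I mean, assume your asset report can be exported as a list of records with a CPU vendor field (the field names and inventory below are hypothetical). Splitting the list into "could be affected" and "could not be affected" takes only a few lines of Python:

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Asset:
    name: str
    cpu_vendor: str


def split_by_exposure(
    assets: Iterable[Asset], affected_vendors: Set[str]
) -> Tuple[List[Asset], List[Asset]]:
    """Separate assets that could be affected from those that could not."""
    affected = {vendor.lower() for vendor in affected_vendors}
    exposed: List[Asset] = []
    clear: List[Asset] = []
    for asset in assets:
        if asset.cpu_vendor.lower() in affected:
            exposed.append(asset)
        else:
            clear.append(asset)
    return exposed, clear


# Hypothetical inventory; a real one would come from your asset report.
inventory = [
    Asset("web-01", "AMD"),
    Asset("db-01", "Intel"),
    Asset("app-02", "amd"),
]
exposed, clear = split_by_exposure(inventory, {"AMD"})
print([a.name for a in exposed])
print([a.name for a in clear])
```

The second list matters as much as the first, because it is everything you can stop worrying about for this particular threat.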


Finally, the responsibility of tracking and understanding any types of threat should be proactive and fully vetted. We should want to understand the issues before we blindly implement Band-Aids that can, potentially, hinder our business goals.


Using information to better the security within our organizations also brings us into the fabric of the business, assisting efforts to keep business costs low.


I hope you join this conversation because there are several touch points here. I’m very curious to hear your thoughts, comments, and opinions. For example, did you believe, when the flaws were released, that they were a form of ransom? Do you see other opportunities to manhandle a company’s earnings by highlighting exploits for others’ gain? Or, maybe you just sit back, watch the news with a scotch in your hand, and laugh.


Let's talk this over, shall we?





With the influx of natural disasters, hacks, and increasingly common ransomware, being able to recover from a disaster is quickly moving up the priority list for IT departments across the globe. In addition to awareness, we are seeing our data centers move from a very static deployment to an ever-changing environment. Each day we see more and more applications getting deployed, either on-premises or in the cloud, and each day we, as IT professionals, have a duty to ensure that when disaster strikes we can recover these applications. Without the proper procedures in place to consistently update our DR plans, no matter how well-crafted or detailed they are, the confidence in completing successful failovers decreases. So what now?


We’ve already discussed the first step in our DR process: creating our plan. We’ve also touched on the second step, which is to make it a living document to accommodate for data center change. But there is one more step we need to put in place for a successful failover, and that's testing. It boosts the confidence in the IT department and the organization as a whole.


Testing our DR plan - We learn by doing!


When thinking of DR plan testing, I always like to compare it to a child. I know, a weird analogy, but if we think about how children learn and get better, it begins to make sense. Children learn by doing; they learn to talk by talking, learn to play sports by playing, etc. The point is that by “walking the walk,” we tend to improve ourselves. The same applies to our DR plans. We can have as many details and processes laid out on paper as we want, but if we can't restore when we need to, we've failed. Essentially, our DR plans are set up for success by also walking the walk, aka testing.


Start small, get bigger!


I’m not recommending going and pulling the plug on your data center tomorrow to see if your plan works. That would certainly be a career-changing move. Instead, you should start small. Take a couple of key services as defined in your DR plan and begin to draft a plan for testing a failover of the components and servers contained within them. Just as when creating the DR plan, details and coordination are the key to success when creating the testing plan. Know exactly what you are testing for. Don’t simply count the servers having booted as a success. Instead, go deeper. Can you log into the application? Can you use the application? Can a member of the department that owns the application sign off stating that it is indeed functioning normally? By knowing exactly what the end goal is, you can sign off on a successful test or, on the flip side, take the failures that occurred, learn from them, update your plan to reflect any changes, and be prepared for the next testing cycle.
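One way to keep yourself honest about "going deeper" is to write the success criteria down as explicit checks and only call the test a pass when every one of them is met. Here is a minimal Python sketch of that idea; the class, the service name, and the check names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailoverTest:
    service: str
    # Maps a human-readable check to whether it passed.
    checks: Dict[str, bool] = field(default_factory=dict)

    def failures(self) -> List[str]:
        """Names of the checks that did not pass."""
        return [name for name, ok in self.checks.items() if not ok]

    def passed(self) -> bool:
        """A test with no recorded checks is not a pass."""
        return bool(self.checks) and not self.failures()


test = FailoverTest(
    service="payroll",
    checks={
        "servers booted": True,
        "user can log in": True,
        "user can complete a transaction": True,
        "application owner signed off": False,
    },
)
print(test.passed())
print(test.failures())
```

The list of failures is what feeds back into the plan before the next testing cycle.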


Once you have a couple of services defined, go ahead and begin to integrate more, and ensure that recurring time has been set aside and defined within the DR plan to carry out these tests. A full-scale DR test is not something that can be performed on a regular basis, but we can carry out smaller tests on a monthly or quarterly basis. Without a consistent schedule and attention to detail, we can almost guarantee that items like configuration drift will soon creep in and cause our DR testing to fail, or worse, our DR execution to fail.
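Configuration drift is easy to picture with a small example. Assuming you can export the settings you care about from the production system and from its recovery copy as simple key/value pairs (the keys and values below are invented), a Python sketch to list the differences might look like this:

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"


def find_drift(
    primary: Dict[str, Any], recovery: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return settings that differ, as {key: (primary_value, recovery_value)}."""
    drift = {}
    for key in sorted(set(primary) | set(recovery)):
        p_value = primary.get(key, MISSING)
        r_value = recovery.get(key, MISSING)
        if p_value != r_value:
            drift[key] = (p_value, r_value)
    return drift


# Hypothetical settings for a production server and its DR counterpart.
production = {"os_patch": "2018-03", "memory_gb": 32, "dns": "10.0.0.2"}
dr_site = {"os_patch": "2018-01", "memory_gb": 32}

for setting, (prod, dr) in find_drift(production, dr_site).items():
    print(f"{setting}: production={prod} dr={dr}")
```

Running a comparison like this as part of each scheduled test turns drift from a surprise during a failover into a line item you fix beforehand.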


I’ve mentioned before that not keeping our DR plans up to date is perhaps one of the biggest flaws in the whole DR process. However, not applying a consistent testing plan trumps this. Disaster Recovery, in my opinion, cannot be classified as a project. It cannot have an end date and a closing. We must always ensure, when deploying new services and changing existing applications, that we revisit the DR plan, updating both the process of recovering and the process for testing said recovery. Testing our DR plan is a key component in ensuring that all the work we have done in creating our plan will be successful when the plan is most needed. Let’s face it. A failed recovery will put a blemish on the entire DR planning process and all the work that has gone into it. Test and test often to make sure this doesn't happen to you.


I’d love to hear from all of you regarding how you go about testing, or if you even do? Are there any specific starting points for tests that you recommend? Do you start small and then expand? Do you utilize any specific pieces of software, resources or tools to help test your recovery? If you do test, how often? And finally, let’s hear those horror/success stories of any incidents gone bad (or extremely well) as it relates directly to your DR testing procedures. Thanks for reading!

By Paul Parker, SolarWinds Federal & National Government Chief Technologist


We all know that security concerns go hand in hand with IoT. Here's an interesting article from my colleague Joe Kim, in which he suggests ways to overcome the challenges.


Agencies should not wait on IoT security


The U.S. Defense Department is investing heavily to leverage the benefits provided by the burgeoning Internet of Things (IoT) environment.


With federal IoT spending already hitting nearly $9 billion in fiscal year 2015, according to research firm Govini, it’s a fair bet that IoT spending will continue to increase, particularly considering the department’s focus on arming warfighters with innovative and powerful technologies.


Security risks exist that must not be overlooked. An increase in connected devices leads to a larger and more vulnerable attack surface offering a greater number of entry points for bad actors to exploit.


While the BYOD wave might have been good prep for a connected future, the IoT ecosystem will make managing smartphones and tablets seem like child’s play. To quote my colleague Patrick Hubbard, “IoT is a slowly rising tide that will eventually make IoT accommodation strategies pretty quaint.” That’s because we are talking about many proprietary operating systems that will need to be managed individually.


DHS has acknowledged the problems that the IoT presents and the opportunity to address security challenges. Furthermore, the DoD is making significant strides to fortify the government’s IoT deployments. In addition to DoD’s overall significant investment in wireless devices, sensors and cloud storage, the NIST has issued an IoT model designed to provide researchers with a better understanding of the ecosystem and its security challenges.


The government IoT market remains very much in its nascent stage. While agencies might understand its promise and potential, the true security ramifications must still be examined. One thing’s for certain: Agency IT administrators must fortify their networks now.


A good first step toward meeting the security challenges is user device tracking, which lets administrators closely monitor devices and block rogue or unauthorized devices that could compromise security. With this strategy, administrators can track endpoint devices by media access control (MAC) and internet protocol (IP) addresses, and trace them to individual users.
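
As a rough illustration of the tracking idea, the sketch below compares devices observed on a network against an approved list keyed by MAC address. All addresses and user names are made up, and a real deployment would pull both lists from switch, wireless controller, or DHCP data rather than hard-coding them.

```python
def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' separators so formats compare equal."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def find_rogue_devices(seen: list, approved: dict) -> list:
    """Return (mac, ip) pairs seen on the network that are not on the approved list."""
    approved_macs = {normalize_mac(mac) for mac in approved}
    return [(normalize_mac(mac), ip) for mac, ip in seen
            if normalize_mac(mac) not in approved_macs]


# Hypothetical data: approved devices mapped to their owners, and devices observed.
approved_devices = {"00-1A-2B-3C-4D-5E": "jsmith", "00:1a:2b:3c:4d:5f": "mlee"}
observed = [("00:1A:2B:3C:4D:5E", "192.0.2.10"), ("DE:AD:BE:EF:00:01", "192.0.2.77")]

print(find_rogue_devices(observed, approved_devices))
```

Because MAC addresses can be spoofed, a check like this is a starting point for investigation and not proof of identity.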


In addition to tracking the devices themselves, administrators also must identify effective ways to upgrade the firmware on approved devices, which can be an enormous challenge. In government, many firmware updates are still executed through a manual process.


Simultaneously, networks eventually must be able to self-heal and remediate security issues within minutes instead of days, significantly reducing the damage hackers can cause. NSA, DHS, and Defense Advanced Research Projects Agency have been working on initiatives, some of which are well underway.


While the challenges of updates and remediation are being addressed, administrators must devise an effective safety net to catch unwanted intrusions. That’s where log and event management come into play. Systems automatically can scan for suspicious activity and actively respond to potential threats by blocking internet protocol addresses, disabling users, and barring devices from accessing an agency’s network. Log and event management provide other benefits, including insider threat detection and real-time event remediation.


Regardless of its various security challenges, the IoT has great promise for the Defense Department. The various connections, from warfighters’ uniforms to tanks and major weapons systems, will provide invaluable data for more effective modern warfare.


Find the full article on SIGNAL.

They lurk in the shadows, they creep in the dark

You may hear them shriek, howl, grunt, or bark

Fact or fiction, it’s hard to be sure

If these creatures are caught on camera, they’re only a blur

Their stories have been told for hundreds of years

Each one a lesson that forces you to confront your fears

Now it’s your turn to vote and decide forevermore

Who should be crowned the most legendary of all folklore?


Starting today, 33 of the most mythical creatures will battle it out until only one remains and reigns supreme as the ultimate legend.


The starting categories are as follows:


  • Cryptids
  • Half & Halfs
  • The Gruesomes
  • Fairy Tales


We picked the starting point and initial match-ups; however, just like in bracket battles past, it will be up to the community to decide who they think is the most legendary contestant.


*NEW* Submit your bracket: To up the ante this year, we’re giving you a chance to earn 1,000 bonus THWACK points if you correctly guess the final four bracket contestants. To do this, you’ll need to go to the personal bracket page and select your pick for each category. Points will be awarded after the final four are revealed.



Bracket battle rules:


Match-up analysis:

  • For each urban legend, we’ve provided reference links to wiki pages—to access these, just click on their name on the bracket
  • A breakdown of each match-up is available by clicking on the VOTE link
  • Anyone can view the bracket and match-ups, but in order to vote or comment, you must have a THWACK® account



Voting:

  • Again, you must be logged in to vote and trash talk
  • You may vote ONCE for each match-up
  • Once you vote on a match, click the link to return to the bracket and vote on the next match-up in the series
  • Each vote earns you 50 THWACK points. If you vote on every match-up in the bracket battle, you can earn up to 1,550 points



Campaigning:

  • Please feel free to campaign for your favorite legends and debate the match-ups via the comment section (also, feel free to post pictures of bracket predictions on social media)
  • To join the conversation on social media, use hashtag #SWBracketBattle
  • There is a PDF printable version of the bracket available, so you can track the progress of your favorite picks



Schedule:

  • Bracket release is TODAY, March 19
  • Voting for each round will begin at 10 a.m. CDT
  • Voting for each round will close at 11:59 p.m. CDT on the date listed on the bracket home page
  • Play-in battle opens TODAY, March 19
  • Round 1 OPENS March 21
  • Round 2 OPENS March 26
  • Round 3 OPENS March 29
  • Round 4 OPENS April 2
  • Round 5 OPENS April 5
  • Ultimate legend announced April 11


If you have any other questions, please feel free to comment below and we’ll get back to you!


Who (or what) will be crowned the ultimate legend?

We’ll let the votes decide!


Access the bracket overview HERE>>

In system design, every technical decision can be seen as a series of trade-offs. If I choose to implement Technology A, it will provide a positive outcome in one way but introduce new challenges that I wouldn’t have if I had chosen Technology B. There are very few decisions in systems design that don’t come down to trade-offs like this. This is the fundamental reason why we have multiple technology solutions that solve similar problem sets. One of the most common trade-offs we see is in how tightly, or loosely, technologies and systems are coupled together. While coupling is often a determining factor in many design decisions, many businesses aren’t directly considering the impact of coupling in their decision-making process. In this article, I want to step through this concept, defining what coupling is and why it matters when thinking about system design.


We should start with a definition. Generically, coupling is a term we use to indicate how interdependent individual components of a system are. A tightly coupled system will be highly interdependent, whereas a loosely coupled system will have components that run independently of each other. Let’s look at some of the characteristics of each.


Tightly coupled systems can be identified by the following characteristics:


  • Connections between components in the system are strong
  • Parts of the system are directly dependent on one another
  • A change in one area directly impacts other areas of the system
  • Efficiency is high across the entire system
  • Brittleness increases as complexity or components are added to the system


Loosely coupled systems can be identified by the following characteristics:


  • Connections between components in the system are weak
  • Parts within the system run independently of other parts within the system
  • A change in one area has little or no impact on other areas of the system
  • Sub-optimal levels of efficiency are common
  • Resiliency increases as components are added


So which is better?


Like all proper technology questions, the answer is “It depends!”  The reality is that technologies and architectures sit somewhere on the spectrum between completely loose and completely tight, with both having advantages and disadvantages.


When speaking of systems, efficiency is almost always something we’re concerned about so tight coupling seems like a logical direction to look. We want systems that act in a completely coordinated fashion, delivering value to the business with as little wasted effort or resources as possible. It’s a noble goal. However, we often have to solve for resiliency as well, which logically points to loosely coupled systems. Tightly coupled systems become brittle because every part is dependent on the other parts to function. If one part breaks, the rest are incapable of doing what they were intended to do. This is bad for resiliency.


This is better understood with an example, so let’s use DNS as a simple one.


Generally speaking, using DNS instead of directly referencing IP addresses gives efficiency and flexibility to your systems. It allows you to redirect traffic to different hosts at will by modifying a central DNS record rather than having to change an IP address reference in multiple locations. It also is a great central information repository on how to reach many devices on your network. We often recommend that applications should use DNS lookups, rather than direct IP address references, because of the additional value it provides. The downside is that this name reference now introduces a false dependency. Many of your applications can work perfectly fine without referring to DNS, but by introducing it into them you have tightened coupling between the DNS system and your application. An application that could previously run independently now depends on name resolution, and your application fails if DNS fails.
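
To make the dependency concrete, here is a small sketch of one way an application could loosen its coupling to DNS: try name resolution first, and fall back to a last-known address if the lookup fails. The host name, the fallback address, and the injectable `resolver` parameter are all illustrative assumptions, not a recommendation for every application.

```python
import socket


def resolve_with_fallback(hostname, fallback_ip, resolver=socket.gethostbyname):
    """Resolve hostname through DNS, falling back to a known address on failure.

    Returns (ip_address, used_fallback). The resolver is injectable so the
    failure path can be exercised without breaking real DNS.
    """
    try:
        return resolver(hostname), False
    except OSError:  # socket.gaierror is a subclass of OSError
        return fallback_ip, True


def failing_resolver(hostname):
    """Stand-in resolver that simulates a DNS outage."""
    raise socket.gaierror("simulated DNS failure")


# "app.example.internal" and 192.0.2.25 are placeholders, not real endpoints.
ip, used_fallback = resolve_with_fallback(
    "app.example.internal", "192.0.2.25", resolver=failing_resolver
)
print(ip, used_fallback)
```

Notice that the fallback is itself a trade-off. The application survives a DNS outage, but the hard-coded address is now one more thing that can go stale, which is exactly the inefficiency that loose coupling tends to bring.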


In this scenario you have a decision to make. Do the value and efficiency of adding DNS lookups to your application outweigh the drawback of now needing both systems up and running for your application to work? You can see this is a very simple example, but as we begin layering technology on top of technology, the coupling and dependencies can become both very strong and very hard to actually identify. I’m sure many of you have been in the situation where the failure of one seemingly unrelated system has impacted another system on your network. This is due to hidden coupling, interaction surfaces, and the law of unintended consequences.


To answer the question “Which is better?” again, there is no right answer. We need both. There are times when highly coordinated action is required. There are times when high levels of resilience are required. Most commonly we need both. When designing and deploying systems, coupling needs to be considered so you can mitigate the downsides of each approach while taking advantage of the positives they provide.

Most enterprises rely on infrastructure and applications in the cloud. Whether it’s SaaS services like Office 365, IaaS in AWS, PaaS in Azure, or analytics services in Google Cloud, organizations now rely on systems that do not reside on their infrastructure. Unfortunately, connectivity requirements are often overlooked when the decision is made to migrate services to the cloud. Cloud service providers downplay connectivity challenges, and organizations new to cloud computing don’t know the right questions to ask. 


SaaS: It’s just the Internet

When organizations begin to discuss cloud infrastructure, an early assumption is that all connectivity will simply happen via the internet. While many SaaS services are accessible from anywhere via the internet, large organizations need to consider how new traffic patterns will affect their current infrastructure. For example, Office 365 recommends you plan for 10 TCP port connections per device. You can support, at most, 6,000 devices behind a single IP address. If you have a large network and a small PAT pool for client egress, PAT exhaustion will quickly become a problem.
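
The arithmetic behind those numbers is worth sketching. The 10 ports per device comes from the guidance quoted above. The figure of roughly 60,000 usable source ports per public IP address is my assumption, chosen because it reproduces the 6,000-device ceiling; your NAT platform may reserve more or fewer ports.

```python
import math

PORTS_PER_DEVICE = 10           # planning figure quoted in the text
USABLE_PORTS_PER_IP = 60_000    # assumption: ports left after reserved ranges


def max_devices_per_ip(usable_ports=USABLE_PORTS_PER_IP, per_device=PORTS_PER_DEVICE):
    """Devices one public IP can serve before its source ports run out."""
    return usable_ports // per_device


def pat_pool_size(device_count, usable_ports=USABLE_PORTS_PER_IP, per_device=PORTS_PER_DEVICE):
    """Minimum number of public IPs needed in the PAT pool for device_count devices."""
    return math.ceil(device_count * per_device / usable_ports)


print(max_devices_per_ip())     # 6000
print(pat_pool_size(25_000))    # 5
```

So a 25,000-device network would need at least five addresses in the egress pool under these assumptions, before any allowance for other applications that share the same pool.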


Internet-based SaaS applications make hub-and-spoke networks with centralized internet less efficient. Many WAN solutions use local internet connections to build encrypted tunnels to other sites. You can dramatically reduce network traffic by offloading SaaS applications to a local internet connection instead of backhauling traffic to a centralized data center. However, be mindful of the impact of your security footprint as you decentralize internet access across your organization.


But What About the Data Center?

Invariably, as teams begin to build IaaS and PaaS infrastructure in the cloud, they need access to resources and data that live in an on-premises data center. Most organizations begin with IPsec tunnels to connect disparate resources. Care must be taken when building IPsec tunnels to understand cloud requirements. Many cloud teams assume dynamic routing with BGP over VPN tunnels. In my experience, most network engineers assume static routing over IPsec tunnels. Be sure to have conversations about requirements up front.


When building VPNs to the cloud, throughput can be an issue. Most VPN connections are built on underlying infrastructure with throughput limitations. If you need higher throughput than cloud VPN infrastructure will support, you will need to consider a direct connection to the cloud.


Plug Me In to the Cloud, Please

There are several options to connect directly to the cloud. If you have an existing MPLS provider, most offer services to provide direct connectivity into cloud services. There are technical limitations to these services, however. Pay special attention to your routing and segmentation requirements. MPLS connectivity will likely not be as simple as your provider describes in the sales meeting.


If you do not want to leverage MPLS service to connect to the cloud, you can provision a point-to-point circuit from your premises to a cloud service provider. Cloud services publish ample documentation for direct connections.


Another option is to lease space from a colocation provider that can peer with multiple cloud service providers (CSPs). You provide circuits and hardware that reside in the co-lo, and the co-lo provides peering services to one or more cloud providers. Be aware that each CSP charges a direct connect fee on top of your circuit costs. There may also be data ingress and egress fees.


You Want to Route What on my Network?

Cloud service providers operate their networks with technologies similar to service providers. Many SaaS services are routable only with public IP addresses. For example, if you want to connect to Salesforce, Office 365, or Azure Platform Services, you will need to route their public IP addresses on your internal network to force traffic across direct connect circuits. Network engineers who have always routed internet-facing traffic with a default route injected into their IGP will have to rethink their routing design to get full use of direct connectivity into the cloud.
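
The reason a default route is no longer enough comes down to longest-prefix matching: a more specific route for the provider's public range wins over 0.0.0.0/0. The sketch below models that decision with a toy routing table. The prefixes and next-hop names are hypothetical (203.0.113.0/24 is a documentation range standing in for a real cloud prefix).

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop). 203.0.113.0/24 stands in for
# a public range advertised by a cloud provider over the direct connection.
ROUTES = [
    ("0.0.0.0/0", "internet-edge"),
    ("10.0.0.0/8", "internal-core"),
    ("203.0.113.0/24", "cloud-direct-connect"),
]


def next_hop(destination: str, routes=ROUTES) -> str:
    """Pick the next hop using longest-prefix match, as a router would."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    # The most specific (longest) matching prefix wins.
    network, hop = max(matches, key=lambda item: item[0].prefixlen)
    return hop


print(next_hop("203.0.113.40"))   # cloud-direct-connect
print(next_hop("198.51.100.9"))   # internet-edge
```

Without the /24 entry, traffic to the cloud service would simply follow the default route to the internet edge and never touch the direct connect circuit you are paying for.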


I Thought the Cloud was Simple

The prevailing cloud messaging tells us that the cloud makes infrastructure simpler. There is some truth in this view from a developer’s perspective. However, for the network engineer, the cloud brings new connectivity challenges and forces us to think differently about how to engineer traffic through our networks. As you look to integrate cloud services into your on-premises data center, read up on the documentation from your cloud service provider and brush up on BGP. These tools will position you to address whatever challenges the cloud throws your way.

It's good to be home after two weeks on the road, and just in time for a foot of snow. How do I unsubscribe from Winter?


As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!


Geek Squad's Relationship with FBI Is Cozier Than We Thought

If you are committing a crime, and someone finds out you are committing a crime, they have every right to notify the authorities. There is no such thing as a "Geek Squad client privilege."


Audit finds Department of Homeland Security's security is insecure

They literally have the word "Security" in their title. Oh, the irony.


MoviePass removes 'unused' location feature that tracked cinema-goers' movements

But not until after their CEO bragged about how they were tracking everyone. Stay classy, MoviePass.


Half of All Orgs Hit with Ransomware in 2017

Well, half of the folks that reply to this survey, sure. But clickbait headlines aside, there is one important fact in this article. The fact that paying a ransom does not guarantee you get your data. If you don't have backups, you will have problems.


Fake News: Lies spread faster on social media than truth does

There's a link in this article to the MIT research paper, which is a bit of a longer read, but worth your time if you are interested. I think humans have a need to be "in the know" ahead of others, and this leads to our innate desire to spread false information faster than the truth (because we assume the truth is already known, or boring, perhaps). I think Paul McCartney said it best: "Sunday's on the phone to Monday. Tuesday's on the phone to me."


Waymo self-driving trucks are hauling gear for Google data centers

It's either self-driving trucks, or filming for the Maximum Overdrive reboot has begun.


Cory Doctorow: Let's Get Better at Demanding Better from Tech

This. So much this. The world of tech advances at an accelerated rate. It's time we find a way to demand better from tech, before we dig too deep a hole.


Live-action footage of me shoveling snow yesterday:


Risk Management is an important part of IT. Being able to identify risks and remediation options can make a huge difference if or when disaster strikes. If you've moved part of or all of your enterprise to Office 365, you now have no control over a large portion of your IT environment. But what sorts of risks do you face, and how do you deal with them?




There have been times in the past when Office 365 became unavailable for one reason or another, and there is a very high likelihood of it happening again in the future. One of the great things about using a cloud-based platform such as Office 365 is that enterprise IT doesn't need to maintain large amounts of the infrastructure. One of the big downsides is that an outage is still their problem to deal with. But what sorts of implications could this have?


What is your organization's plan if, all of a sudden, Exchange Online is unavailable? Will it grind things to a halt, or will it be a minor inconvenience? The same holds true for services such as SharePoint. If all of your critical marketing material is in SharePoint Online and the service goes down, will your salespeople be left high and dry?




Not all risk is equal. Chances are that the risk of a user deleting a document won't have the same impact as something like inbound email coming to a halt. That is why you need to measure these risks. You'll want to consider the likelihood of an event occurring, and what the impact will be if it does.


Why is this step important? By performing an assessment, you'll be able to identify areas where you can mitigate, or possibly eliminate, risks. Knowing their impact is extremely important for justifying priorities, as well as budgets.
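
One common way to measure risk is to rate likelihood and impact on a simple scale and multiply them. The sketch below assumes a 1-to-5 scale for each, and the scenarios and ratings are invented purely for illustration; your own assessment would supply the real ones.

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Score a risk as likelihood x impact, each rated from 1 (low) to 5 (high)."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return likelihood * impact


# Hypothetical ratings for illustration only.
risks = {
    "User deletes a document": (4, 1),
    "Inbound email stops": (2, 5),
    "SharePoint Online outage": (2, 4),
}

ranked = sorted(risks, key=lambda name: risk_score(*risks[name]), reverse=True)
for name in ranked:
    print(f"{risk_score(*risks[name]):>2}  {name}")
```

Even a crude ranking like this gives you something to point at when you have to justify which contingency plan gets budget first.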




As enterprise customers, we can't control how Microsoft maintains their services. But what we can do is understand what our critical business processes are, and build contingency plans for when things fall apart.


Let's use an inaccessible Exchange Online service as an example. How can you mitigate this risk? If you are running a hybrid deployment, you might be able to leverage your on-premises services to get some folks back up and running. Other options might be services from Microsoft partners. There are, for example, services that allow you to use third-party email servers to send and receive emails if Exchange Online goes offline. When service returns, the mailboxes are merged, and you can keep chugging along like nothing happened.


If you measured your risks ahead of time, you'll hopefully have noted such a possibility.




Service availability isn't the only risk. Data goes missing. Whether it is "lost," accidentally deleted, or maliciously targeted, data needs to be backed up. If you've moved any data into Office 365, you need to think about how you are going to back it up. Not only that, but what if you have to do a large restore? How long would it take you to restore 1 TB of data back into SharePoint? What impact would that window have on users?
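
A back-of-the-envelope calculation helps frame that restore question. The sketch below only does the transfer arithmetic. The sustained rates are assumptions for illustration, and a real restore may also be slowed by things this ignores, such as throttling on the service side.

```python
def restore_hours(data_tb: float, throughput_mbps: float) -> float:
    """Hours needed to move data_tb terabytes at a sustained rate in megabits/s.

    Uses decimal units: 1 TB = 8,000,000 megabits.
    """
    if throughput_mbps <= 0:
        raise ValueError("throughput must be positive")
    megabits = data_tb * 8_000_000
    return megabits / throughput_mbps / 3600


# The sustained rates below are assumptions for illustration, not measured
# or published figures for any service.
for rate in (100, 500, 1000):
    print(f"{rate:>5} Mbps -> {restore_hours(1, rate):.1f} hours")
```

At a sustained 100 Mbps, 1 TB works out to roughly 22 hours, which is most of a business day plus a night. That is the kind of window you want to know about before you need it.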


Although a lot of the "hands-on" management is removed from IT shops when they migrate to Office 365, that doesn't mean that their core responsibilities are shifted. At the end of the day, IT staff are responsible for making sure that users can do their jobs. Just because something is in the cloud doesn't mean that it will be problem free.

By Paul Parker, SolarWinds Federal & National Government Chief Technologist


Here is an interesting article from my colleague Joe Kim, in which he explores database health and performance.


Part of the problem with managing databases is that many people consider database health and performance to be one and the same, but that’s not necessarily the case. Let’s take a closer look at these terms.


Health versus performance: What’s the difference?


Health and performance are certainly closely related, even interconnected. But assuming they are one and the same is potentially a recipe for disaster. If you’re homed in exclusively on your database’s health, you may be overlooking critical metrics that affect your database’s performance. Here’s why:


Database health is inclusive of data points. When you take into consideration such factors as CPU utilization, I/O statistics, and memory pressure, you can determine if your database is capable of proper performance. But these metrics alone cannot confirm that the system’s performance is running optimally.


Database performance integrates an element of time measurement to explain how database queries are being executed. It’s this time component that comes into play when talking about true performance.


Diagnosing the root cause: Database performance management best practices


Identifying the true root cause of database performance issues is the goal of every federal database manager. And yet, without the proper metrics in hand, you lack the tools necessary to resolve more comprehensive problems.


That said, let’s take a closer look at some best practices that take into account both health and performance to create efficient, well-optimized database processes.


Acquire data and metrics. You need granular metrics like resource contention and a database’s workload to identify the root cause of a performance issue. Without good, deep intelligence, you lack the ability to troubleshoot accurately and effectively.


Establish meaningful data management. Every database manager has his or her own way of arranging data, but the key is to arrange it in a way that will help you quickly identify and resolve the root cause of a potential problem. Establishing a system that allows you to do so quickly can help keep your databases running efficiently.


Triangulate issues. The ability to triangulate makes it easy to answer all-important questions regarding who, what, when, where, and why. These questions help you determine the details of a performance issue. It is important to understand who and what was impacted by poor performance, and what caused the impact.


Review execution plans. Query optimizers are critical database components that analyze Structured Query Language (SQL) queries and determine efficient execution for those queries. The problem is that optimizers can be a bit of a black box; it’s often difficult to see what’s going on inside of them.
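
Most database engines offer some way to peek inside that black box. As a small, self-contained illustration, the sketch below uses SQLite (through Python's built-in `sqlite3` module) and its `EXPLAIN QUERY PLAN` statement. The table, index, and query are hypothetical, and other engines expose plans through their own commands and tools.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

sql = "SELECT * FROM orders WHERE customer_id = ?"
before = query_plan(conn, sql, (42,))      # expect a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = query_plan(conn, sql, (42,))       # expect a search using the index

print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, but the shift from scanning the table to searching an index is the kind of change an execution plan review is meant to surface.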


Establish a baseline. It’s impossible to tell if your database isn’t performing optimally if you lack a baseline of normal, day-to-day performance.
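
A baseline can be as simple as the average and spread of a metric during a normal period. The sketch below flags a measurement that sits more than three standard deviations from the mean. The sample response times and the three-sigma threshold are assumptions for illustration; real baselines usually account for time of day and day of week as well.

```python
import statistics


def build_baseline(samples):
    """Summarize normal behavior as the mean and standard deviation of samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_abnormal(value, baseline, threshold=3.0):
    """True when value is more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > threshold * stdev


# Hypothetical query response times in milliseconds from a normal week.
normal_week_ms = [120, 118, 125, 130, 122, 119, 127]
baseline = build_baseline(normal_week_ms)

print(is_abnormal(124, baseline))   # False
print(is_abnormal(210, baseline))   # True
```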


It’s a marathon, not a sprint


IT pros want their databases to be in good health and to perform optimally. While both are equally important, it’s the end result that matters.


So, make sure you are looking at all criteria of database health and performance. If you are deploying the best practices and tools to help ensure the overall health and performance of your database, your stakeholders will thank you for it.


Find the full article on our partner DLT’s blog Technically Speaking.

Traveling With Joy

Posted by Leon Adato, Mar 12, 2018

Recently, two people I respect very much tweeted about travel, and how to remain positive and grateful while you do it. You can read those tweets here and here.


When I saw Jessica's first tweet, I wanted to respond, but thought, "She doesn't need my noise in her Twitter feed." But when Josh jumped in with his thoughtful response, I had to join in. If you prefer tweets, you can find the starting point here. For old-fashioned folks who still like correct spelling, complete sentences, and non-serialized thoughts, read on:


First, you need to understand that I have some very strong opinions about how someone should carry themselves if they are lucky enough to get to do "exciting" travel for work. When I say exciting travel, I mean:

  • Travel to some place that YOU find exciting
  • Travel that someone ELSE might find exciting


Here's why I feel so strongly:


As I've written before, my dad was a musician. His combination of talent, youth, and connections (mostly talent) gave him the opportunity to join a prestigious orchestra, one that traveled extensively from the time he joined (in 1963) until he retired 46 years later. My dad went everywhere. He was escorted through Checkpoint Charlie twice in the 60s. He wandered around cold-war, iron-curtain Moscow around the same time. He traveled to Australia, Mexico, all over Europe, and, of course, to almost every state in the United States.


It was a charmed life. To be sure, he worked hard to get where he was and made sacrifices along the way. But at the end of the day, he got to play great music with talented colleagues in front of sell-out audiences around the world. It was SO remarkable, that people sometimes had a hard time believing that was all he did.


Because I would "go to work" with him from time to time (which meant a lot of sitting in the green room, wandering backstage, and standing next to him during intermission when he'd come out for some fresh air), I was privy to him meeting audience members without really being part of their conversation, which would often follow a very specific pattern:


"So what do you do during the day?" they'd ask, figuring that he--like the musicians they probably knew--did this as a side gig while they worked an office job or plied a trade to pay the bills. When they found out that this was ALL he did, that he got paid a living wage to perform music, their sense of amazement increased. That's when they would begin asking (i.e. gushing) about the traveling. While some of these people were well-off, many were folks who often had never left the state where they were born, let alone the country, let alone been on a plane. That's when it became hard to watch.


He'd shrug and say, "I get on a plane, sleep, get off the plane, get on the bus, go to the hall, rehearse, eat, play the concert, get on a bus, go to the next town, sleep, get up, rehearse, eat, play. I could be in Timbuktu or Topeka."


From my fly-on-the-wall vantage point, I'd watch the other person deflate. They had hoped to feel a sense of wonder imagining the exotic, the special. Instead, they had the dawning recognition that they might as well have been talking to a plumber about the stores he visits. (No disrespect to plumbers. You folks rock.)


As I grew up and settled into a career in IT, I never thought I'd have the kind of work that would give me opportunities to travel the way my dad did. Which is why, years later, I stood crying under the Eiffel Tower. Not because of the wonder of the structure, but for the miracle that I was standing there AT ALL. I was overwhelmed by the sheer impossible magic of being in a role where traveling from Cleveland, Ohio to Paris was possible in any context other than a once-in-a-lifetime, piggy-bank-breaking vacation.


A three-month project in Brussels followed Paris. A year in Switzerland came after that. In between were shorter trips, no less inspiring for being closer to home. Just getting onto a plane and taking off was an adventure in itself.


And through it all were the people. As Jessica said in her tweet, "Thousands of unseen humans help me get to my destination." I was meeting these people, hearing their stories, and being asked to tell mine.


In those moments--in the Lyft on the way to the airport; checking in at the hotel; sitting next to someone on the shuttle to the car rental area--I'm reminded of those moments when I stood next to my dad during intermission. While there are many things about the man that I admire, he's not infallible, and there are definitely habits of his that I choose not to emulate. This is one of them.


So I try to write (sometimes more than is strictly required of me) when I go to new and different places. When I have the time and focus, I write before I go about what I hope to see/do/learn; and then I write again afterward, detailing what I saw, who I met, and how it went.


As Head Geek for SolarWinds, I write these essays partly because it's actually my job. (Best. Job. Ever.) But I also do it because I'm aware that jobs like mine are unique. I want to provide a vicarious experience for those who might want it, so that they can share a sense of wonder about the exotic, the special.


I also write so that someone who has chosen to forgo these types of opportunities, whether due to ambivalence, anxiety, or uncertainty, might find motivation, reassurance, or insight; that in reading about my experiences, they might realize they have more to gain than they thought.


Finally, I write about my travels for myself. To remind me that, like both Jessica and Josh said, in each trip, thousands of things go right and thousands of people are helping me get where I need to go. To remind me of the wonder, the exotic, the special.


And the blessing.

In the final blog of this series, we’ll look at ways to integrate Windows event logs with other telemetry sources to provide a complete picture of a network environment. The most common way of doing this is by forwarding event logs to a syslog server or SIEM tool.


The benefits of telemetry consolidation are:

  1. Scalability and performance – log collectors are built for and focused on collecting logs.
  2. False positive reduction – some events, even if they generate an alert, are not meaningful on their own. By combining them with other events in a query, the security analyst can determine whether there was a compromise. For example, multiple login failures must be examined in conjunction with other events to distinguish a threat from simple user error.
  3. Determination of the extent of a compromise – once an attack has been detected and verified, the next step is to look for lateral movement, the route of entry to the asset initially compromised, any user-specific data gleaned from the activities, the failure of a security element such as a firewall or IPS to detect the issue, or, conversely, a threat blocked at a specific point due to the successful application of the security policy. Visibility across the breadth of the organization is critical to incident response and remediation.
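
As a rough illustration of the false positive reduction described above, the following Python sketch flags a source address only when a burst of failed logons (event ID 4625) is followed by a successful logon (event ID 4624) from the same address. The event dictionaries, their field names, and the thresholds are assumptions made for this example; they are not the output format of any particular log collector.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FAILED_LOGON = 4625      # Windows "An account failed to log on"
SUCCESSFUL_LOGON = 4624  # Windows "An account was successfully logged on"


def suspicious_sources(events, min_failures=5, window=timedelta(minutes=10)):
    """Return source IPs with at least min_failures failed logons
    followed by a successful logon inside the time window."""
    failures = defaultdict(list)
    flagged = set()
    for event in sorted(events, key=lambda e: e["time"]):
        source = event["source_ip"]
        if event["event_id"] == FAILED_LOGON:
            failures[source].append(event["time"])
        elif event["event_id"] == SUCCESSFUL_LOGON:
            # Count only the failures that are close enough to this success.
            recent = [t for t in failures[source] if event["time"] - t <= window]
            if len(recent) >= min_failures:
                flagged.add(source)
    return flagged


# Hypothetical sample data: one noisy source, one user who mistyped a password.
base = datetime(2018, 3, 1, 9, 0)
events = [
    {"event_id": FAILED_LOGON, "source_ip": "10.0.0.5", "time": base + timedelta(minutes=i)}
    for i in range(5)
]
events.append({"event_id": SUCCESSFUL_LOGON, "source_ip": "10.0.0.5", "time": base + timedelta(minutes=6)})
events += [
    {"event_id": FAILED_LOGON, "source_ip": "10.0.0.9", "time": base},
    {"event_id": FAILED_LOGON, "source_ip": "10.0.0.9", "time": base + timedelta(minutes=1)},
    {"event_id": SUCCESSFUL_LOGON, "source_ip": "10.0.0.9", "time": base + timedelta(minutes=2)},
]

print(suspicious_sources(events))
```

In this sample, only the first source crosses the threshold; two failures followed by a success look more like a mistyped password and are left alone.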


Windows Event Logs to a Syslog Host and Beyond


The following is an example of forwarding Windows event logs to a syslog server and from there pushing these events to a basic SIEM tool. I’m showing SolarWinds Event Log Forwarder to Kiwi Syslog Server to ELK (Elasticsearch, Logstash, Kibana) because they are great tools for illustrating the process, and they are all free in their basic form, which means you can have some gratuitous fun testing things out.


Step 1: Configure the event log forwarder agent on the host that is collecting the Windows event logs (refer to last week’s blog for configuring forwarding and collection).


Define the transport to the syslog server.



Define the event log subscription, which is the list of events to be sent to the syslog server.


Step 2: The syslog server should be configured to listen on the correct port. It will receive those events defined in the subscription above.


Step 3: The syslog server can be configured to forward events to another device, such as a SIEM tool. The example below shows how to configure an action that will forward the Windows events from the syslog server via syslog to another host. The events may have an RFC 3164 syslog header added to them to indicate the original IP of the syslog server (useful if NAT may change the source address of the IP datagram), or you can send the syslog message using the IP of the original source of the event. Another option is to use just the original source IP address of the syslog host. This decision often depends on how the receiving host application processes and indexes events.
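
To make the header option more concrete, here is a small Python sketch of what an RFC 3164 style line looks like when the hostname field carries the identity you want the receiver to index. The function name, the host name dc01, and the tag are hypothetical; the layout (priority, timestamp, hostname, tag, message) follows RFC 3164, but this is not how Kiwi Syslog Server builds its messages internally.

```python
from datetime import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_rfc3164(facility, severity, timestamp, hostname, tag, message):
    """Build a BSD syslog (RFC 3164) line: <PRI>Mmm dd hh:mm:ss HOST TAG: MSG."""
    priority = facility * 8 + severity  # PRI value as defined by RFC 3164
    # Single-digit days are padded with a space, not a zero.
    stamp = "{} {:2d} {}".format(
        MONTHS[timestamp.month - 1], timestamp.day, timestamp.strftime("%H:%M:%S")
    )
    return "<{}>{} {} {}: {}".format(priority, stamp, hostname, tag, message)


# Hypothetical example: facility 4 (security/auth), severity 4 (warning).
line = format_rfc3164(4, 4, datetime(2018, 3, 5, 14, 2, 7),
                      "dc01", "Security", "Event 4625: logon failure")
print(line)
```

Whichever identity you place in the hostname field is what most receivers will index the event under, which is why the choice matters when NAT sits between the two hosts.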



Step 4: Install and configure the SIEM tool, in this case Elasticsearch, Logstash, and Kibana, known as the ELK stack. There are some references for accomplishing this at the end of this blog.


The key concept for bolting them together is to define a Logstash-simple.config file that takes an input (for example, the TCP/514 events coming from your syslog server) and outputs those events to Elasticsearch, which indexes your event data. The default Elasticsearch address is localhost:9200.


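# Listen for events forwarded by the syslog server
input {
  tcp {
    port => 514
  }
}

# Send everything received to the local Elasticsearch node
output {
  elasticsearch { hosts => ["localhost:9200"] }
}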
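
Before pointing the syslog server at Logstash, it can help to push a single test line at the TCP input and confirm it shows up in Elasticsearch. The Python sketch below does only that. The function name and defaults are hypothetical, and it assumes the tcp input is reading newline-delimited text, so check the codec in your own configuration.

```python
import socket


def send_syslog_line(line, host="localhost", port=514, timeout=5.0):
    """Send one newline-terminated message over TCP and return the bytes sent."""
    payload = line.rstrip("\n").encode("utf-8") + b"\n"
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return len(payload)


# Example call (requires a listener on the target port):
# send_syslog_line("<36>Mar  5 14:02:07 dc01 Security: test event")
```

If the call raises a connection error, the listener is not reachable on that port, which is worth ruling out before troubleshooting the forwarding rules.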



Once Kibana is installed it will be your user interface for viewing, indexing, searching and visualizing your events. By default it runs on localhost:5601.



Your Windows logs can then become part of an overall view of all the telemetry sources and types in your network, viewable and searchable through a single interface. This enables you to build queries across all your data types. By correlating events you increase the fidelity of your investigations by adding visibility.



Working example of a threat hunt


The following table summarizes the types of information that can be gathered and analyzed from a single pane of glass provided by a log aggregator with good search and index capabilities, or by a SIEM tool or service.

In this case, the initial trigger is potentially suspicious lateral movement within an organization. When investigating such an event, it’s important not to treat it as an isolated incident, even if you receive only one trigger or alert. Correlation is the key to eliminating false positives. Remember, the goal is to rule out false positives, and if the threat is legitimate, you must understand the extent of the attack and when and where it began.




Indicators of Compromise

| Activity | Event IDs or data to review | Details to collect or action to take |
| --- | --- | --- |
| Detect unusual host-to-host activity | 528, 529, 4624, 4625: Type 3 (network) or Type 10 (RDP) login/logout | Network Information (collect calling workstation information): Name; Source Network Address: IP; Source Port: Port |
| Verify privilege escalations | 552, 4648: Runas or privilege escalation | Account Whose Credentials Were Used: Account Domain: DOMAIN |
| Verify scheduled tasks | 602, 4698: Unusual task names scheduled and quickly deleted | Scheduled Task created: File Name: Name; Command: Cmd; Triggers: When run |
| Verify PsExec | 601, 4697: Remote code execution at CMD line following service installation | Attempt to install service: Service Name: Internal Svc Name; Service File Name: path/name; *Service Type: Code; *Service Start Type: Code |
| Check VirusScan logs on hosts | Filenames, process name, hashes | Activities may have been attempted by other tools on the host and been detected and blocked. |
| Check firewall policy; network access policies on AAA devices; audit logs on other critical assets | Event timestamps, IPs, usernames | Determine if a FW or other security element should be modified to stop further attacks based on IP addresses, ports, or other IoCs. |
| Pull malicious file hashes | SHA-256, etc.; submit to sandbox or analysis tool | Derive other IoCs representative of this malware and search events for other occurrences to get a better idea of when the attack may have started. |
| Failure of rule-based element | Set of verifiable IoCs | Update rulesets, virus.dat’s, signature sets. Patch known vulnerabilities. |

*The sc query command will show you information on the active services on a workstation
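
The event IDs in the table lend themselves to a simple lookup when you are sorting through search results. The following Python sketch groups events by the hunt activity they belong to and drops logon events that are not network (Type 3) or RDP (Type 10) logons. The event dictionaries and field names are assumptions for illustration; adapt them to whatever your log aggregator or SIEM returns.

```python
from collections import defaultdict

# Event IDs grouped by the hunt activities listed in the table above.
STEP_EVENT_IDS = {
    "Detect unusual host-to-host activity": {528, 529, 4624, 4625},
    "Verify privilege escalations": {552, 4648},
    "Verify scheduled tasks": {602, 4698},
    "Verify PsExec": {601, 4697},
}
LOGON_EVENT_IDS = STEP_EVENT_IDS["Detect unusual host-to-host activity"]
REMOTE_LOGON_TYPES = {3, 10}  # 3 = network, 10 = RDP


def triage(events):
    """Group events by hunt activity, keeping only remote logon types."""
    buckets = defaultdict(list)
    for event in events:
        event_id = event["event_id"]
        if event_id in LOGON_EVENT_IDS and event.get("logon_type") not in REMOTE_LOGON_TYPES:
            continue  # local or interactive logons are out of scope here
        for step, ids in STEP_EVENT_IDS.items():
            if event_id in ids:
                buckets[step].append(event)
    return dict(buckets)


# Hypothetical sample events.
sample = [
    {"event_id": 4624, "logon_type": 10, "host": "ws01"},
    {"event_id": 4624, "logon_type": 2, "host": "ws02"},
    {"event_id": 4698, "host": "ws01"},
    {"event_id": 7036, "host": "ws03"},
]
for step, matched in triage(sample).items():
    print(step, len(matched))
```

Grouping this way keeps the initial trigger and its related artifacts side by side, which makes it easier to see whether one host keeps appearing across activities.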


From this example you can see it’s a best practice to start small by reacting to the initial trigger and from here collect other important artifacts that will help you cast a wider net across the entire network. Some of these artifacts will also help you to become more proactive as IoCs can be mapped to security policies and rule sets and applied to key security elements.


Windows logs are an important tool in your attack detection toolbox. Hopefully this series has given you some useful information on best practices and deployment.


Recommended References:

I’m in Redmond this week for the Microsoft MVP Summit. This will be my ninth Summit, but I’m as excited as if it was my first. The opportunity to meet with the people that make and ship the bits, provide valued feedback on their products, and connect with other data professionals is something I treasure. Here’s hoping they keep me around for another year.


As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!


Here’s How Much Money Dropbox Saved by Moving Out of the Cloud

Probably none, because the article does not talk about how much money it costs Dropbox to manage the infrastructure themselves.


China's hypersonic aircraft would fly from Beijing to New York in two hours

That sounds cool and all, but, um, what about the G-force felt during acceleration, deceleration, and turns at that speed? Well, that sounds cool, too. Sign me up.


AI vs. Lawyers

OK, forget self-driving cars. The first thing I want AI to replace is lawyers. These results are encouraging.


The Man Who Claimed to Invent Bitcoin Is Being Sued for $10 Billion

Oh, good. With this case heading to court there’s a chance that under oath someone might unwittingly admit to how Bitcoin is more scam than currency. And I love how they want to be paid in Bitcoin. By the time this case is settled they may get enough to buy a cup of coffee, and the transaction will take six hours to process.


GitHub Hit by 1.35Tbps Memcached DDoS

“Hey, let’s just use GitHub for our source control! It’s FREE!”


New Study Shows 20% of Public AWS S3 Buckets are Writable

Proof that the cloud is just as secure as your own data center. People are going to misunderstand technology no matter where it is hosted.


The Deadlock Empire: Slay dragons, master concurrency!

Because I’m a database geek and I want y’all to understand that deadlocks are caused by application code, nothing more. The next time you have a deadlock, don’t blame the DBA. Instead, take a look in the mirror.


Being able to hang out with fellow Microsoft Data Platform geeks for three days is the highlight of my year.


Thus far, we have gone over how to classify our disasters and how to have some of those difficult conversations with our organization regarding Disaster Recovery (DR). We've also briefly touched on Business Continuity, an important piece of disaster recovery. Now the time has come to gather all our information and put together something formal in terms of a Disaster Recovery plan. As easy as it sounds, it can be quite a daunting task once you begin. DR plans, just like their disasters, come in all forms, and you can go as broad or as detailed as you like. There is no real “set in stone” template or set of instructions for DR plan creation. For example, some DR plans may just cover how to get services back up and going at the 100-foot level, maybe focusing on more of a server level. Others may contain application-specific instructions for restoring services, while others cover how to recover from yet another disaster at your secondary site. The point is that it’s your organization's DR plan, so you can do as you like. Just remember that it might not be you, or even your IT department, executing the failover, so the more details the better. That said, I mentioned that once we begin to create our DR plan, it can become quite overwhelming. That is why I always recommend starting at that 100-foot level and circling back to input details later.


So, with all that said, we can conclude that our DR plans can be structured however we wish, and that’s true. A quick Google search will yield hundreds of different templates for DR plans, each unique in their own way. However, to have a legible, solid, successful DR plan, there are five sections it needs to contain.




Introduction


The introduction of a DR plan is as important as one found in a textbook. Basically, this is where you summarize both the objectives and the scope of the plan. A good introduction will include all the IT services and locations that are protected, as well as the RTOs and RPOs associated with each. Aside from the technical aspect, the introduction should also contain the testing schedule and maintenance scope for the plan, as well as a history of revisions that have been made to the plan.


Roles and Responsibilities


We have talked a lot in this series about including stakeholders and application owners outside of the IT department in our primary discussions. This is the section of the plan where you will formally list all your internal and external departments and personnel who are key to each DR process that has been covered in our DR plan. Remember, execution of this plan is normally run under the event of a disaster, so names are not enough. You need brief descriptions of their duties, contact information, and even alternate contact information to ensure that no one is left in the dark.


Incident Response(s)


This is where you will include how a disaster event is declared, who has the power to do so, and the chain of communication that shall immediately follow. Remember, we can have many different types of disasters, therefore we can also have many different types of disaster declarations and incident responses. For instance, a major fire will yield a different incident response than an attempted ransomware attack. We need to know who is making the declaration, how they are doing so, and who will be contacted, and so on, down the chain of command.


DR Procedures


Once your disaster has been declared, those outlined within the Roles and Responsibilities can begin to act on steps to bring the production environment back up within your secondary location. This is where those procedures and instructions are laid out, step by step, for each service that is identified within the plan’s scope. A lot of IT departments will jump right into this step, and this is where our plan creation can tend to get out of control. A rule of thumb is to start broad with your process, define any prerequisites, and then dive into details. Once you are done with that, you can circle back for yet another round of details.


For example, “Recover Accounting Services” may be a good place to start. You then can dive into the individual servers that support the service as a whole, listing out all the servers (names, IPs, etc.) you need to have available. You can then get into finer details about how to get each server up and running to support the service as a whole. Even further, you may need to make changes to the application for it to run at your secondary location (maybe you have a different IP scheme, different networks, etc.), or have support for external hardware, such as a fax server to send out purchase orders.




Appendix


This is where you place a collection of any other documents that may be of value to your organization in the event of a disaster. Vendor contacts, insurance policies, and support contracts can all go into an appendix. If there is a certain procedure to recover a server (for example, you use the same piece of software to protect all services), and you've already provided--in the DR Procedures section--an exhaustive list of instructions, you can always add it here as well, and simply reference it from within the DR plan.


With these five sections filled out, you should be certain that your organization is covered in the event of a disaster. A challenge, however, may be keeping your document up to date as your production environment changes. Today’s data centers are far from the static providers they once were. We are always spinning up new services, retiring old ones, moving things to and from the cloud. Every time that happens--to be successful in DR--we need to reassess that service within our DR plan. It needs to be a living document, right from its creation, and must always be kept up to date! And remember, it’s your DR plan, so include any other documents or sections that you or your organization wants to. At the end of the day, it’s better to have more information available than not enough, especially if you aren’t the person responsible for executing it! Also, please store a copy of this at your secondary location and/or in the cloud. I’ve heard too many stories of organizations losing their DR plan along with their production site.
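
One low-effort way to notice when the plan has drifted from production is to compare the list of services you run today against the list of services the plan covers. Here is a minimal Python sketch of that comparison. The function name and the service names are hypothetical, and it assumes you can export both lists from wherever you keep them.

```python
def dr_coverage_gaps(production_services, dr_plan_services):
    """Compare what runs in production with what the DR plan covers."""
    production = set(production_services)
    planned = set(dr_plan_services)
    return {
        # Running today, but nobody has written a recovery procedure for it.
        "missing_from_plan": sorted(production - planned),
        # Still documented, but retired or moved, so the procedure is stale.
        "stale_in_plan": sorted(planned - production),
    }


# Hypothetical inventories.
production = ["accounting", "email", "file-shares", "web-store"]
dr_plan = ["accounting", "email", "fax-server"]
print(dr_coverage_gaps(production, dr_plan))
```

Running a check like this whenever a service is added, retired, or moved is one way to keep the plan a living document.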


I’d love to hear your thoughts about all this! How do you structure your DR plans? Are you more detailed or broader in terms of laying out the instructions to recover? Have you ever had to execute a DR plan you weren’t a part of? If so, how did that change your views on creating these types of procedures and documents? Thanks for reading!

In the age of exploration, cartographers navigated around the world and mapped the coastlines of unexplored continents. The coastline of IT, and moreover its inner landscapes and features, has become much more complex than it was a decade ago. The cost and effort needed to perform adequate mapping the old way have risen sharply, and manual mapping is no longer an affordable endeavor, let alone a productive one. Organizations and administrators need a solution to the problem, but where should they start?


To continue on this analogy, explorers of old had a few things to help themselves: maps of the known world, navigation instruments and the stars. They also set sail to discover the vast world and uncover its riches, at the price that most of us know now. Back to our modern world: our goal is to understand which services are critical to a business service, and the reason why we want to understand this is clear. We want to ensure the delivery of IT services with the best possible uptime and performance, without disruptions if possible.


It’s essential to start from the business service view. We need to base ourselves, like explorers of old, on existing maps and features as a reference point. Each organization will have its own way of documenting (hopefully), but the most probable starting point would be a service Business Impact Assessment (BIA). The BIA would give a description of upstream and downstream dependencies of a given service, application platforms (and eventually named systems) involved in supporting the service. From there, we can eventually be led to documentation that describes an application, its components, architecture, and systems.


Creating and maintaining a catalog of business impact assessments diverges from the usual kind of work IT personnel do. It might not even be a purely IT endeavor, as compliance departments in larger organizations may own the process. Nevertheless, it is essential that IT is involved because a BIA is the ideal place to capture criticality requirements. It helps articulate how a given process or service impacts the organization’s ability to conduct business operations, assess how the organization is impacted in case of failure, and determine the steps to recover the service. Capturing adverse impact is a key activity because it helps to classify the criticality of the service itself in case of failure. Impact can be financial (loss of revenue, loss of business), reputational (loss of trust from investors/customers/partners, press scrutiny), or regulatory (loss of trust from regulatory bodies/legislative authorities, regulatory scrutiny, regulatory audits, and eventually even revocation of license to operate in a given country/region for regulated businesses).


The inconvenience with any BIA or written document is that they are a point-in-time description of a service, which is cast in stone until the next documentation revision date. Therefore it is a necessity to engage with the business process owners, and eventually with application teams, to understand if any changes were introduced. While this allows for a better view of the current state, it has the disadvantage of being a manual process with a lot of back-and-forth interactions. Another challenge we might encounter is that the BIA strictly covers a single process, without mentioning any of the upstream/downstream dependencies, or perhaps mentioning them, but without referring to any document (because there was no BIA done for another service, for example). It might also be impossible to even get one done, because a given process could rely on a third-party service or data source, over which we have no control.


There’s also another challenge looming: Shadow IT. Shadow IT broadly characterizes any IT systems that support an organization’s business objectives, but fall outside of IT scope either by omission or by a deliberate decision to conceal the existence of such systems from IT. Because these systems exist outside of a formally documented scope, or are not known to IT organizations, it is very difficult to assess their criticality, at least from an IT standpoint. Portions of business processes or entire business divisions may be leveraging external or third-party services, over which IT has no oversight or control, and yet IT would be held responsible in case of failure.


How can IT understand the criticality of a given application service in the context of a business service when the view is incomplete or even unknown?


  • From a business perspective, the organization leadership should assert or reassert IT’s role in the organization’s digital strategy, by making IT the one-stop shop for all IT related matters. Roles and responsibilities must be well established, and the organization’s leadership (CIO / CTO) should take an official stance on how to handle shadow IT projects.
  • From a compliance perspective, clear processes must be established for documenting services and systems. The need to document business processes and their underlying technical systems/platforms is evident: services that are critical from a business perspective should be documented via a Business Impact Analysis, then collected and regularly reviewed in the documentation that covers the organization’s business continuity strategy (usually a Business Continuity Plan).
  • From a technical perspective, the IT organization should be involved in compliance/documentation processes, not only for review purposes but also to provide the technical standpoint and the necessary technical steps that fall under the Business Continuity/Disaster Recovery strategy.


To encompass these three perspectives, regular checkpoints, meetings, or reviews can help maintain the consistency of the view and the strategy. Is this sufficient, however? Unfortunately, not always. Those concepts work perfectly with consistent and stateful processes/systems, but with the gradual advent of ephemeral workloads that can be spun up or scaled down on demand, it becomes difficult to keep full track.


While a well-defined documentation framework is necessary to establish processes that must be adhered to, and while documented processes with prioritization and criticality levels are essential, it is also necessary to complement this approach with a dynamic and real-time view of the systems.


Modern IT operations management tools should allow the grouping of assets not only by category or location, but also by logical constructs, such as an application view or even a process view. These capabilities have existed in the past, but were always performed manually. Advanced management platforms should leverage traffic flow monitoring capabilities to understand which systems are interacting, and logically group them based on traffic types. This requires a certain level of intelligence built into the tool. For example, in a Windows-based environment, many systems will communicate with the Active Directory domain controllers, or with a Microsoft System Center Configuration Manager installation. The existence of traffic between multiple servers and these shared servers doesn’t necessarily imply an application dependency. The same could be said of a Linux environment, where traffic flows between many servers and an NTP server or a yum repository. On the other hand, traffic via other ports could hint at application relationships. A web server communicating with another server via port 3306 would probably mean a MySQL database is being accessed and would constitute plausible evidence of an application dependency.
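
To show the kind of reasoning such a tool would apply, here is a deliberately simplified Python sketch that separates shared infrastructure traffic from likely application dependencies based on destination port. The port lists, host names, and flow tuples are assumptions for illustration only; a real platform would rely on far richer signals than a port number.

```python
# Well-known ports that nearly every server talks to; traffic here says
# little about application dependencies.
SHARED_INFRASTRUCTURE_PORTS = {53: "DNS", 88: "Kerberos", 123: "NTP", 389: "LDAP"}

# Ports that usually indicate one application tier calling another.
APPLICATION_PORTS = {1433: "Microsoft SQL Server", 3306: "MySQL", 5432: "PostgreSQL"}


def classify_flow(dst_port):
    """Label a flow as application, infrastructure, or unknown by port."""
    if dst_port in APPLICATION_PORTS:
        return ("application", APPLICATION_PORTS[dst_port])
    if dst_port in SHARED_INFRASTRUCTURE_PORTS:
        return ("infrastructure", SHARED_INFRASTRUCTURE_PORTS[dst_port])
    return ("unknown", None)


def likely_dependencies(flows):
    """Return unique (source, destination, service) application links."""
    links = set()
    for src, dst, dst_port in flows:
        kind, service = classify_flow(dst_port)
        if kind == "application":
            links.add((src, dst, service))
    return sorted(links)


# Hypothetical flow records: (source host, destination host, destination port).
flows = [
    ("web01", "db01", 3306),
    ("web01", "dc01", 389),
    ("db01", "ntp01", 123),
    ("web01", "db01", 3306),
]
print(likely_dependencies(flows))
```

Flows that land in the unknown bucket are the ones worth a conversation with the application team, since a port number alone cannot settle the question.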


Knowing which services are critical to a business service doesn’t require the use of a Palantir. It should be a wise blend of relying on solid business processes and on modern IT operations management platforms, with a holistic view of interactions between multiple systems and intelligent categorization capabilities.

IT organizations manage security in different ways. Some companies have formalized security teams with board-level interest. In these companies, the security team will have firm policies and procedures that apply to network gear. Some organizations appoint a manager or director to be responsible for security with less high-level accountability. Smaller IT shops have less formal security organizations with little security-related accountability. The security guidance a network engineer receives from within their IT organization can vary widely across the industry. Regardless of the direction a network engineer receives from internal security teams, there are reasonable steps he or she can take to protect and secure the network.


Focus on the Basics


Many failures in network security happen due to a lack of basic security hygiene. While this problem extends up the entire IT stack, there are basic steps every network engineer should follow. Network gear should have a consistent, templated configuration across your organization. Ad-hoc configurations, varying password schemes, and a disorganized infrastructure open the door for mistakes, inconsistencies, and vulnerabilities. A well-organized, rigorously implemented network is much more likely to be a secure network.


As part of the standard configuration for your network, pay special attention to default passwords, SNMP strings, and unencrypted access methods. Many devices ship with standard SNMP public and private communities. Change these immediately. Turn off any unencrypted access methods like telnet or unencrypted web access (http). If your organization doesn't have a corporate password vault system, use a free password vault like KeePass to store enable passwords and other sensitive access information. Don't leave a password list lying around, stored on SharePoint, or unencrypted on a file share. Encrypt the disk on any computer that stores network configurations, especially engineer laptops, which can be stolen or accidentally left behind.
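
Checks like these are easy to automate against saved configurations. The Python sketch below scans a configuration for default SNMP communities, telnet on the vty lines, and the unencrypted HTTP server. It assumes Cisco IOS-style syntax, and the patterns and sample configuration are illustrative rather than exhaustive, so treat it as a starting point and adjust it for your own platforms.

```python
import re

# (finding label, pattern) pairs; assumes Cisco IOS-style configuration lines.
CHECKS = [
    ("default SNMP community",
     re.compile(r"^snmp-server community (public|private)\b", re.IGNORECASE)),
    ("telnet allowed on vty lines",
     re.compile(r"^\s*transport input .*\b(telnet|all)\b", re.IGNORECASE)),
    ("unencrypted HTTP server enabled",
     re.compile(r"^ip http server\b", re.IGNORECASE)),
]


def audit_config(text):
    """Return (line number, finding, line) for each risky configuration line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            if pattern.search(line):
                findings.append((number, label, line.strip()))
    return findings


# Hypothetical configuration excerpt.
SAMPLE_CONFIG = "\n".join([
    "snmp-server community public RO",
    "snmp-server community N0tDefault RO",
    "no ip http server",
    "ip http secure-server",
    "line vty 0 4",
    " transport input telnet ssh",
])

for number, label, line in audit_config(SAMPLE_CONFIG):
    print(number, label, line)
```

Running a script like this against every device backup is one way to confirm that the templated configuration is actually in place everywhere.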


To Firewall or Not to Firewall


While many hyperscalers don't use firewalls to protect their services, the average enterprise still uses firewalls for traffic flowing through their corporate network. It's important to move beyond the legacy layer 4 firewall to a next-generation, application-aware firewall. For outbound internet traffic, organizations need to build policy based on more than the 5-tuple. Building policies based on username and application will make the security posture more dynamic without compromising functionality.


Beyond the firewall, middle boxes like load balancers and reverse-proxies have an important role in your network infrastructure. Vulnerabilities, weak ciphers, and misconfigurations can leave applications and services wide open for exploit. There are many free web-based tools that can scan internet-facing hosts and report on weak ciphers and easy-to-spot vulnerabilities. Make use of these tools and then plan to remediate the findings.


Keep A Look Out for Vulnerabilities


When we think of patch cycles and vulnerability management, servers and workstations are top of mind. However, vulnerabilities exist in our networking gear too. Most vendors have mailing lists, blogs, and social media feeds where they post vulnerabilities. Subscribe to the relevant notification streams and tune your feed for information that's relevant to your organization. Make note of vulnerabilities and plan upgrades accordingly.


IT security is a broad topic that must be addressed throughout the entire stack. Most network engineers can't control the security posture of the endpoints or servers at their company but they do control networking gear and middle boxes which have a profound impact on IT security. In most instances, you can take practical, common sense steps that will dramatically improve your network security posture.

By Paul Parker, SolarWinds Federal & National Government Chief Technologist


Here's an interesting article from my colleague Leon Adato, in which he suggests that honesty is the best policy.


IT professionals have a tough job. They face the conundrum of managing increasingly complex and hybrid IT platforms. They must protect their networks from continually evolving threats and bad actors. Budgets are restrictive and resources slim. And there are political agendas.


Given all of these factors, it’s understandable if we might feel compelled to tell some little white lies to ourselves on occasion. “Everything’s fine,” we might say, even if we’re not entirely sure that it is true. We might also be willing to engage in some little excuses and statements of overconfidence.


However, it’s important we acknowledge we may not have all the answers. We must continue to be honest with ourselves to avoid living in a world of gray.


You Don’t Know What You Don’t Know


Sometimes it’s difficult to truly know how your infrastructure operates. That’s especially true in hybrid IT models. It’s very difficult to gain a complete view of our entire operation without the proper monitoring tools.


As pessimistic as that may seem, sometimes users aren’t honest, particularly in agency environments with very strict rules.


If an agency has a policy against using USB devices, for example, what happens if an employee breaks that rule and introduces the potential for unnecessary risk? From the confines of IT, it is sometimes difficult to assess what might be going on in other sections of the agency, which could pose some problems.


Unearthing the Truth = No More Little White Lies


Keeping everyone honest is essential to maintaining network integrity. The best way to do that is to adopt monitoring solutions and strategies that allow our IT teams to maintain visibility and control over every aspect of our infrastructure, from applications hosted off-site to the mobile devices used over networks.


We should adopt monitoring tools that are comprehensive and encompass the full range of networked entities. These solutions should also be able to provide insight into network activity regardless of whether the infrastructure and applications are on-site or hosted. We must be able to monitor activity at the hosting site and as data passes from the hosting provider to the agency.


After all, a true monitoring solution must monitor and provide a true view of what’s going on within the network. Shouldn’t it offer the ability to probe? To drill down? Those capabilities are essential if we are to truly unearth the root cause of whatever issues we may be trying to address or avert. And with the ability to monitor connections to external sources, we’ll be able to better identify break points when an outage occurs.


Let’s not forget everyone else in the agency. It’s important to keep tabs on network traffic to identify red flags and shine a light on employees who may be using unauthorized applications, again, as a means to keep everyone honest.


Being left in the dark may lead us to rely on half-truths simply because we lack the full picture. Instead of fooling ourselves, we should seek out solutions that provide us with true clarity into our networks, rather than shades of gray. This will result in more effective and secure network operations.


Find the full article on Nextgov.

If you have done any work in enterprise networks, you are likely familiar with the idea of a chassis switch. They have been the de facto standard for campus and data center cores and the standard top tier in a three-tier architecture for quite some time, with the venerable and perennial Cisco 6500 having a role in just about every network that I’ve ever worked on. They’re big and expensive, but they’re also resilient and bulletproof. (I mean this in the figurative and literal sense. I doubt you can get a bullet through most chassis switches cleanly.) That being said, there are some downsides to buying chassis switches that don’t often get discussed. In this post, I’m going to make a case against chassis switching. Not because chassis switching is inherently bad, but because I find that a lot of enterprises just default to the chassis as a core because that’s what they’re used to. To do this, I’m going to look at some of the key benefits touted by chassis switch vendors and discuss how alternative architectures can provide these features, potentially in a more effective way.


High Availability


One of the key selling features of chassis switching is high availability. Within a chassis, every component should be deployed in N+1 redundancy. This means you don’t just buy one fancy and expensive supervisor, you buy two. If you’re really serious, you buy two chassis, because the chassis itself is an unlikely, but potential, single point of failure. The reality is that most chassis switches live up to the hype here. I’ve seen many chassis boxes that have been online for entirely too long without a reboot (patching apparently is overrated). The problem here isn’t a reliability question, but rather a blast area question. What do I mean by blast area? It’s the number of devices that are impacted if the switch has an issue. Chassis boxes tend to be densely populated with many devices either directly connected or dependent upon the operation of that physical device.


What happens when something goes wrong? All hardware eventually fails, so what’s the impact of a big centralized switch completely failing? Or more importantly, what’s the impact if it’s misbehaving, but hasn’t failed completely? (Gray-outs are the worst.) Your blast radius is significant and usually comprises most or all of the environment behind that switch. Redundancy is great, but it usually assumes total failure. Things don’t always fail that cleanly.
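To put a rough number on that idea, here is a minimal sketch that treats blast radius as the fraction of endpoints attached to the device that fails. The device names and endpoint counts are made up for illustration, and it assumes every endpoint is single-homed, which real designs often are not.

```python
def blast_radius(attachments: dict[str, int], failed_device: str) -> float:
    """Return the fraction of all endpoints attached to the failed device.

    Simplifying assumption: endpoints are single-homed, so an endpoint is
    impacted only when the device it plugs into fails or misbehaves.
    """
    total = sum(attachments.values())
    return attachments[failed_device] / total


# Hypothetical environments with the same 400 endpoints.
chassis = {"core-chassis": 400}                  # everything behind one box
fabric = {f"leaf{i}": 40 for i in range(1, 11)}  # spread across ten leaves

chassis_impact = blast_radius(chassis, "core-chassis")
fabric_impact = blast_radius(fabric, "leaf3")

print(f"chassis: {chassis_impact:.0%} of endpoints impacted")
print(f"fabric:  {fabric_impact:.0%} of endpoints impacted")
```

The model deliberately ignores redundant supervisors and dual-homed hosts. Its only purpose is to show how concentrating endpoints on one device concentrates the impact of a bad day.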


So, what’s the alternative? We can learn vicariously from our friends in server infrastructure groups and deploy distributed systems instead of highly centralized ones. Leaf-spine, a derivative of Clos networks, provides a mechanism for creating a distributed switching fabric in which up to half of the switching devices can be offline, with the only impact being reduced redundancy and throughput. I don’t have the space to dive into the details of leaf-spine architectures in this post, but you can check out this Packet Pushers Podcast if you would like a deeper understanding of how they work. A distributed architecture gives you the same level of high availability found in chassis switches but with a much more manageable scalability curve. See the scalability discussion below for more details.
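As a back-of-the-envelope illustration of that failure behavior, the sketch below models one leaf's view of the fabric when some spines are offline. It assumes a simple design that the post does not spell out: every leaf has exactly one uplink to every spine, and traffic is spread evenly across the surviving spines. The function name and numbers are hypothetical.

```python
def leaf_uplink_state(num_spines: int, failed_spines: int, uplink_gbps: float) -> dict:
    """Summarize one leaf's view of the fabric after some spines fail.

    Assumes each leaf has one uplink of `uplink_gbps` to every spine and
    that traffic is balanced evenly across the spines that are still up.
    """
    if not 0 <= failed_spines <= num_spines:
        raise ValueError("failed_spines must be between 0 and num_spines")
    surviving = num_spines - failed_spines
    return {
        "paths_between_leaves": surviving,       # one path per live spine
        "uplink_gbps": surviving * uplink_gbps,  # remaining uplink bandwidth
        "connected": surviving > 0,              # any live spine keeps leaves reachable
        "capacity_fraction": surviving / num_spines,
    }


# Four spines with 100 Gb uplinks, half of the spines offline.
state = leaf_uplink_state(num_spines=4, failed_spines=2, uplink_gbps=100)
print(state)
```

With two of four spines down, every leaf can still reach every other leaf. What is lost is path diversity and half of the uplink bandwidth, which is the "reduced redundancy and throughput" trade-off described above.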




Complexity


Complexity can be measured in many ways. There’s management complexity, technical complexity, operational complexity, etc. Fundamentally though, complexity increases as interaction surfaces are added. Most networking technologies are relatively simple when operated in a bubble (some exceptions do apply), but real complexity starts showing up when those technologies are intermixed and running on top of each other. For example, there are unintended consequences for your routing architecture when your spanning-tree architecture doesn’t act in a coordinated way. This is one of the reasons why systems design has favored virtualization, and now microservices, over large boxes that run many services. Operation and troubleshooting become far more complex when many things are being done on one system.


Networking is no different. Chassis switches are complicated. There are lots of moving pieces and things that need to go right, all residing under a single control plane. The ability to manage many devices under one management plane may feel like reducing complexity, but the reality is that it’s just an exchange of one type of complexity for another. Generally speaking it’s easier to troubleshoot a single purpose device than a multi-purpose device, but operationally it’s easier to manage one or two devices rather than tens or hundreds of devices.




Scalability


You may not know this, but most chassis switches rely on Clos networking techniques for scalability within the chassis. Therefore, it isn’t a stretch to consider moving that same methodology out of the box and into a distributed switching fabric. With the combination of high-speed backplanes/fabrics and multiple line card slots, chassis switches do have a fair amount of flexibility. The challenge is that you have to buy a large enough switch to handle anticipated and unanticipated growth over the life of the switch. For some companies, the life of a chassis switch can be expected to be 7 to 10 years. That’s quite a long time. You either need to be clairvoyant and understand your business needs half a decade or more into the future, or do what most people do: significantly oversize the initial purchase to help ensure that you don’t run out of capacity too quickly.


On the other hand, distributed switching fabrics grow with you. If you need more access ports, you add more leafs. If you need more fabric capacity, you add more spines. There’s also much greater flexibility to adjust to changing capacity trends in the industry. Over the past five years, we’ve been seeing the commoditization of 10Gb, 25Gb, 40Gb, and 100Gb links in the data center. Speeds of 400Gbps are on the not-too-distant horizon, as well. In a chassis switch, you would have had to anticipate this dramatic upswing in individual link speed and purchase a switch that could handle it before the technologies became commonplace.
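The arithmetic behind "add leafs for ports, add spines for capacity" can be sketched in a few lines. The port counts and link speeds below are assumptions chosen for illustration (48 access ports at 10Gb per leaf, one 40Gb uplink from each leaf to each spine), not figures from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Fabric:
    """A simplified leaf-spine fabric; all defaults are illustrative."""
    leaves: int
    spines: int
    access_ports_per_leaf: int = 48  # assumed leaf model
    access_gbps: float = 10.0        # assumed access link speed
    uplink_gbps: float = 40.0        # one uplink from each leaf to each spine

    @property
    def access_ports(self) -> int:
        # Access ports scale with the number of leaves.
        return self.leaves * self.access_ports_per_leaf

    @property
    def oversubscription(self) -> float:
        # Ratio of a leaf's downlink bandwidth to its uplink bandwidth.
        downlink = self.access_ports_per_leaf * self.access_gbps
        uplink = self.spines * self.uplink_gbps
        return downlink / uplink


base = Fabric(leaves=4, spines=2)
more_leaves = Fabric(leaves=6, spines=2)
more_spines = Fabric(leaves=4, spines=4)

for name, f in [("base", base), ("more leaves", more_leaves), ("more spines", more_spines)]:
    print(f"{name}: {f.access_ports} access ports, {f.oversubscription:.1f}:1 oversubscription")
```

In this model, going from four leaves to six adds access ports without changing the per-leaf oversubscription ratio, while doubling the spines halves it. A real design also has to respect the port count on each spine, which caps how many leaves can attach.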




Upgrades


When talking about upgrading, there really are two types of upgrades that need to be addressed: hardware and software. We’re going to focus on software here, though, because we briefly addressed the hardware component above. Going back to our complexity discussion, the operation “under the hood” on chassis switches can often be quite complicated. With so many services so tightly packed into one control plane, upgrading can be a very complicated task. To handle this, switch vendors have created an abstraction for the process and typically offer some form of “In Service Software Upgrade” (ISSU) automation. When it works, it feels miraculous. When it doesn’t, those are bad, bad days. I know few engineers who haven’t had ISSU burn them in one way or another. When everything in your environment is dependent upon one or two control planes always being operational, upgrading becomes a much riskier proposition.


Distributed architectures don’t have this challenge. Since services are distributed across many devices, losing any one device has little impact on the network. Also, since there is only loose coupling between devices in the fabric, not all devices have to run the same software version, as the components of a chassis switch do. This means you can upgrade a small section of your fabric and test the waters for a bit. If it doesn’t work well, roll it back. If it does, distribute the upgrade across the fabric.
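That "upgrade a little, watch, then continue or roll back" workflow can be sketched as a small staged rollout. Everything here is hypothetical: the `upgrade`, `rollback`, and `is_healthy` callbacks stand in for whatever tooling you actually use, and a real rollout would wait and observe the upgraded devices for a while before checking their health.

```python
from typing import Callable


def staged_upgrade(
    devices: list[str],
    canary_count: int,
    upgrade: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Upgrade a few canary devices first; continue only if they stay healthy.

    Returns True if the whole fabric was upgraded, False if the canaries
    were rolled back and the remaining devices were left untouched.
    """
    canaries, rest = devices[:canary_count], devices[canary_count:]
    for device in canaries:
        upgrade(device)
    # In practice you would let the canaries soak before this check.
    if not all(is_healthy(d) for d in canaries):
        for device in canaries:
            rollback(device)
        return False
    for device in rest:
        upgrade(device)
    return True


# Toy in-memory "fabric": device name -> software version (made-up values).
versions = {f"leaf{i}": "1.0" for i in range(1, 5)}

ok = staged_upgrade(
    devices=list(versions),
    canary_count=1,
    upgrade=lambda d: versions.__setitem__(d, "2.0"),
    rollback=lambda d: versions.__setitem__(d, "1.0"),
    is_healthy=lambda d: True,  # pretend the canary looks fine
)
print(ok, versions)
```

The loose coupling is what makes this possible: while the canary runs the new version, the rest of the fabric keeps running the old one.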


Final Thoughts


I want to reiterate that I’m not making the case that chassis switches shouldn’t ever be used. In fact, I could easily write another post pointing out all the challenges inherent in distributed switching fabrics. The point of the post is to hopefully get people thinking about the choices they have when planning, designing, and deploying the networks they run. No single architecture should be the “go-to” architecture. Rather, you should weigh the trade-offs and make the decision that makes the most sense. Some people need chassis switching. Some networks work better in distributed fabrics. You’ll never know which group you belong to unless you consider factors like those above and the things that matter most to you and your organization.
