
Geek Speak


Only a few shopping days remaining before Christmas! I hope you and yours are settling in for a long holiday weekend. We host our families each year, meaning there will be fifteen people here on Sunday. Naturally I ordered the Roast Beast for the main dish.


Anyway, here's a bunch of links I found on the Intertubz that you may find interesting, enjoy!


My Yahoo Account Was Hacked! Now What?

Well, for starters, you could try explaining why you are using Yahoo for email.


Verizon Rethinking The Yahoo Purchase Deal After Breach

If Verizon has even an ounce of intelligence at the highest levels, they will run, not walk, away from Yahoo before the end of the week. I cannot imagine they would still want to go through with this deal.


Over 8,800 WordPress Plugins Have Flaws: Study

Yes, and this is why I make an effort to keep my plugins to a minimum and update them frequently. Well, that was the plan until a failed update crashed my blog for two days. So I got to play WordPress admin last week. Yay.


AWS Launches Managed Services

Amazon takes a step into the future of IT by launching this service. I would expect a lot of this managed service to be automated and built into the cost, just as Microsoft is currently doing with Azure.


Cisco Is Shutting Down Its Cloud

If you are like me you will read this and think "Cisco has a cloud?", followed by "Why?"


DevOps Will Underpin the Fourth Industrial Revolution

Setting aside their inflated sense of self-importance, DevOps proponents now feel that they are about to start an industrial revolution. The author of this article should be shown a history book on technology: DevOps isn't new, it's just a marketing buzzword. And I can only hope that as a result of this "revolution" DevOps will find a way to automate away the use of the term "DevOps", right before they automate themselves out of a job.


Evernote’s new privacy policy raises eyebrows

Because awful privacy violations shouldn't be limited to governments, here comes Evernote to do its part in making the world a little less safe.


My son, having a lengthy discussion about requirements and deliverables.


As government agencies continue their IT modernization initiatives, administrators find themselves in precarious positions when it comes to security. That’s the overall sentiment expressed in the 2016 Federal Cybersecurity Survey [1]. The report found that efforts to build more modern, consolidated, and secure IT environments increase security challenges, but management tools offer a potential antidote to the threats.


Modernization increased IT security challenges


Federal administrators managing the transition from legacy to modernized infrastructure face enormous challenges. The transition creates significant complexity, burdening administrators who must manage old and new systems that differ greatly from one another.


Many noted that consolidation and modernization efforts increase security challenges due to incomplete transitions (48 percent), overly complex enterprise management tools (46 percent), and a lack of familiarity with new systems (44 percent). Other factors included cloud services adoption (35 percent), and increased compliance reporting (31 percent).


However, 20 percent believe the transition toward more modern and consolidated infrastructures ultimately will net more streamlined and secure networks. They said replacing legacy software (55 percent) and equipment (52 percent), the adoption of simplified administration and management systems (42 percent), and having fewer configurations (40 percent) will help secure networks once the arduous transition phase is complete.


Foreign governments tie with internal threats as chief concern


For the first time, respondents said that foreign governments are just as much of a cybersecurity threat as untrained internal workers. In fact, 48 percent called out foreign governments as their top threat, an increase of 10 percentage points over our 2015 survey [2].


That’s not to say that insider threats have been minimized. On the contrary. The number of people who feel insiders pose a major threat is still higher than it was just two years ago.


Investing in the right security tools can help mitigate threats


Patch management software is among the solutions administrators invest in and use to great effect, with 62 percent indicating that their agencies partake in the practice. Of those, 45 percent noted a decrease in the time required to detect a security breach, while 44 percent experienced a decrease in the amount of time it takes them to respond to a breach.


Respondents noted security information and event management (SIEM) solutions as highly effective in combating threats. While only 36 percent stated that their agencies had such tools in place, administrators who use SIEM tools felt significantly more equipped to detect just about any potential threat.


While a majority of respondents still feel their agencies are just as vulnerable to attacks now as a year ago, it is good to see an increase in the number of respondents who feel agencies have become less vulnerable. This is likely because administrators have become highly cognizant of the potential threats and are using the proper solutions to fight them.


The Federal Cybersecurity Summary Report contains more statistics and is available for free. You might empathize with some of the findings and be surprised by others.


Find the full article on Signal.



[1] SolarWinds Federal Cybersecurity Survey Summary Report 2016; posted February 2016.

[2] SolarWinds Federal Cybersecurity Survey Summary Report 2015; posted February 2015.

It was another incredible week for the December Writing Challenge, and I continue to be amazed, touched, and humbled by the responses everyone is sharing. You folks are bringing yourselves to these comments, and we are all enriched because of it. I’ve never been so happy and excited to distribute THWACK points.


Here are some of the responses that caught my eye this past week:


Day 10: Count

Ben Garves (thegreateebzies) was our lead contributor again for this post, waxing philosophic on the very nature of what it is to count.


Meanwhile, joshyaf shared the hopeful and instructive thought:

“Count your blessings. They are numerous.

Count on others. As you are counted on.

Count to ten when you are angered. Your mother was right, it helps.”


Kimberly Deal took the theme in a personal direction, saying, “I just hope, that when all is said and done, that my time here counted for something good.  We've already got enough something not good.  I'd like to be something good, just this once.”


And jamison.jennings decided to bring us back around to the technical, with

“SELECT CONVERT(date,[DateTime])as Date, COUNT(TrapID) as Traps

FROM dbo.Traps

GROUP BY CONVERT(date, [DateTime])

ORDER BY CONVERT(date, [DateTime]) DESC”


Day 11: Trust

Day 11 marked the first (but not last) lead post from Kevin Sparenberg (@KMSigma). If you use our online demo, then you owe a debt of gratitude to Kevin and his team.


Peter Monaghan, CBCP, SCP, ITIL ver.3 voiced an experience common among THWACK denizens: “I have found "trust" to be a misunderstood word in IT, especially monitoring. For example, "I don't trust our monitoring." blah! blah! blah!”


In what has become a daily (and much welcomed) tradition in the challenge, network defender gave us a short poem:

Trust in your knowledge

You know the way,

You'll find the problem

And save the day.


Finally, tomiannelli gave us another detailed diagram – this time of the hierarchy of trust – and followed it up with this personal thought:

“I could not comprehend living life without explicitly trusting others. I thought how fearful a life like that must be. To wait for a person to exhibit most if not all of the attributes shown above before you trust them could take a lot of interaction and time.”


Day 12: Forgive

Head Geek Destiny Bertucci (Dez) returned to the lead poster role, sharing some techniques for dealing with negativity and how it has helped her grow in her life.


mtgilmore1 invoked Robert Frost, with the quote, "Forgive me my nonsense, as I also forgive the nonsense of those that think they talk sense."


miseri was one of several folks who shared an unattributed quote about the personal benefits of forgiveness: “To forgive is to set a prisoner free and discover that the prisoner was you.”


And mldechjr also shared some of their insights and techniques for what many of us said was a very challenging action: “Almost every time I get mad at something or someone it pays to stop, take a breath and realize how many stupid things I have done in my life.  It then becomes easier to forgive any trespass that I was upset about. Especially knowing the forgiveness I have been shown over the years.”


While I normally find three stand-outs per day, I couldn’t NOT include an LOTR quote from one of our THWACK MVPs, rschroeder:

Chapter 8: The Scouring Of The Shire:

‘Very well, Mr. Baggins,’ said the leader, pushing the barrier aside. ‘But don’t forget I’ve arrested you.’

‘I won’t,’ said Frodo. ‘Never. But I may forgive you.'


Day 13: Remember

Many people can guess that managing the Head Geeks is a lot like wrangling five caffeinated toddlers in a room full of puppies. Fewer people know that this job falls to Jenne Barbour (jennebarbour), along with all the other responsibilities entailed in being the senior director of Marketing here at SolarWinds. Despite this, she signed up to share a memory about memories.


In response, imm an included both an unattributed quote: “When you feel like quitting, remember why you started.”

…and their own thought on this: “Remember because if you don't then no else will remember it for you :-)”


Michael Kent shared some extremely apropos words from Stephen Hawking, "Remember to look up at the stars and not down at your feet. Try to make sense of what you see and about what makes the universe exist. Be curious. And however difficult life may seem, there is always something you can do, and succeed at. It matters that you don't just give up."


But jokerfest got the last laugh in with, “I keep forgetting to remember things.”


Day 14: Rest

If you have ever enjoyed any of the SolarWinds videos (whether they were the funny ones, the training, the teaser trailers for a new version of our products, or SolarWinds Lab) then you have enjoyed the work of Rene Lego (legoland) and her incredible team of video wizards. As the lead writer for day 14, Rene opened up about the challenges she faces and her complicated relationship with the concept of “rest”.


silverwolf took the word and turned it into an instructive acronym:

Relax - relax yourself, you don't need to rush all the time.

Enjoy - enjoy the relaxing time, find something you like doing, maybe pick up a puzzle or go grab your favorite book, do you like drawing? Go draw something.

Sleep - make sure you get enough sleep. Lack of sleep is not good for your body or the mind, go catch some zZzZzZzz's your body needs it.

Trust - trust in yourself. Listen for the hints that your mind and body are telling you, Rest is GOOD.


In another unattributed quote, mtgilmore1 shared

“If you get tired, learn to rest, not quit.”


And desr gave us a more telling update to the old cliché, saying:  “No rest for the intelleigent”


Day 15: Change

Diego Fildes Torrijos (jamesd85) returns for his second lead post, commenting on what is at the heart of all the massive change that we see in the world (spoiler: it’s us).


THWACK MVP Radioteacher noted that, “Change is happening whether you adapt to it or not. Do not be left behind.”


And mlotter elaborated on the same idea, saying, “I think change starts with having the right mindset for that change. You can talk about change all day long but what are you truly changing. Ive seen people change the way they do things or the way they communicate but when you investigate further you find they are still doing the same things with the same outcome. They are just arriving at it a different way. “


And network defender  added a thought wrapped in poetry with,

Change is coming

It will never quit,

Learn to adapt

And get over it.


Day 16: Pray

Day 16 gave one of the more “charged” words in the series, and of course the THWACK community did not disappoint with thoughts that were profound, varied, and overall respectful to the beliefs of others while remaining true to themselves.


jamison.jennings opined that, “Prayer is simply conversation with God. In any good relationship there is conversation. You don't say a bunch of fancy words or magic incantations with your friends, you simply speak what's on you heart. Same goes with God, but remember, it's a relationship...the more the conversation, the better the relationship.”


Meanwhile, joshyaf offered two sides of the same coin, saying, “Praying isn't defined entirely as talking to a deity. Sometimes prayer is just reflection on yourself or the situation. I believe in my own prayer to God and it is a necessity in life for me. Though, maybe just a little insight for those that don't. Meditation can be a form of prayer.”


And unclehooch put it succinctly with, “If you only pray when you’re in trouble… You’re in trouble!”


That wraps up week 3 of the challenge. Look for my week 4 wrap up sooner rather than later!

The story so far:


  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
  5. It's Not Always The Network! Or is it? Part 5 -- by John Herbert (jgherbert)
  6. It's Not Always The Network! Or is it? Part 6 -- by Tom Hollingsworth (networkingnerd)


What happens when your website goes down on Black Friday? Here's the seventh installment, by John Herbert (jgherbert).


The View From Above: James, CEO


It's said, somewhat apocryphally, that Black Friday is so called because it's the day when stores sell so much merchandise and make so much money that it finally puts them 'in the black' for the year. In reality, I'm told it stems from the terrible traffic on the day after Thanksgiving which marks the beginning of the Christmas shopping season. Whether it's high traffic or high sales, we are no different from the rest of the industry in that we offer some fantastic deals to our consumer retail customers on Black Friday through our online store. It's a great way for us to clear excess inventory, move less popular items, clear stock of older models prior to a new model launch, and to build brand loyalty with some simple, great deals.


The preparations for Black Friday began back in March as we looked ahead to how we would cope with the usual huge influx of orders both from an IT perspective and in terms of the logistics of shipping so many orders that quickly. We brought in temporary staff for the warehouse and shipping operations to help with the extra load, but within the head office and the IT organization it's always a challenge to keep anything more than a skeleton staff on call and available, just because so many people take the Friday off as a vacation day.


I checked in with Carol, our VP of Consumer Retail, about an hour before the Black Friday deals went live. She confirmed that everything was ready, and the online store update would happen as planned at 8AM. Traffic volumes to the web site were already significantly increased (over three times our usual page rate) as customers checked in to see if the deals were visible yet, but the systems appeared to be handling this without issue and there were no problems being reported. I thanked her and promised to call back just after 8AM for an initial update.


When I called back at about 8:05AM, Carol did not sound happy. "Within a minute of opening up the site, our third party SLA monitoring began alerting that the online store was generating errors some of the time, and for the connections that were successful, the Time To First Byte (how long it takes to get the first response content data back from the web server) is varying wildly." She continued, "It doesn't make sense; we built new servers since last year's sale, we have a load balancer in the path, and we're only seeing about 10% higher traffic than last year, and we had no trouble then." I asked her who she had called, and I was relieved to hear that Amanda had been the first to answer and was pulling in our on-call engineers from her team and others to cover load balancing, storage, network, database, ecommerce software, servers, virtualization, and security. This would be an all-hands-on-deck situation until it was resolved, and time was not on the team's side. Heaven only knows how much money we were losing in sales every minute the site was not working for people.


The View From The Trenches: Amanda (Sr Network Manager)


So much for time off at Thanksgiving! Black Friday began with a panicked call from Carol about problems with the ecommerce website; she said that they had upgraded the servers since last year, so she was convinced that it had to be the network that was overloaded and causing the problems. I did some quick checks in SolarWinds and confirmed that there were no link utilization issues, so it really had to be something else. I told Carol that I would pull together a team to troubleshoot, and I set about waking up engineers across a variety of technical disciplines so we could make sure that everybody was engaged.


I asked the team to gather a status on their respective platforms and report back to the group. The results were not promising:

  • Storage: no alerts
  • Network: no alerts
  • Security: no alerts relating to capacity (e.g. session counts / throughput)
  • Database: no alerts, CPU and memory a little higher than normal but not resource-limited.
  • Load Balancing: No capacity issues showing.
  • Virtualization: All looks nominal.
  • eCommerce: "The software is working fine; it must be the network."


I had also asked for a detailed report on the errors showing up with our SLA measurement tool so we knew what our customers might be seeing. Surprisingly, rather than outright connection failures, the tool reported receiving a mixture of 504 (Gateway Timeout) errors and TCP resets after the request was sent. That information suggested that we should look more closely at the load balancers, as a 504 error occurs when the load balancer can't get a response from the back end servers in a reasonable time period. As for the hung sessions, that was less clear. Perhaps there was packet loss between the load balancer and those servers causing sessions to time out?
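
The triage logic here can be sketched as a tiny classifier. This is an illustrative sketch only: the function name and the 2-second "slow" threshold are invented, but it captures the distinctions the team drew between a load balancer 504, a server-side reset, and a merely slow backend.

```python
# Hypothetical triage helper for synthetic-probe results. The names and the
# 2-second "slow" threshold are invented for illustration.

def classify_probe(status, ttfb_ms, reset):
    """Map one SLA-probe observation to the layer most likely at fault.

    status  -- HTTP status code received, or None if no response arrived
    ttfb_ms -- time to first byte in milliseconds, or None if unknown
    reset   -- True if the connection died with a TCP RST after the request
    """
    if reset:
        # Something actively tore the session down after the request was
        # sent; look at the back end servers, not the client path.
        return "server-side reset"
    if status == 504:
        # The load balancer answered on the backend's behalf: it gave up
        # waiting for a response from the real server.
        return "backend timeout (load balancer 504)"
    if status == 200 and ttfb_ms is not None and ttfb_ms > 2000:
        # The request succeeded, but only after queueing somewhere.
        return "slow backend"
    return "healthy"

print(classify_probe(504, None, False))   # backend timeout (load balancer 504)
print(classify_probe(200, 4500, False))   # slow backend
print(classify_probe(None, None, True))   # server-side reset
```

Splitting the symptoms this way is what let the team treat the 504s and the resets as two separate problems rather than one.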


The load balancer engineers dug into the VIP statistics and were able to confirm that they did indeed see incrementing 504 errors being generated, but they didn't have a root cause yet. They also revealed that of the 10 servers behind the ecommerce VIP, one of them was taking fewer sessions over time than the others, although the peak concurrent session load was roughly the same as the other servers. We ran more tests to the website for ourselves but were only able to see 504 errors, and never a hung/reset session. We decided therefore to focus on the 504 errors that we could replicate. The client-to-VIP communication was evidently working fine because after a short delay, the 504 error was sent to us without any problems, so I asked the engineers to focus on the communication between the load balancer and the servers.


Packet captures of the back end traffic confirmed the strange behavior. Many sessions were establishing without problem, while others worked but with a large time to first byte. Others still got as far as completing the TCP handshake, sending the HTTP request, then would get no response back from the server. We captured again, this time including the client-side communication, and we were able to confirm that these unresponsive sessions were the ones responsible for the 504 error generation. But why were the sessions going dead? Were the responses not getting back for some reason? Packet captures on the server showed that the behavior we had seen was accurate; the server was not responding. I called on the server hardware, virtualization and ecommerce engineers to do a deep dive on their systems to see if they could find a smoking gun.


Meanwhile the load balancer engineers took captures of TCP sessions to the one back end server which had the lower total session count. They were able to confirm that the TCP connection was established ok, the request was sent, then after about 15 seconds the web server would send back a TCP RST and kill the connection. This was different behavior from the other servers, so there were clearly two different problems going on. The ecommerce engineer looked at the logs on the server and commented that their software was reporting trouble connecting to the application tier, and the hypothesis was that when that connection failed, the server would generate a RST. But again, why? Packet captures of the communication to the app tier showed an SSL connection being initiated, then as the client sent its certificate to the server, the connection would die. One of my network engineers, Paul, was the one who figured out what might be going on: "That sounds a bit like something I've seen when you have a problem with BGP route exchange... the TCP connection might come up, then as soon as the routes start being sent, it all breaks. When that happens, it usually means we have an MTU problem in the communication path which is causing the BGP update packets to be dropped."


Sure enough, once we started looking at MTU and comparing the ecommerce servers to one another, we discovered that the problem server had a much larger MTU than all the others. Presumably when it sent the client certificate, it maxed out the packet size which caused it to be dropped. We could figure out why later, but for now, tweaking the MTU to match the other servers resolved that issue and let us focus back on the 504 errors which the other engineers were looking at.
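
The arithmetic behind that failure is easy to spell out. As a rough sketch, assuming plain IPv4 and TCP with no header options (20 bytes of header each) and the illustrative sizes below:

```python
# Rough MTU arithmetic, assuming IPv4 + TCP with no header options:
# each segment carries at most MTU - 20 (IP) - 20 (TCP) bytes of payload.

def max_tcp_payload(mtu):
    return mtu - 20 - 20

path_mtu = 1500       # MTU on the rest of the path (typical Ethernet)
jumbo_mtu = 9000      # hypothetical jumbo-frame MTU on the bad server

cert_record = 4096    # an illustrative ~4 KB record carrying the client cert

# A correctly configured server splits the record into small segments
# (ceiling division via negation):
segments = -(-cert_record // max_tcp_payload(path_mtu))

# The bad server can build one oversized segment instead. With the DF bit
# set, any 1500-byte hop must drop it, and if the ICMP "fragmentation
# needed" message is filtered, the sender never finds out why.
fits_on_path = cert_record <= max_tcp_payload(path_mtu)

print(segments)       # 3 segments from a healthy server
print(fits_on_path)   # False: one jumbo segment cannot cross a 1500-byte hop
```

The numbers are invented, but the shape of the failure matches the story: small handshake packets fit everywhere, and only the first large payload (the certificate) dies.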


Thankfully, the engineers were working well together, and they had jointly come up with a theory. They explained that the web servers ran Apache, and used something called prefork. The idea is that rather than waiting for a connection to come in before forking a process to handle its communication, Apache could create some processes ahead of time and use those for new connections because they'd be ready. The configuration specifies how many processes should be pre-forked (hence the name), the maximum number of processes that could be forked, and how many spare processes to keep over and above the number of active, connected processes. They pointed out that completing a TCP handshake does not mean Apache is ready for the connection, because that's handled by the TCP/IP stack before being handed off to the process. They added that they actually used TCP Offload so that whole process was taking place in the NIC, not even on the server CPU itself.


So what if the session load meant that the Apache forking process could not keep up with the number of sessions coming inbound? TCP/IP would connect regardless, but only those sessions able to find a forked process could continue to be processed. The rest would wait in a queue for a free process, and if one could not be found, the load balancer would decide that the connection was dead and would issue a 504. When they checked the Apache configuration, however, not only was the number of preforked processes low, but the maximum was nowhere near where we would have expected it to be, and the number of 'spare' processes was only set to 5. The end result was that when there was a burst of traffic, we quickly hit the maximum number of processes on the server, so new connections were queued. Some connections got lucky and were attached to a process before timing out; others were not so lucky. The heavier the load, the worse the problem got. When there was a lull in traffic, the server caught up again, but when traffic hit hard once more, it only had 5 processes ready to go, and connections were delayed while waiting for new processes to be forked. I had to shake my head at how they must have figured this out.


Their plan of attack was to increase the max session count and the spare session count on one server at a time. We'd lose a few active sessions, but avoiding those 504 errors would be worth it. They started on the changes, and within 10 minutes we had confirmed that the errors had disappeared.


I reported back to Carol and to James that the issues had been resolved, and when I got off the phone with them, I asked the team to look at two final issues:


  1. Why did we not see any session RST problems when we tested the ecommerce site ourselves; and
  2. Why did PMTUd not automatically fix the MTU problem with the app tier connection?


It took another thirty minutes but finally we had answers. The security engineer had been fairly quiet on the call so far, but he was able to answer the second question. There was a firewall between the web tier and the app tier, and the firewall had an MTU matching the other servers. However, it was also configured not to allow through, nor to generate, the ICMP messages indicating an MTU problem. We had shot ourselves in the foot by blocking the mechanism which would have detected an MTU issue and fixed it! For the RST issue, one of my engineers came up with the answer again. He pointed out that while we were using the VPN to connect to the office, our browsers had to use the web proxy to access the Internet, and thus our ecommerce site (another Security rule!). The proxy made all our sessions appear to come from a single source IP address, and through bad luck if nothing else, the load balancer had chosen one of the 9 working servers, then kept using that same server because it was configured with session persistence (sometimes known as 'sticky' sessions).
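
The sticky-session effect is easy to sketch. This is a toy illustration: the octet-sum "hash" and server names are invented, and real balancers typically keep persistence tables rather than hashing, but the outcome for clients behind one proxy IP is the same.

```python
# Why our own tests never hit the broken server: with source-IP persistence,
# the balancer keys its backend choice on the client address. The "hash"
# below (sum of IPv4 octets) and the server names are invented for
# illustration only.

BACKENDS = [f"web{n:02d}" for n in range(1, 11)]  # the 10 ecommerce servers

def pick_backend(client_ip):
    # Toy deterministic choice: sum the IPv4 octets, mod the pool size.
    octets = [int(part) for part in client_ip.split(".")]
    return BACKENDS[sum(octets) % len(BACKENDS)]

# Ten different home users spread across the whole pool...
home_users = {pick_backend(f"198.51.100.{n}") for n in range(10)}

# ...but every engineer testing through the corporate web proxy shares one
# source IP, so all 100 test requests land on the same single server.
proxy_picks = {pick_backend("203.0.113.7") for _ in range(100)}

print(len(home_users))   # 10 distinct backends for 10 distinct client IPs
print(len(proxy_picks))  # 1: the proxy always "sticks" to one server
```

With only one server out of ten misbehaving, anyone pinned behind the proxy had a nine-in-ten chance of never seeing the RST at all, which is exactly what happened.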


I'm proud to say we managed to get all this done within an hour. Given some of the logical leaps necessary to figure this out, I think the whole team deserves a round of applause. For now though, it's back to turkey leftovers, and a hope that I can enjoy the rest of the day in peace.



>>> Continue to the conclusion of this story in Part 8

The story so far:


  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
  5. It's Not Always The Network! Or is it? Part 5 -- by John Herbert (jgherbert)


Things always crop up when you least expect them, don't they? Here's the sixth installment, by Tom Hollingsworth (networkingnerd).


The View From Above: James, CEO


One of the perks of being CEO is that I get to eat well. This week was no exception, and on Tuesday night I found myself at an amazing French restaurant with the Board of Directors. The subject of our recent database issues came up, and the rest of the Board expressed how impressed they were with the CTO's organization, in particular the technical leadership and collaboration shown by Amanda. It's unusual that they get visibility of an individual in that way, so she has clearly made a big impact. Other IT managers have also approached me and told me how helpful she is; I think she has a great career ahead of her here. As dessert arrived and the topic of conversation moved on, I felt my smartwatch buzzing as a text message came in. I glanced down at my wrist and turned pale at the first lines of the message on the screen:


URGENT! We have a security breach...


I excused myself from the table and made a call to find out more. The news was not good. Apparently, we had been sent a message saying that our customer data had been obtained, and it would be made available on the black market if we didn't pay a pretty large sum of money. It made no sense; we have some of the best security tools out there, and we follow all those compliance programs to the letter. At least, I thought we did. How did this data get out? More to the point, would we be able to avoid paying the ransom? And even if we paid it, would the data be sold anyway? If this gets out, the damage to our reputation alone will cause us to lose new business, and I dread to think how many of our affected customers won't trust us with their data any more. The security team couldn't answer my questions, so I hung up and made another call, this time to Amanda.



The View From The Trenches: Amanda (Sr Network Manager)


I used to flinch every time I picked up phone calls from James. Now I can't help but wonder what problem he wants me to solve next. I must admit that I'm learning a lot more about the IT organization around here and it's making my ship run a lot tighter. We're documenting more quickly and anticipating problems before they happen, and we have the SolarWinds tools to thank for a large portion of that. So I was pretty happy to answer a late evening call from James earlier this week, but this call was different. The moment he started speaking I knew something bad had happened, but I wasn't expecting to hear that our customer data had been stolen and was being ransomed. How far did this go? Did they just take customer data, or have they managed to extract the whole CRM database?


It's one thing to be fighting a board implementing bad ideas, but fighting hackers? This is huge! We're about to be in for a lot of bad press, and James is going to be spending a lot of time apologizing and hoping we don't lose all our customers. James told me that I am part of the Rapid Response Team being set up by Vince, the Head of IT Security, and I have the authority to do whatever I need to do to help them find out how to get this fixed. James says he's willing to pay the ransom if the team is unable to track the breach, but he's worried that unless we find the source, he'll just be asked to pay again a week later. I grabbed my keys and drove to the office.


I had barely sat down at my desk when Vince ran into my office. He was panting as he fell into one of my chairs, and breathlessly explained the problem in more detail. The message from the hacker included an attachment - a 'sample' containing a lot of sensitive customer data, including credit card numbers and social security numbers. The hacker wanted thousands of dollars in exchange for not selling it on the black market, and there was a deadline of just two days. I asked Vince if he had verified the contents of the attachment. He nodded his head slowly. There's no question about it. Somebody has access to our data.


I asked Vince when the last firewall audit happened. Thankfully, Vince said that his team audited the firewalls about once a month to make sure all the rules were accurate. I smiled to myself that we finally had someone in IT who knew the importance of regular checkups. Vince told me that he kept things up to date just in case he had to pull together a PCI audit. I told him to put the firewalls on the back burner and think about how the data could have been exfiltrated. He told me he wasn't sure on that one. I asked if he had any kind of monitoring tool like the ones I used on the network. He told me that he had a Security Information and Event Management (SIEM) tool budgeted for next year. Isn't that always the way? I told him it was time we tried something out to get some data about this breach fast. We only had a couple of days before the hacker's deadline, so we needed to get some idea of what was going on, and quickly.


While the security engineers on the Rapid Response Team continued their own investigations, Vince and I downloaded the SolarWinds Log & Event Manager (LEM) trial and installed it on my SolarWinds server. It only took an hour to get up and running. We pointed it at our servers and other systems and had it start collecting log data. We decided to create some rules for basic things, like best practices, to help us sort through the mountain of data we had just started digesting. Vince and I worked to put in the important stuff, like our business policies about access rights and removable media, as well as telling the system to start looking for any strange file behavior.


As we let the system do its thing for a bit, I asked Vince if the hacker could have emailed the files out of the network. He smiled and told me he didn't think that was possible because they had just finished installing Data Loss Prevention (DLP) systems a couple of months ago. It had caught quite a few people in accounting sending social security numbers in plain text emails, so Vince was sure that anything like that would have been caught quickly. I was impressed; Vince clearly knew what he was doing. He only took over as Head of IT Security about nine months back, and it seems like he has been transforming the team and putting in just the right processes and tools. His theory was that it was some kind of virus sending the data out over a covert channel. Being in networking, I often hear things blamed on the latest virus of the week, so I reserved judgment until we knew more. All we could do now was wait while LEM did its thing, and the other security engineers continued their efforts as well. By this time it was well after midnight, and I put on a large pot of coffee.


When morning came and people started to come into work, we looked at the results from the first run at the data. Vince noted a few systems which needed to be secured to fall completely within PCI compliance rules. There was nothing major found, though; just a couple of little configurations that were missed. As we scrolled down the list though, Vince found a potential smoking gun. LEM had identified a machine in sales that had some kind of unknown trojan. On the same screen, the software offered the option to isolate the machine until it could be fixed. We both agreed that it needed to be done, so we removed the network connectivity for the machine through the LEM interface until we could send a tech down to remove the virus in person. More and more people were coming online now, so perhaps one of those systems would provide another possible cause.


We kept pushing through the data; we were now 18 hours into the two-day deadline. I was looking over the list of things we needed to check on when a new event popped up on the screen. I scrolled up to the top and read through it. A policy violation had occurred in our removable device policy rule. It looked like someone had unplugged a removable USB drive from their computer, and the system was powered off right after that. I checked the ID on the machine: it was one of the sales admins. I asked Vince if they had a way of tracking violations of the USB device policy. He told me that there shouldn't have been any violations as they had set a group policy in AD to prevent USB drives from being usable. I asked him about this machine in particular. Vince knitted his eyebrows together as he thought about the machine. He told me he was sure that it was covered too, but we both decided to walk down and take a look at it anyway.


We booted up the machine, and everything looked fine as it did the usual POST and came up to the Windows login screen. Wait, though; the background for the login screen was wrong. We have a corporate image on our machines with the company logo as the wallpaper. It wasn't popular but it also prevented incidents with more colorful pictures ... like the one I was looking at right now. Wow. Somehow this user had figured out how to change their wallpaper. I wondered what else this could mean. Vince and I spent an hour combing through the system. There were lots of non-standard things we found; lots of changes that shouldn't have been possible with our group policies (including the USB device policy), and the browser history of the user was clean. Not just clean from a perspective of sites visited, but completely cleared. Vince and I started to think that this system's user was someone we wanted to chat with.


I called James and told him we had a couple of possibilities to check out. He asked us to get back to him quickly; he had notified the rest of the Board, and they were pushing to hear that we had a solution as quickly as possible. Vince and I returned to my office and I scanned the SIEM tool for any new events while Vince contacted one of his team to arrange to have the suspect computer removed and re-imaged. Five minutes in, another event popped up. The same suspect system with the group policy had triggered an event for the insertion of a USB drive. I printed out the event, and Vince and I hurried back to the sales office to find out who had turned the computer on. We found the user hard at work, typing away; until, that is, we walked up to his desk. A flurry of mouse clicks later, he was back at his desktop. Vince asked him if he had anything plugged into his computer that wasn't supposed to be there. The user, a young man called Josh, said that he didn't. Vince showed him the event printout showing a USB drive being plugged into the computer, but Josh shook his head and said that he didn't know what that was all about.


Vince wasn't having any of it. He started asking the sales admin all about the unauthorized changes on the machine that violated the group policies in place on the network. The sales admin didn't have an answer. He started looking around and stammering a bit as he tried to explain it. Finally, Vince said that he'd had enough. It was obvious something was going on and he wanted to get to the bottom of it. He told Josh to step away from the computer. Josh stood up and moved to the side, and Vince sat down at the computer, clicking around the system and looking for anything out of place. He glanced at the report from the SolarWinds SIEM tool, which showed that the drive was mounted in a specific folder location and not as a drive. As soon as he started clicking in the folder structure, Josh got visibly nervous. He kept inching closer to the chair and looked like he was about to grab the keyboard. When Vince clicked into the folder structure of the drive, his eyes got wide. Josh's head dropped and he stared resolutely at the carpet.


The post-mortem after that was actually pretty easy. Josh was the hacker who had stolen the information from our database. He had stored a huge amount of customer records on the USB drive and was adding more every day. He must have hit on the idea to ask us to pay for the records as a ransom, and he might have even been planning on selling them even if we paid up, although we'll never know. Vince's team analyzed the hard drive and found the exploits Josh had used to elevate his privileges enough to reverse the group policies that prevented him from reading and copying the customer data. We later found those privilege escalations in the mountain of data the SIEM collected. If we'd only had this kind of visibility before, we might have avoided this whole situation.


James came down to deal with the issue personally. Josh was pretty much frog-marched into a conference room, with James following close behind. The door slammed shut and the ensuing muffled shouting gave me some uncomfortable flashbacks to the day that my predecessor, Paul, was fired. Then Sam from Human Resources arrived with two of our attorneys from Legal in tow, and half an hour later Josh was being escorted from the building. I'm not privy to exactly what the attorneys had Josh sign, but apparently he won't be making any noise about what he did.


From my perspective, I've built a really good relationship with the security team now, and of course, they've asked to keep SolarWinds Log & Event Manager. LEM paid for itself many times over this week, and there's no question that at some point it will help us avoid another crisis. For now though, James told Vince and me to take the rest of the week off. I'm not going to argue; I need some sleep!



>>> Continue reading this story in Part 7


(image courtesy of Marvel)


...I learned from "Doctor Strange"

(This is the last installment of a 4-part series. You can find part 1 here, part 2 here and part 3 here)


Don't confuse a bad habit that works for a good habit

The Ancient One observes that Strange isn't "…motivated by power or the need for acclaim. You simply have a fear of failure." He replies, "I guess that fear is what made me a great doctor." She calls him on this little bit of b.s., saying,


"Fear is what has held you back from true greatness. Arrogance and fear still keep you from learning the simplest and most significant lesson of all."


Strange asks, "Which is?"

The answer? "It's not about you."


After 30 years in IT, I've come to realize that our daily work is full of positive rewards for poor choices. We work long hours, come in early after an overnight change control, check systems on our days off, learn new skills for work on our own time, don't venture too far from a network connection, just in case, and so on. We do this because we are rewarded for giving 110 percent. We’re lionized (at least for a moment) when we manage to bring up the crashed system in record time; we receive bonuses and other incentives for closing the largest number of tickets, and so on.


But that doesn't make any of those behaviors good.


I'm not saying that sometimes putting in longer hours, or more effort, or rushing to help rescue a system or team is a bad thing. But our motivation for doing so – like fear of failure – should be identified for what it is and dealt with honestly.


RTFM before you try running commands

After being firmly warned about the perils of manipulating time, Strange grumps, "Why don't they put the warnings before the spell?" Later, he repeats this sentiment as the villain is hoisted on his own mystical petard.


Often, we find a potential solution and rush pell-mell into implementation without testing. Or, as with code found in the middle of a long forum thread, we don't read to the end and discover that it doesn't really address our issue and, in fact, breaks several other things. Or worse, someone decides to be a smart@ss and tells you the solution is to run rm -fr / as root. If you don't read down to the next post, you may never see the warning that this would erase every file on your system.


This is the reason all IT pros should know the magical incantation, RTFM.


Being flawed doesn't mean you're broken.

Kaecilius, the villain of the movie, points out at one point that Kamar-Taj is filled with broken souls to whom the Ancient One teaches "parlor tricks, and keeps the real power for herself." While the second half of that sentiment is clearly not true, the first half has some merit. Look closely and you can see that each character you meet in the mystical fortress is flawed, either externally (in the case of Master Hamir, who is missing his left hand) or internally (as with Mordo, battling his inner demons). What is interesting is that, while some of the characters succumb to obstacles related to these flaws, none allow themselves to be defined by those flaws.


It is obvious to the point of cliché that none of us are perfect. Nor have any of us had perfect IT training, or career paths, or experiences. But those flaws, deficiencies, and missteps do not invalidate us as people, nor do they disqualify us as credible sources of IT expertise.


Artist Allie Jenson once said,


"I am proud of my flaws and mistakes. They are the building blocks of my strengths and beauty.”


In fact, the Japanese practice of Kintsugi is the art of taking flaws in an object and emphasizing them to create even greater beauty in the piece.


We need to remind ourselves that the ways in which we live with – and sometimes overcome – our flaws are often what makes us special.


The path to mastery is not easy, but simple

Sitting at the feet of the Ancient One, Strange despairs of learning the secrets of the magic she offers. "But even if my fingers were able to do that," he says, "how do I get from here..." (indicating where he's sitting) "...to there?" (pointing to where she sits). She asks, "How did you become a doctor?" He answers, "Study and practice. Years of it."


Over the course of my 30-year career in IT, I’ve had the privilege to work with an astounding number of brilliant minds. These talented engineers and designers have unselfishly passed along hints and secrets on a daily basis. For that, I am sincerely grateful.


Even so, none of what we do comes easily. It requires, as Doctor Strange observed, study and practice, and often years of it to truly develop mastery. And usually in IT, the thing we're trying to master is a moving target, morphing from one form to another as technology continues to evolve at a breakneck pace.


But despite that, the mastery we acquire is rarely as impossible as it feels on that first day when we attempt to write our first line of code, configure our first router, or install our first server.



Even if words aren’t spells, they have power and must be treated with care

In the moments before Strange exposes the secret of the Ancient One's long life, she warns him, "Choose your next words very carefully, Doctor Strange." Not heeding her warning, Strange barrels on. In doing so, he sows the seeds of distrust and anger that ultimately lead to his friend Mordo becoming a lifelong nemesis.


It's important to recognize that nothing that Strange said was wrong. Nor was he wrong in challenging the Ancient One's choices. But doing so publicly, and in anger, and using the words he did, created more problems than he could have ever predicted.


In IT, we place great value on the Truth. In fact, I’ve written about it a lot lately.


But there is a difference between being honest and being insulting; between being assertive and aggressive; between uncovering the truth and exposing faults purely for the sake of diminishing.


It's an undeniable reality that the world has become more crass. Dangerously so, in fact. Not just as IT professionals, but as good-faith participants in humanity, we have the ability and responsibility to change that trend, if we can. It means that even when we understand the pure facts, we, like Doctor Strange, must also choose our words very carefully.


Never doubt, diminish, or dismiss your value or importance

Denying that magic exists, Doctor Strange exclaims, "We are made of matter and nothing more. We're just another tiny, momentary speck in an indifferent universe." This is the point at which the Ancient One opens Strange’s eyes to the infinitude of reality, and asks, "Who are you in this vast multiverse, Mr. Strange?" The question is not meant to diminish Strange, but to point out that there is, in fact, a place and role and opportunity for greatness for every living being.


Walk into the convention hall at Cisco Live!, Microsoft Ignite, VMworld, or CeBIT, and you begin to grasp the sheer scale of the IT community. In doing so, it's easy to believe that nothing we have to say or contribute is new or even meaningful in any way. We fall into the trap of being a technological Ecclesiastes, thinking there's nothing new under the sun.


The truth is that nothing could be further from reality. It is our experiences, and our willingness to share them, that make IT such a vibrant profession and community of individuals. Our struggles provide the motivation for solutions that otherwise would never be imagined. It is the intersection of our humanity with our abilities that creates the compelling stories that inspire the next generation of IT professionals.


Did you find your own lesson when watching the movie? Discuss it with me in the comments below.



It takes a radio signal about 1.28 seconds to get to the Moon (about 239,000 miles away), and about 2.5 seconds for round-trip communication between our secret moon base and Earth. Therefore, this common SQL Server error message, number 833...


SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\SQL\database.mdf] in database [database]. The OS file handle is 0x0000000000000000. The offset of the latest long I/O is: 0x00000000000000


...implies that the round-trip time is over 15 seconds. Using 7.5 seconds as a minimum one-way estimate (we really don't know how long it is taking), we see the underlying SAN disks are over 1,396,500 miles away, or about 5.8 times as far away as the Moon. No, I don't have any idea how they got there, either. But how else to explain this error? For all I know this SAN could be on Mars!
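For the curious, here's the back-of-the-envelope math in Python (tongue firmly in cheek; the speed of light is the only real constant here, and the 7.5-second one-way time is my assumption):

```python
# Radio waves travel at the speed of light: roughly 186,282 miles per second.
SPEED_OF_LIGHT_MI_S = 186_282

# SQL Server error 833 fires when an I/O takes longer than 15 seconds round trip.
# Assume (generously) that half of that is the one-way trip to the disks.
one_way_seconds = 15 / 2  # 7.5 seconds

# If propagation delay were the only factor, the disks would be this far away:
distance_miles = one_way_seconds * SPEED_OF_LIGHT_MI_S
print(f"{distance_miles:,.0f} miles")  # 1,397,115 miles

# For scale, compare with the distance to the Moon:
MOON_DISTANCE_MILES = 239_000
print(f"{distance_miles / MOON_DISTANCE_MILES:.1f}x the Moon")  # 5.8x the Moon
```

Of course, the real cause is never the distance to the disks; it's everything else sitting between SQL Server and them.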


Now, I've seen this error message many times in my career. The traditional answers you find on the internet tell you to look at your queries and try to figure out which ones are causing you the I/O bottleneck. In my experience, this guidance was wrong more than 95% of the time. In fact, this is the type of guidance that usually results in people wanting to just throw hardware at the problem. I've seen that error message appear with little to no workload being run against the instance.


In my experience, the true answer was almost always "shared storage," or "the nature of the beast that is known as a SAN." It turns out that when several servers share the same storage arrays, you can end up a victim of what is commonly called a "noisy neighbor": one workload, on one particular server, causing performance pain for a seemingly unrelated server elsewhere.


What's more frustrating is that sometimes the only hint of the issue is the SQL Server error message. The conventional tools used to monitor the SAN often don't show the problem, as they focus on the overall health of the SAN and not on the health of specific applications, database servers, or the end-user experience.


And just when I thought I had seen it all when it comes to the error message above, along comes something new for me to learn.


Snapshots and Checkpoints

No, they aren't what's new. I've been involved with the virtualization of database servers for more than eight years now, and the concepts of snapshots and checkpoints are not recent revelations for me. I've used them from time to time when building personal VMs for demos, and I've seen them used sparingly in production environments. Why the two names? To avoid confusion, of course. (Too late.)


The concept of a snapshot or checkpoint is simple: create a copy of the virtual machine at a point in time. The reason for wanting this point-in-time copy is simple as well: recovery. You want to be able to quickly put the virtual machine back to the point in time captured by the snapshot or checkpoint. Think of things like upgrades or service packs. Take a snapshot of the virtual machine, apply changes, sign off that everything looks good, and remove the snapshot. Brilliant!


How do they work?

For snapshots in VMware, the documentation is very clear:

When you create a snapshot, the system creates a delta disk file for that snapshot in the datastore and writes any changes to that delta disk.

So, that means the original file(s) used for the virtual machine become read-only, and this new delta file stores all of the changes. I liken this to the similar "copy-on-write" technology in database snapshots inside of SQL Server. In fact, this VMware KB article explains the process in the same way:

The child disk, which is created with a snapshot, is a sparse disk. Sparse disks employ the copy-on-write (COW) mechanism, in which the virtual disk contains no data in places, until copied there by a write.
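To make the copy-on-write idea concrete, here's a toy sketch in Python (my own illustration of the mechanism, not how VMware actually lays out delta files on disk):

```python
class SnapshotDisk:
    """Toy copy-on-write disk. Reads fall through to the read-only base
    image unless a block has been rewritten since the snapshot was taken."""

    def __init__(self, base_blocks):
        self.base = list(base_blocks)  # original disk, frozen at snapshot time
        self.delta = {}                # sparse delta disk: only changed blocks

    def write(self, block_id, data):
        # Writes never touch the base image; they land in the delta file.
        self.delta[block_id] = data

    def read(self, block_id):
        # Changed blocks come from the delta; everything else from the base.
        return self.delta.get(block_id, self.base[block_id])


disk = SnapshotDisk(["A", "B", "C"])  # take a "snapshot" of a 3-block disk
disk.write(1, "B2")                   # guest keeps writing after the snapshot
print(disk.read(1))                   # B2 (served from the sparse delta)
print(disk.read(0))                   # A  (still served from the base)
print(disk.base)                      # ['A', 'B', 'C'] -- base never changes
```

The point to notice is that the base blocks never change after the snapshot; every write lands in the sparse delta, and the delta only grows over time.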

OK, now that we know how they work, let's talk about their performance impact.


Are they bad?

Not at first, no. But just like meat left out overnight, they can become bad, yes. And the reason why should be very clear: the longer you keep them, the more overhead you incur as the delta disk keeps track of all the changes. Snapshots and checkpoints are meant to be temporary, not something you keep around. In fact, VMware suggests that you keep a snapshot for no more than 72 hours due to the performance impact. Here's a brief summary of other items from the "Best practices for virtual machine snapshots in the VMware environment" KB article:


  • Snapshots are not backups, and do not contain all the info needed to restore the VM. If you delete the original disk files, the snapshot is useless.
  • The delta files can grow to be the same size as the original files. Plan accordingly.
  • Up to 32 snapshots are supported (unless you run out of disk), but you are crazy to use more than 2-3 at any one time.
  • Rarely, if ever, should you use a snapshot on high-transaction virtual machines such as email and database servers.
  • Snapshots should only be left unattended for 24-72 hours, and don't feed them after midnight, ever.


OK, I made that last part up. You aren't supposed to feed them, ever; otherwise they become like your in-laws at Christmas and will never leave.


So, snapshots and checkpoints can have an adverse effect on performance! And I found out about it through this Spiceworks thread, and then from other articles on the internet that detailed this very same issue.


So this performance issue wasn't exactly unknown; it was just new to me, since I hadn't come across snapshot-related issues in production or thought to check for them. And, from what I can tell, most people don't have this experience either, hence the head-scratching when we see the effects of snapshots and checkpoints on our database servers.


Do I have one?

I don't know, you'll need to look for yourself. For VMware, you have three methods as detailed in this KB article:


1. Using vSphere

2. Using the virtual machine Snapshot Manager

3. Viewing the virtual machine configuration file

4. Viewing the virtual machine configuration file on the ESX host


Yes, that's four things. I didn't write the KB article. You can read it for yourself. Consider number 4 to be a bonus option or something. Or maybe they meant to combine the last two. Again, I didn't write the article, I'm just pointing you to what it says.


Now, for Hyper-V, we can look at the Hyper-V Manager GUI as well, which is essentially similar to using vSphere. But we could also use the Hyper-V PowerShell cmdlets as listed here. In fact, this little piece of code is all you really need:

PS C:\> Get-VM "vmName" | Get-VMSnapshot

Also worth mentioning here is that Virtualization Manager tracks snapshots as well. You can find information about sprawl and snapshots here.



Snapshots and checkpoints are fine tools for you to use, but when you are done with them you should get rid of them, especially for a database server. Otherwise you can expect to see a lot of disk latency and high CPU as a result. And should you see such things but your server team reports back that everything looks normal, I hope this post will stick in your head enough for you to remember to go looking for any rogue snapshots that may exist.

With Christmas just under two weeks away, most of the corporate world is in what I call "holiday mode," that period of time when work needs to get done but the urgency wanes as everyone is forced to balance work and holiday tasks. Toss in a few snow days that close or delay school, and it's easy to see how work schedules can be hectic well beyond the holiday season.


Of course, that won't stop me from putting together the Actuator each week. So here's a bunch of links I found on the Intertubz that you may find interesting, enjoy!


'Crime as a Service' a Top Cyber Threat for 2017

Just a reminder that things can, and will, get worse before they get better.


Microsoft to offer option of 16 years of Windows Server, SQL Server support through new Premium Assurance offer

Just what we wanted for Christmas, six more years of supporting Windows 2008 and SQL Server 2008!


Six maps that show the anatomy of America’s vast infrastructure

Because I like maps and I think you should too, here's one showing roads, railroads, and even the state of disrepair of our bridges. I'd like to see these same graphs over time, to get a sense of whether we are falling further behind on infrastructure upkeep.


Who needs traditional storage anymore?

Eventually we will reach a point where all the "nerd knobs" are taken away. We won't be tuning hardware. The traditional resource bottlenecks (memory, CPU, disk, network) will be out of our hands.


We’re set to reach 100% renewable energy — and it’s just the beginning

Say what you want about Google, but their efforts in this area are quite admirable.


How I detect fake news - O'Reilly Media

The downside to this is the amount of effort it takes to verify a story. Trusted resources are hard to come by these days. This is especially true when most "news" programming is essentially editorials and opinions. Gone are the days of merely reporting on an event, now we are subjected to an endless (and mindless) spouting of opinions AS facts, leading to mass confusion for everyone.


Using AWS Lambda to call and text you when your servers are down

Hey folks, I just wanted you to know that tools like this already exist. No need to reinvent the wheel here, you know.


A primer on blockchain

In case you were wondering what all the hype was about blockchain, here's an easily digestible infographic.


Total Cost of Ownership (TCO) Calculator

In case you needed to provide some data to your CFO on reasons to migrate to the cloud.


Last week in Orlando I delivered 4 sessions in 3 days at 2 events, but this was by far my favorite:

thwack - 1.jpg

It can be tough to get a good handle on government agencies’ increasingly complex database environments. Today, federal database administrators are in charge of everything ranging from on-premises solutions to cloud or hybrid systems. DBAs are like the central nervous system of the human body -- they are in charge of disseminating information throughout the entire agency.


That’s a big responsibility, and things are not going to get much easier anytime soon. The amount of data will skyrocket, and concerns surrounding security, efficiency and cost will continue. Fortunately, there are a few ways DBAs can reduce headaches and database management complexities.


1. Make sure that everything is on the same page, especially when it comes to application response times.


In order to streamline processes, it’s vitally important to ensure that all databases have a common set of goals, metrics and service-level agreements. Acceptable application response times will vary depending on unique needs.


Work with management to determine appropriate response times, and then implement the solutions that can deliver on that agreement. If applications aren’t responsive, or databases aren’t doing their jobs, then productivity and uptime could be significantly impacted, affecting the delivery of the agency’s mission.
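As a trivial sketch of putting that agreement into code (Python; the 2-second target and the health-check URL are both made up for illustration):

```python
import time
import urllib.request

# The response-time target agreed with management; 2.0 seconds is a placeholder.
SLA_SECONDS = 2.0

def meets_sla(elapsed_seconds, sla_seconds=SLA_SECONDS):
    """Did a measured response time meet the agreed target?"""
    return elapsed_seconds <= sla_seconds

def check_response_time(url, timeout=10):
    """Time a single request and report (elapsed_seconds, met_sla)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # read the body so the full response is timed
    elapsed = time.monotonic() - start
    return elapsed, meets_sla(elapsed)

# Example (hypothetical endpoint):
# elapsed, ok = check_response_time("https://intranet.example.gov/app/health")
```

Running a check like this on a schedule and alerting whenever `meets_sla` comes back false is the simplest possible version of what a monitoring product does for you.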


2. Carefully document your processes and implement log and event management.


To help keep a close eye on all of the data that’s passing through a network and to ensure its security, establish a documentation system. Begin by documenting a consistent set of processes for database backup and restore, data encryption, and the detection of anomalies and potential security threats.


Log and event management tools can send alerts when suspicious activity is spotted in the log data. That way, you’ll be able to respond to threats in a timely manner and automatically kill suspicious applications.
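A stripped-down illustration of that kind of rule matching (Python, with hypothetical patterns; a real log and event management tool ships with far richer rule sets and can take the response actions for you):

```python
import re

# Two example rules: failed root logins, and SSN-shaped strings in log data.
SUSPICIOUS_PATTERNS = [
    re.compile(r"failed login .* root", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a social security number
]

def scan(log_lines):
    """Return only the log lines that match a suspicious pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in SUSPICIOUS_PATTERNS)]

alerts = scan([
    "10:01 failed login attempt for ROOT from 10.0.0.5",
    "10:02 user jdoe opened quarterly-report.pdf",
    "10:03 outbound payload contains 078-05-1120",
])
print(len(alerts))  # 2 -- the failed login and the SSN-like string
```

The value is in running rules like these continuously against every log source, which is exactly what a dedicated tool automates.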


3. Reduce workload costs by planning ahead.


If you are considering moving to the cloud, there are a couple of things to keep in mind. First, carefully map out a strategy and establish guidelines. Be sure to deploy on a certified platform, and plan everything to ensure that the transition is seamless.


Second, consider moving to cloud solutions with lower licensing costs or to open source software, which is often less expensive. Remember that the goal of a DBA is not only to help provide colleagues with better, faster and more secure data access, it’s also to help save the agency money.


4. Keep things in perspective so you don’t go crazy.


No one said database administration was going to be easy. Government data is a tough business, and it’s only going to get tougher.


But, it can also be incredibly rewarding. Think of it: DBAs are the foundation of everything that happens in the agency. They control where information goes, whether critical applications are working properly and, ultimately, how effectively the agency completes its mission.


Yes, a DBA’s role is extremely complex. But making a few simple adjustments can reduce that complexity, ensuring that information keeps pumping and the agency’s vital operations stay healthy.


Find the full article on Government Computer News.

The story so far:


  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
  4. It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)


Easter is upon the team before they know it, and they're being pushed to make a major software change. Here's the fifth installment, by John Herbert (jgherbert).


The View From Above: James (CEO)


Earlier this week we pushed a major new release of our supply chain management (SCM) platform into production internally. The old version simply didn't have the ability to track and manage our inventory flows and vendor orders as efficiently as we wanted, and the consequence has been that we've missed completing a few large orders in their entirety because we were waiting for critical components to be delivered. Despite the importance of this upgrade to our reputation for on-time delivery (not to mention all the other cost savings and cashflow benefits we can achieve by managing our inventory on a near real-time basis), the CTO had been putting it off for months because the IT teams were holding back on giving the OK. Finally, the Board of Directors had had enough of the CTO's pushback. As a group we agreed that there had been plenty of time for testing, and the directive was issued that unless there were documented faults or errors in the system, IT should proceed with the new software deployment within the month.


We chose to deploy the software over the Easter weekend. That's usually a quieter time for our manufacturing facilities, as many of our customers close down for the week leading up to Easter. I heard grumbling from the employees about having to work on Easter, but there's no way around it. The software has to launch, and we have to do whatever we need to do to make that happen, even if that means missing the Easter Bunny.


The deployment appeared to go smoothly, and the CTO was pleased to report to the Board on Monday morning that the supply chain platform had been upgraded successfully over the weekend. He reported that testing had been carried out from every location, and every department had provided personnel to test their top 10 or so most common activities after the upgrade so that we would know immediately if a mission-critical problem had arisen. Thankfully, every test passed with flying colors, and the software upgrade was deemed a success. And so it was, until Tuesday morning when we started seeing some unexplained performance issues, and things seemed to be getting worse as the day progressed.


The CTO reported that he had put together a tiger team to start troubleshooting, and opened an ongoing outage bridge. This had the Board's eyes on it, and he couldn't fail now. I asked him to make sure Amanda was on that team; she has provided some good wins for us recently, and her insight might just make the difference. I certainly hope so.


The View From The Trenches: Amanda (Sr Network Manager)


With big network changes I've always had a rule for myself that just because the change window has finished successfully, it doesn't mean the change was a success, regardless of what testing we might have done. I tend to wait a period of time before officially calling the change a success, all the while crossing my fingers for no big issues to arise. Some might call that paranoia, and perhaps they are right, but it's a technique that has kept me out of trouble over time. This week has provided another case study for why my rule has a place when we make more complex changes.


Obviously I knew about the change over the Easter weekend; I had the pleasure of being in the office watching over the network while the changes took place. SolarWinds NPM made that pretty simple for me; no red means a quiet time, and since there were no specific reports of issues, I really had nothing to do. On Monday the network looked just fine as well (not that anybody was asking), but by Tuesday afternoon it was clear that there were problems with the new software, and the CTO pulled me into a war room where a group of us were tasked with finding the cause of the performance issues being reported with the new application.


There didn't seem to be a very clear pattern to the performance issues, and reports were coming in from across the company. On that basis we agreed to eliminate the wide area network (WAN) from our investigations, except at the common points, e.g. the WAN ingress to our main data center. The server team was convinced it had to be a network performance issue, but when I got them to do some ping tests from the application servers to various components of the application and the data center, responses were coming back in 1 or 2ms. NPM also still showed the network as clean and green, but experience has taught me not to dismiss any potential cause until we can disprove it by finding what the actual problem is, so I shared that information cautiously but left the door open for it to still be a network issue that simply wasn't showing in these tests.
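The kind of quick latency check the server team ran lends itself to a small script. The sketch below is illustrative only: the sample output, the helper name `check_rtt`, and the 2 ms threshold are assumptions, not values from this incident.

```shell
#!/bin/sh
# Parse ping-style output and flag an average RTT that looks out of line.
# The 2 ms threshold is an assumed baseline for a healthy data center LAN.
check_rtt() {
  awk -F'time=' '/time=/ { sub(/ ms.*/, "", $2); sum += $2; n++ }
       END { avg = sum / n
             printf "avg=%.1fms %s\n", avg, (avg > 2 ? "SUSPECT" : "OK") }'
}

# Example with captured output; in practice you would pipe live results,
# e.g.  ping -c 5 app-server | check_rtt
printf 'icmp_seq=0 time=1.2 ms\nicmp_seq=1 time=1.4 ms\n' | check_rtt
```

Running the same check from the application servers toward each dependency makes it easy to spot which paths answer in 1 or 2 ms and which ones deserve a closer look.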


One of the server team members suggested it might be an MTU issue. A good idea, but when we issued some pings with large payloads to match the MTU of the server interface, everything worked fine. MTU was never really a likely cause (if we had MTU issues, you'd have expected the storage to fail early on), but there's no harm in quickly eliminating it, and that's what we were able to do. We double-checked interface counters, looking for drops and errors in case we had missed something in the monitoring, but those were looking clean too. We looked at the storage arrays themselves as a possible cause, but SolarWinds Storage Resource Monitor confirmed that there were no active alerts, no storage objects indicating performance issues like high latency, and no capacity issues, thanks to Mike using the capacity planning tool when he bought this new array!
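For reference, the large-payload test works because the ICMP payload has to account for header overhead: IP (20 bytes) plus ICMP (8 bytes) leaves MTU minus 28 bytes. A minimal sketch, assuming Linux iputils `ping` (where `-M do` sets the don't-fragment bit; other platforms use different flags), with a made-up hostname:

```shell
#!/bin/sh
# Largest non-fragmentable ICMP payload = MTU - 20 (IP) - 8 (ICMP) bytes.
mtu=1500
payload=$(( mtu - 28 ))
echo "payload for MTU $mtu: $payload bytes"

# The actual probe (host name illustrative; -M do forbids fragmentation,
# so an MTU mismatch anywhere on the path makes this ping fail):
echo "ping -c 3 -M do -s $payload app-server.example.com"
```

If a path MTU problem existed, the full-size probe would fail while small pings succeeded, which is exactly the asymmetry the team was ruling out.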


We asked the supply chain software support expert about the software's dependencies. He identified the key ones as the servers the application ran on, the NFS mounts to the storage arrays, and the database servers. We didn't know about the database servers, so we pulled in a database admin and began grilling him. We discovered pretty quickly that he was out of his depth. The new software had required a shift from Microsoft SQL Server to an Oracle database. This was the first Oracle instance the DB team had ever stood up, and while they were very competent at monitoring and administering SQL Server, the admin admitted somewhat sheepishly that he really wasn't that comfortable with Oracle yet and had no idea how to see if it was the cause of our problems. This training and support issue is something we'll need to work on later, but what we needed right then and there was some expertise to help us look into Oracle performance. I was already heading to the SolarWinds website because I remembered that there was a database tool, and I was hopeful that it would do what we needed.


I checked the page for SolarWinds' Database Performance Analyzer (DPA), and it said: Response Time Analysis shows you exactly what needs fixing - whether you are a database expert or not. That sounded perfect given our lack of Oracle expertise, so I downloaded it and began the installation process. It wasn't long before I had DPA monitoring our Oracle database transactions (checking them every second!) and starting to populate data and statistics. Within an hour it became clear what the problem was; DPA identified that the main cause of the performance problems was on database updates, where entire tables were being locked rather than using a more granular lock, like row-level locking. Update queries were forced to wait while the previous query executed and released the lock on the table, and the latency in response was having a knock-on effect on the entire application. We had not noticed this over the weekend because transaction loads were so low outside normal business hours that the problem never raised its head. But why didn't it happen on Monday? On a hunch I dug into NPM and looked at the network throughput for the application servers. As I had suspected, on the Monday after Easter the servers handled about half the traffic that hit them on Tuesday. At a guess, a lot of people took a four-day weekend, and when they returned to work on Tuesday, that tipped the scales on the locking/blocking issue.
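The table-level locking that DPA surfaced can be pictured with a deliberately crude analogy: one exclusive lock guarding the whole "table," so a second writer can only wait. This `flock(1)` sketch is purely illustrative (it has nothing to do with how Oracle actually implements locks) and assumes a Linux system with util-linux installed:

```shell
#!/bin/sh
# One lock file stands in for a whole-table lock. While the first "update"
# holds it, a second writer cannot get in -- the serialization DPA surfaced.
lock=/tmp/table.lock

( flock -x 9; sleep 2 ) 9>"$lock" &   # first UPDATE: takes the table lock
sleep 0.5                             # give it time to acquire

# second UPDATE: non-blocking attempt on the same lock
if flock -n -x 9 9>"$lock"; then
  echo "second update ran concurrently"
else
  echo "second update blocked by table lock"
fi
wait
```

With row-level locking the two writers would each take their own lock and proceed in parallel; with a table lock they serialize, which is exactly why latency climbed as Tuesday's transaction load ramped up.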


While we discussed this discovery, our supply chain software expert had been tapping away on his laptop. "You're not going to believe this," he said. "It turns out we are not the first people to find this problem. The vendor posted a HotFix for the query code about a week after this release came out, but I just checked, and we definitely do not have that HotFix installed. I don't know how we missed it, but we can get it installed overnight while things are quiet, and maybe we'll get lucky." I checked my watch; I couldn't believe it was 7:30 PM already. We really couldn't get much more done that night anyway, so we agreed to meet at 9 AM and monitor the results of the HotFix.

The next morning we met as planned and watched nervously as the load ramped up as each time zone came online. By 1 PM we had hit a peak load exceeding Tuesday's peak, and not a single complaint had come in. SolarWinds DPA now indicated that the blocking issue had been resolved, and there were no other major alerts to deal with. Another bullet dodged, though this one was a little close for comfort. We prepared a presentation for the Board explaining the issues (though we tried not to throw the software expert under the bus for missing the HotFix), and presented a list of lessons learned / actions, which included:


  • Set up a proactive post-change war-room for major changes
  • Monitor results daily for at least one week for changes to key business applications
  • Provide urgent Oracle training for the database team (the accelerated schedule driven by the Board meant this did not happen in time)
  • Configure DPA to monitor our SQL Server installations too


We wanted to add another bullet saying "Don't be bullied by the Board of Directors into doing something when we know we aren't ready yet", but maybe that's a message best left for the Board to mull on for itself. Ok, we aren't perfect, but we can get better each time we make mistakes, so long as we're honest with ourselves about what went wrong.



>>> Continue reading this story in Part 6

I’m a little late, but I wanted to do a quick wrap-up of last week’s challenge.

Day 3: Search

Richard Phillips reminded us that it’s not all about immediate gratification: “In a day when search means typing something into a browser and getting back an answer we need to remember that the search isn't just about getting an answer, it's about learning and gaining data that can be used now and in the future.”


Meanwhile, EBeach waxed philosophical with the quote: “A man travels the world over in search of what he needs and returns home to find it – George A. Moore.”


And tomiannelli added some philosophical thoughts of his own, but closed with:

“Remember that sometimes not getting what you want is a wonderful stroke of luck.”

  ― Dalai Lama XIV


Day 4: Understand

mlotter pointed out that “A true mark of Successful person is the ability to listen to understand instead of listening with the intent to reply.”


Not to be outdone, mtgilmore1 invoked the first man on the moon with "Mystery creates wonder and wonder is the basis of man's desire to understand." – Neil Armstrong


And a THWACK MVP countered one word with another when he said, “To me learning something new requires one of two things acceptance or understanding.” He followed it up with a clip of comedian Michael Jr. as he explains the power of Why.


Michael Jr: Know Your Why - YouTube


Day 5: Accept

desr kept it short and sweet: “Accept who you are and what you are that is all that matters. If you accept yourself your true beauty will shine.”


silverwolf started off with enthusiasm: “Working in IT, Studying in IT,  even waay back in elementary school...or even before that, I've always known I would be involved with IT way or another, with computers...with technology! I Accepted that fact a looong time ago. I mean... it was all so COOL!  It STILL IS! It's an ADVENTURE!” and then brought a series of awesome Star Trek memes.


And in what I hope is a purposeful mis-spelling, wbrown said, “Sometimes we just have to except that everything has an acception”


Day 6: Believe

Several folks started out their entries with some type of definition of the word of the day, but bleggett managed to weave the word into their analysis of the definition of the word (how meta): “Interestingly, the etymology of believe is not quite what I expected. Not unbelievable, but I believe it's credible.”

Online Etymology Dictionary


Another commenter also started off with a definition, but with a bit more detail: “Belief is objective. Unfortunately, today belief is too often accepted as subjective. By quick definition:

An objective perspective is one that is not influenced by emotions, opinions, or personal feelings - it is a perspective based in fact, in things quantifiable and measurable. A subjective perspective is one open to greater interpretation based on personal feeling, emotion, aesthetics, etc.”


sparda963 reflected on how belief informs his work: “I am not much of a believer honestly. I don't take it on faith that something is going to work unless I know it is going to work. Far to many people who I have worked with in IT over the years believe that just because they did something that it will work exactly as their believe it will. This often does not turn out to be the case.”


Day 7: Choose

Day 7 is notable because it marked the first post by another member of the SolarWinds staff – in this case Head Geek Destiny Bertucci. In response, Richard Phillips highlighted one of the biggest career divides that IT pros encounter: “Ahh, Decisions, Decisions. I've heard lots of complaints over the years "That boss couldn't do my job half as good as I do and he makes so much more money" But what you see in successful people and those that "climb the ladder." is their ability to make decisions, quickly and decisively. They own the fact that they won't always be the best or sometimes even the correct decision, but they own it and own the results. That's what keeps them moving and that's why people follow them.”


prowessa chose to use their entry to express their support: “I choose to go the write path and be a better Thwackster.”


And Michael Kent quoted one of the greatest sysadmins in history (if you judge by the length of his beard), Albus Dumbledore, who said “Dark times lie ahead of us, and there will be a time when we must choose between what is easy and what is right”


Day 8: Hear

Radioteacher related the word to his experience on the set of THWACKcamp this year: “I am in the white first year ThwackCamp shirt below. When sqlrockstar spoke on stage right I would stare at the back of Patrick's head knowing that from the cameras point of view it would look like I was looking at sqlrockstar. In some ways that it made it easier to focus on his voice, hearing what was said and reacting.”


Meanwhile, nedescon reflected on how hearing loss has affected their perception of the value of receiving information: “I think the most important thing to take away is there is a lot of noise that will get in the way if you let it. I truly am the only thing that I have any power or influence over. In other words, I can give my attention, but I can't always get another's.”


And another commenter succinctly pointed out the difference between hearing and listening: “hearing is one of the 5 senses of the human body...there is no skill involved. Listening is a skill and an art, and requires focus and commitment...and separates managers from leaders.”


Day 9: Observe

SolarWinds product marketing specialist Diego Fildes Torrijos submitted the lead essay for day 9, analyzing the way we observe the world around us (and how often we choose not to).


Another commenter pointed out how the power of observation can serve us both personally and as monitoring enthusiasts: “The response to the "It's the [insert scapegoat here...but we know they are going to say 'network']" is a tool like Orion that allows you to observe your environment in near real time, coorelate data points and makee a logic deduction. Observer the world around you and make wise decisions. Observe the environment you are monitoring and make informed decisions.”


Then jamison.jennings offered a suggestion for the next SolarWinds product line: “In the future, when AI becomes more of a regular integral part of IT, SolarWinds should reserve OBSERVE for a name for their AI module. Observe will be the watcher for changes that happen to nodes on the network and learn and adapt to the ever changing network. New interface turned up over night during a maintenance window, Observe automagically starts monitoring. Drives removed and new ones added, no problem. Observe sees those "down" volumes and will go and look for new ones and add them in without relying on a scheduled network discovery.”


And Steve Hawkins used a quote by Andrew Carnegie to elaborate on how observation can inform our understanding: "As I grow older, I pay less attention to what men say. I just watch what they do." Sometimes the most important statement people make is how they react to a particular situation.


Keep the amazing comments coming (they’re worth 200 THWACK points each day) and tune in every day for the next essay challenge. Thank you to all the amazing contributors!

Starting Thursday, I'll be in Israel to meet some customers, attempt to eat my body weight in kosher shwarma, and speak at DevOpsDays Tel Aviv.


Since I'll be tweeting about it (@LeonAdato and @DevOpsDaysTLV) incessantly, I figured I would give you all a heads-up and let you know what I hope to achieve and hope to learn. You know, besides how much shwarma I can eat before it kills me. But what a way to go!


First, very much like my time at DevOpsDays Ohio, I hope to continue to have conversations about monitoring in a world of "cattle, not pets."


Second, I am looking forward to soaking up as much knowledge as I can as our industry continues the shift from on-premises to cloud. Seeing how companies big and small are adapting to the new reality of computing is both exciting to me as a veteran of IT and a source of great insight for where monitoring may be going in the future.


Finally, I am eager to see how the flavor of DevOps changes outside of the United States. You see, even within the U.S., there are nuances. In Austin, the crowd was almost entirely developers-who-do-ops. But in Ohio, it was 70% operations folks who were coming to grips with how they've also become developers. So I expect the event in Tel Aviv is going to teach me some more about this amazing, vibrant, and diverse community.


More to come on this after the event next week!

The story so far:


  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
  3. It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)


The holidays are approaching, but that doesn't mean a break for the network team. Here's the fourth installment of the story, by Tom Hollingsworth (networkingnerd).


The View From Above: James (CEO)


I'm really starting to see a turnaround in IT. Ever since I put Amanda in charge of the network, I'm seeing faster responses to issues and happier people internally. Things aren't being put on the back burner until we yell loud enough to get them resolved. I just wish we could get the rest of the organization to understand that.


Just today, I got a call from someone claiming that the network was running slow again when they tried to access one of their applications. I'm starting to think that "the network is slow" is just code to get my attention after the unfortunate situation with Paul. I decided to try and do a little investigation of my own. I asked this app owner if this had always been a problem. It turns out that it started a week ago. I really don't want to push this off on Amanda, but a couple of my senior IT managers are on vacation and I don't have anyone else I can trust. But I know she's going to get to the bottom of it.



The View From The Trenches: Amanda (Sr Network Manager)


Well, that should have been expected. At least James was calm and polite. He even told me that he'd asked some questions about the problem and got some information for me. I might just make a good tech out of the CEO after all!


James told me that he needed my help because some of the other guys had vacation time they had to use. I know that we're on a strict change freeze right now, so I'm not sure who's getting adventurous. I hope I don't have to yell at someone else's junior admin. I decided I needed to do some work to get to the bottom of this. The app in question should be pretty responsive. I figured I'd start with the most basic of troubleshooting - a simple ping. Here's what I found out:


icmp_seq=0 time=359.377 ms
icmp_seq=1 time=255.485 ms
icmp_seq=2 time=256.968 ms
icmp_seq=3 time=253.409 ms
icmp_seq=4 time=254.238 ms


Those are terrible response times! It's like the server is on the other side of the world. I pinged other routers and devices inside the network to make sure the response times were within reason. A quick check of other servers confirmed that response times were in the single digits, not even close to the bad app. With response times that high, I was almost certain that something was wrong. Time to make a phone call.


Brett answered when I called to the server team. I remember we brought him on board about three months ago. He's a bit green, but I was told he's a quick learner. I hope someone taught him how to troubleshoot slow servers. Our conversation started off as well as expected. I told him what I found and that the ping time was abnormal. He said he'd check on it and call me back. I decided to go to lunch and then check in on him when I got finished. That should give him enough time to get a diagnosis. After all, it's not like the whole network was down this time, right?


I got back from lunch and checked in on Brett The New Guy. When I walked in, he was massaging his temples behind a row of monitors. When I asked what was up, he sighed heavily and replied, "I don't know for sure. I've been trying to get into the server ever since you called. I can communicate with vCenter, but trying to console into the server takes forever. It just keeps timing out."


I told Brett that the high ping time probably means that the session setup is taking forever. Any lost packets just make the problem worse. I started talking through things at Brett's desk. Could it be something simple? What about the other virtual machines on that host? Are they all having the same problem?


Brett shrugged his shoulders. His response, "I'm not sure? How do I find out where they are?"


I stepped around to his side of the desk and found a veritable mess. Due to the way the VM clusters were set up, there was no way of immediately telling which physical host contained which machines. They were just haphazardly thrown into resource pools named after comic book characters. It looked like this app server belonged to "XMansion," but there were a lot of other servers under "AsteroidM." I rolled my eyes at the fact that my network team had strict guidelines about naming things so we could find them at a glance, yet the server team could get away with this. I reminded myself that Brett wasn't to blame and kept digging.


It took us nearly an hour before we even found the server. In El Paso, TX. I didn't even know we had an office in El Paso. Brett was able to get his management client to connect to the server in El Paso and saw that it contained exactly one VM - The Problem App Server. We looked at what was going on and figured that it would work better if we moved it back to the home office where it belonged. I called James to let him know we fixed the problem and that he should check with the department head. James told me to close the ticket in the system since the problem was fixed.


I hung up Brett's phone. Brett spun his chair back to his wall of monitors and put a pair of headphones on his head. I could hear some electronic music blaring away at high volume. I tapped Brett on the shoulder and told him, "We're not done yet. We need to find out why that server was halfway across the country."


Brett stopped his music and we dug into the problem. I told Brett to take lots of notes along the way. As we unwound the issues, I could see the haphazard documentation and architecture of the server farm was going to be a bigger problem to solve down the road. This was just the one thing that pointed it all out to us.


So, how does a wayward VM wind up in the middle of Texas? It turns out that the app was one of the first ones ever virtualized. It had been running on an old server that was part of a resource pool called "SavageLand". That pool only had two members: the home server for the app and the other member of the high availability pair. That HA partner used to be here in the HQ, but when the satellite office in El Paso was opened, someone decided to send the HA server down there to get things up and running. Servers had been upgraded and moved around since then, but no one documented what had happened. The VMs just kept running. When something would happen to a physical server, HA allowed the machines to move and keep working.


The logs showed that last week, the home server for the app had a power failure. It rebooted about ten minutes later. HA decided to send the app server to the other HA partner in El Paso. The high latency was being caused by a traffic trombone. The network traffic was going to El Paso, but the resources the server needed to access were back here at HQ. So the server had to send traffic over the link between the two offices, listen for the response, and then send it back over the link. Traffic kept bouncing back and forth between the two offices, which saturated the link. I was shocked that the link was even fast enough to support the failover traffic; according to Brett's training manuals, it barely met the minimum. We were both amused that the act of failing the server over to its backup caused more problems than just waiting for the old server to come back up.
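A back-of-envelope calculation shows why the trombone hurt so much. All three numbers below are assumptions for illustration (the story doesn't give the real RTTs or call counts), but the shape of the math holds:

```shell
#!/bin/sh
# Rough cost of a traffic trombone: every backend call the app makes now
# pays a WAN round trip instead of a LAN one. All values are assumptions.
lan_rtt_ms=1      # HQ server to HQ resources
wan_rtt_ms=60     # assumed HQ <-> El Paso round trip
calls=20          # assumed backend calls per user transaction

echo "per-transaction latency at HQ:     $(( calls * lan_rtt_ms )) ms"
echo "per-transaction latency tromboned: $(( calls * wan_rtt_ms )) ms"
```

Every chatty backend exchange multiplies the WAN round trip, so even a modest link turns a transaction that used to take tens of milliseconds into one that takes over a second, before the link even starts to saturate.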


Brett didn't know enough about the environment to know all of this. And he didn't know how to find the answers. I made a mental note to talk to James about this at the next department meeting after everyone was back from vacation. I hoped they had some kind of documentation for that whole mess. Because if they didn't, I was pretty sure I knew where I could find something to help them out.



>>> Continue reading this story in Part 5

Wow, can you believe it? 2016 is almost over, the holidays are here, and I didn’t even get you anything! It’s been a bit of a wild rollercoaster of a year through consolidation, commoditization, and collaboration!


I’m sure you have your own favorite trends and notable events from 2016. Here are a few that have been recurring themes throughout the year.



  • Companies going private, such as SolarWinds (closed in February) and Dell EMC (closed in September)
  • Companies buying other companies and consolidating the industry, like Avago buying Broadcom (closed Q1), Brocade buying Ruckus (closed Q3), and Broadcom buying Brocade (initiated in October)
  • Companies divesting assets, like Dell selling off SonicWall and Quest, and Broadcom selling off Brocade’s IP division



All right, that’s at least a small snapshot of the rollercoaster. Only time will tell what impact those decisions will have on practitioners like you and me (I promise some of them will be GREAT and some, not so much!).


But what else, what else?! Some items I’ve very recently discussed include.



The net-net of all three is that we will continue to see better technology, with deeper investment and ultimately (potentially) lower costs!


On the subject of flash, though: if you haven’t been tracking it, density profiles have been insane this year alone, and that trend is only continuing with further adoption and better price economics from technology like NVMe. I particularly love this image because it reflects the shrinking footprint of the data center alongside our inevitable need for more.


[Image: Moore's Law of Storage]



This is hardly everything that happened in 2016, but these items are particularly close to my heart and, respectively, my infrastructure. I will offer hearty congratulations on this being the 16th official “Year of VDI,” a title we continue to grant even as it continues to fall short of its promises.


Though with 2016 closing quickly on our heels, there are a few areas you’ll want to watch in 2017!


  • Look for flash storage to get even cheaper, and even denser
  • Look for even more competition in the cloud space from Microsoft Azure, Amazon AWS, and Google GCP
  • Look for containers to become something you MIGHT actually use on a regular basis, and more rationally than the very obscure use cases promoted within organizations
  • Look for vendors to provide more of their applications and objects as containers (EMC did this with ESRS, their Secure Remote Support offering)
  • Obviously 2017 WILL be the Year of VDI… so be sure to bake a cake
  • And strangely, with the exception of pricing economics driving adoption of 10GbE+ and Wave 2 wireless, we’ll see a lot more of the same as we saw this year, maybe even some retraction in hardware innovation
  • Oh, and don’t forget: more automation, more DevOps, more “better, easier, smarter”


But enough about me and my predictions: what were some of your favorite and notable trends of 2016, and what are you looking forward to seeing in 2017?


And in case I don’t get the chance: Happy Holidays and a Happy New Year to y’all!

After the network perimeter is locked down, servers are patched, and password policies are enforced, end-users themselves are the first line of defense in IT security. They are often the target of a variety of attack vectors, making them the first step of triage when a security incident is suspected. Security awareness training, which should be part of any serious IT security program, should be based in common sense, but what security professionals consider common sense isn’t necessarily common sense for the average end-user.


In order to solve this problem and get everyone on the same page, end-users need the awareness, knowledge, and tools to recognize and prevent security threats from turning into security breaches. To that end, a good security awareness program should be guided by these three basic principles:


First, security awareness is a matter of culture.


Security awareness training should seek to change or create a culture of awareness in an organization. This means different things to different security professionals, but the basic idea is that everyone in the organization should have a common notion of what good security looks like. This doesn’t mean that end-users know how to spot suspicious malformed packets coming into a firewall, but it does mean that it’s part of company culture to be suspicious of email messages from unknown sources or even from known sources but with unusual text.


The concerns of the organization’s security professionals need to become part of the organization's culture. This isn’t a technical endeavor but a desire to create a heightened awareness of security concerns among end-users. They don’t need to know about multi-tenant data segmentation or versions of PHP, but they should have an underlying concern for a secure environment. This is definitely somewhat ambiguous and subjective, but this is awareness.


Second, security awareness training should empower end-users with knowledge.


After a culture of security awareness has been established, end-users need to know what to actually look for. A solid security awareness program will train end-users on what current attacks look like and what to do when facing one. This may be done simply with weekly email newsletters or required quarterly training sessions.


End-users need to actually learn why it’s not good to plug a USB stick found in the parking lot into their computer, and users need to get a good feel for what phishing emails look like. They should know that they can hover over a suspicious link and sometimes see the actual hidden URL, and they should know that even that can be faked.
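That hover-over-the-link habit can be approximated in a toy script: compare the visible link text with the href it actually points to. The sample message and regex below are illustrative only; real HTML needs a real parser, and, as noted above, even the displayed URL can be faked.

```shell
#!/bin/sh
# Toy check for a classic phishing tell: link text that looks like one URL
# while the href points somewhere else. The sample email body is made up.
cat <<'EOF' > /tmp/mail.html
Click <a href="http://evil.example/login">http://bank.example</a> to verify.
EOF

grep -oE '<a href="[^"]+">[^<]+</a>' /tmp/mail.html |
sed -E 's#<a href="([^"]+)">([^<]+)</a>#\1 \2#' |
while read -r href text; do
  if [ "$href" != "$text" ]; then
    echo "MISMATCH: shows '$text' but goes to '$href'"
  fi
done
```

A check like this is the scripted version of the instinct awareness training tries to build: what a link says and where it goes are two different things.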


Ultimately, they need to know what threats look like. The culture of awareness makes them concerned, and knowledge gives them the ability to identify actual problems in the real world.


Third, security awareness training is concerned with changing behavior.


The whole point here is that end-users take action when there is suspicion of malicious activity. Security awareness training is useless if no one takes action and actually acts like the first line of defense they really are (or can be).


A good security awareness program starts with culture, empowers end-users with knowledge, and seeks to change behavior. This means making significant effort to provide end-users with clear directions for what to do when encountering a suspected security incident. Telling users to simply “create a ticket with the helpdesk” is just not enough. End-users need clear direction as to what they can actually do in the moment when they are dealing with an issue. This is where the whole “first line of defense” becomes a reality and not just a corporate platitude.


For example, what should end-users actually do (or not do) when they receive a suspected phishing email? The directions don’t need to be complicated, but they need to exist and be communicated clearly and regularly to the entire organization.


Security awareness training is the most cost-effective part of a security program, in that it doesn’t require purchasing millions of dollars’ worth of appliances and software licenses. There is a significant time investment, but the return on that investment is huge if done properly. A strong security awareness training program needs to be based in common sense, change culture, empower end-users with knowledge, and change behavior.
