
Lessons Learned From the AWS Outage


Last week, Amazon Web Services' S3 storage in the US-EAST-1 region went offline for a few hours. Since then, AWS has published a summary review of what happened. I applaud AWS for their transparency, and I know they will use this incident as an opportunity to make things better going forward. Take a few minutes to read the review and then come back here. I'll wait.

Okay, so it's been a few days since the outage. We've all had some time to reflect on what happened. And some of us have decided that now is the time to put on our Hindsight Glasses and run down a list of lingering questions and comments regarding the outage.

Let's break this down!

"...we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."

This, to me, is the most inexcusable part of the outage. Anyone who does business continuity planning will tell you that playbooks like this need to be checked at least annually. You cannot just wave that away with, "Hey, we've grown a lot in the past four years, so the playbook is out of date." Nope. Not acceptable.

"The servers that were inadvertently removed supported two other S3 subsystems."

The engineers were working on the billing system, and they had no idea that removing those servers would also take out two key S3 subsystems (the index and placement subsystems). Which raises the question, "Why are those systems related?" Great question! This reminds me of the age-old debate regarding dedicated versus shared application servers. Shared servers sound great until one person needs a reboot, right? No wonder everyone is clamoring for containers these days. Another few years and mainframes will be under our desks.

"Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended."

But the command was accepted as valid input, which means the tool had no check to make certain that the request was actually sane. This is the EXACT scenario that led Jeffrey Snover to add the -WhatIf and -Confirm parameters to PowerShell. I'm a coding hack, and even I know the value of sanitizing your inputs. This isn't just about preventing SQL injection; it's also about making certain that, as a cloud provider, you don't delete a large number, or percentage, of servers by accident.
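To make that concrete, here is a minimal sketch, in Python with purely hypothetical names (the actual AWS tooling isn't public), of what a -WhatIf/-Confirm style guard on a capacity-removal command might look like:

```python
# A minimal sketch of input validation plus dry-run/confirm guards for a
# capacity-removal command. Every name here is a hypothetical illustration,
# not the actual AWS tooling.

def remove_capacity(fleet, count, what_if=True, confirm=False):
    """Remove `count` servers from `fleet`, refusing obviously bad input."""
    if not isinstance(count, int) or count <= 0:
        raise ValueError("count must be a positive integer")
    if count > len(fleet):
        raise ValueError(f"requested {count} servers but the fleet only has {len(fleet)}")

    victims = fleet[:count]

    # -WhatIf analogue: report what would happen, change nothing.
    if what_if:
        print(f"WhatIf: would remove {count} of {len(fleet)} servers: {victims}")
        return []

    # -Confirm analogue: require an explicit yes before doing anything destructive.
    if not confirm:
        answer = input(f"Remove {count} of {len(fleet)} servers? [y/N] ")
        if answer.strip().lower() != "y":
            print("Aborted.")
            return []

    return victims  # the caller actually decommissions these
```

The point of the pattern: nothing destructive happens until someone explicitly turns off the dry run and confirms, so a fat-fingered input fails loudly instead of quietly taking out a chunk of the fleet.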

"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly."

So, apparently no one ever asked "What if?" along with "Why?" Those are my favorite questions to ask when designing, building, or modifying systems. The 5 Whys is a great tool for finding root cause, and asking "what if" helps you build better systems that avoid the need for root cause reviews in the first place.
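A "what if" guardrail for this particular tool could be as simple as a capacity floor plus a per-run limit. A rough sketch, again with hypothetical names and thresholds rather than AWS's actual values:

```python
# A rough sketch of "what if" guardrails for the same removal tool: never take
# a subsystem below a minimum capacity floor, and never drain it too fast.
# The thresholds and names are illustrative assumptions, not AWS's values.

MIN_CAPACITY_FRACTION = 0.85   # keep at least 85% of the current fleet
MAX_REMOVAL_FRACTION = 0.05    # remove at most 5% of the fleet per invocation

def check_removal(fleet_size, count):
    """Raise before anything is removed if either safety rule would be violated."""
    per_run_limit = int(fleet_size * MAX_REMOVAL_FRACTION)
    if count > per_run_limit:
        raise RuntimeError(
            f"removing {count} servers exceeds the per-run limit of {per_run_limit}"
        )
    remaining = fleet_size - count
    if remaining < fleet_size * MIN_CAPACITY_FRACTION:
        raise RuntimeError(
            f"removal would leave {remaining} servers, "
            f"below the {MIN_CAPACITY_FRACTION:.0%} capacity floor"
        )
```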

"We will also make changes to improve the recovery time of key S3 subsystems."

Why wasn't this a thing already? I cannot understand how AWS got to the point where high availability and fast recovery weren't already built into these key systems. My only guess is that building such systems costs more, and AWS isn't interested in things costing more. In the race to the bottom, corners get cut, and you get an outage every now and then.

"...we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3."

The AWS dashboard for the East Region was dependent upon the East Region being online. Just let that sink in for a bit. Hey, AWS, let me know if you need help with monitoring and alerting. We'd be happy to help you get the job done.
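The fix for this class of problem is to make sure the monitoring and status path doesn't share a failure domain with the thing it monitors. A minimal sketch, assuming boto3 and a hypothetical bucket name, of a probe that runs (and reports) from outside us-east-1:

```python
# A minimal sketch of an out-of-region health probe: the monitor runs (and
# reports) from outside us-east-1, so it keeps working when us-east-1 doesn't.
# The bucket name is a hypothetical placeholder.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def s3_us_east_1_is_healthy(probe_bucket="my-health-probe-bucket"):
    s3 = boto3.client("s3", region_name="us-east-1")
    try:
        s3.head_bucket(Bucket=probe_bucket)  # lightweight "is the API answering?" call
        return True
    except (ClientError, EndpointConnectionError):
        return False

if __name__ == "__main__":
    # Publish this result through a channel hosted *outside* us-east-1
    # (a static page in another region, a third-party status service, etc.).
    print("S3 us-east-1 healthy:", s3_us_east_1_is_healthy())
```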

"Other AWS services in the US-EAST-1 Region that rely on S3 for storage...were also impacted while the S3 APIs were unavailable."

Many companies that rely on AWS being up and running were offline. My favorite example: the popular website Is It Down Right Now? was itself down as a result of the outage. If you migrate your apps to the cloud, you need to take responsibility for availability. Otherwise, you run the risk of being down with no way to get back up.
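Taking responsibility can be as simple as having somewhere else to read from when the primary region's S3 API is down. A minimal sketch, assuming boto3 and that cross-region replication keeps a replica bucket in sync (both bucket names are hypothetical):

```python
# A minimal sketch of a cross-region fallback read. Assumes cross-region
# replication keeps the replica bucket in sync; both bucket names are
# hypothetical placeholders.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

SOURCES = [
    ("us-east-1", "myapp-assets"),          # primary
    ("us-west-2", "myapp-assets-replica"),  # replica in another region
]

def fetch_object(key):
    """Return the object's bytes from the first region that answers, else None."""
    for region, bucket in SOURCES:
        s3 = boto3.client("s3", region_name=region)
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (ClientError, EndpointConnectionError):
            continue  # primary is down or erroring; try the next region
    return None  # caller decides how to degrade (cached copy, placeholder, etc.)
```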

Look, things happen. Stuff breaks all the time. The reason this was such a major event is that AWS has done amazing work in becoming the largest cloud provider on the planet. I'm not here to bury AWS; I'm here to highlight the key points and takeaways from the incident to help you make things better in your shop. Because if AWS, with all of its brainpower and resources, can still have these flaws, chances are your shop has a few, too.

24 Comments
tallyrich
Level 15

Yes, lessons to be learned for all of us. I for one appreciate Amazon's willingness to admit and even document these mistakes. Most of the larger companies would have just beheaded the offending employee and called it a day. (Not literally of course)

There have been a couple of projects we have tried to move to "the cloud." We have brought those systems back on-site due to issues such as the above, as well as the impact on availability from things such as DDoS attacks.

shuckyshark
Level 13

One great lesson learned - I was right in not trusting the cloud to provide services.

sqlrockstar
Level 17

Agreed, I love how transparent they were with everyone. And I don't want to pile on them for this; I know that mistakes happen. I wanted to use it as a learning exercise for everyone.

sqlrockstar
Level 17

Yep, when you go to the cloud, you can't always just "lift and shift." Most of the time you need to rethink your architecture. Things like connectivity cannot be taken for granted, and require specific coding.

sqlrockstar
Level 17

I'm not sure I fully agree with that sentiment. Both AWS and Azure provide solid service. But interruptions are bound to happen from time to time. If you go cloud, you need to consider this fact. If you go cloud and expect it to always be there, you will be disappointed at times.

vinay.by
Level 16

Nice

ecklerwr1
Level 19

This reminds me of the early days of virtualizing our enterprise, when they virtualized all of the domain controllers... then when they had a storage "issue" and the datastore went away, so did ALL of the domain controllers. Then no one could log into anything to try and get things restarted. For this reason most of us keep a real physical DC around... just in case!

sparda963
Level 12

I think a few companies probably would have literally done this, in front of everyone else to "show them a lesson".

gfsutherland
Level 14

yup... been bit there by that mistake.

shuckyshark
Level 13

maybe, but when the downtime happened, I was up 100%...

sparda963
Level 12

I think there needs to be a lesson in it. The cloud does have its uses, but putting your vital 24/7 systems in there might not be the best idea. Systems that are not vital or 24/7, such as ERP, HR/ES, and employee portals, would be good candidates IMO. While it would be inconvenient for them to be unavailable for a few hours, it will not bring your entire business to a grinding halt until they are back online.

jgrimes4292
Level 9

That is the story they tell us. How do we know it was not a hack and they want to keep it on the DL?

mtgilmore1
Level 13

Lessons learned are always great. I would hate to see the "pucker factor" of those admins when the system started shutting down. Their systems will be better after this problem.

tinmann0715
Level 16

While I agree with all of your questions, and yes, this is Amazon, none of this surprises me. Amazon is plagued with the same issues the rest of us are: legacy systems, human error, unknown dependencies, untested redundancies and assumptions, and unanticipated results. Hey! Now I don't feel so bad about my IT junk anymore. :-)

Jfrazier
Level 18

Interesting photo in the meme.... It looks like an Amateur Radio Field Day photo.... Radioteacher might agree with the assessment. I might even go so far as to call the HF rig a Yaesu FT-757....


Radioteacher
Level 14

I wonder if someone out there said, "Why don't they run S3 in the cloud?"

RT

Radioteacher
Level 14

I have not seen this hat at Field Day but I will be recreating this picture.  I might have to add a vacuum tube to the top of the hat and some antenna wire.

Remember that Amateur Radio Field Day, in the US, is always the fourth full weekend in June.

There is a locator map at arrl.org to find a club near you.

Yes, someone in the upper northeastern United States, most likely Connecticut, thought that getting out in late June to play radios for 24 hours straight in a field would be a great idea. They couldn't care less that the low is 82 degrees in South Texas, with highs up to 100 degrees and 90% humidity.

But every June in San Antonio, TX we "Cowboy Up" and do it again. 

Let's see them do Winter Field Day in January!!!!

RT


jgrimes4292
Level 9

Yeah, I used to work for ACK Radio Supply back in the day. Always wanted to get my Technician class license. All I could ever afford was the foil hat. That gear is not cheap.

Jfrazier
Level 18

The Tech license these days is very simple... you can get into 2m FM (a mobile rig, good for the house too) for less than $150.

The HF rigs cost the bigger bucks, as do some of the digital VHF/UHF mobile radios.

tinmann0715
Level 16

Oddly enough, I received the following in my inbox this morning from one of my BCP/DR periodicals. We all chuckle and grunt at Amazon's fiasco last week, but it is a telling reminder that cascading events usually lead to the catastrophic scenario. Preparation at all levels mitigates catastrophes:

This week on March 11th, we mark the anniversary of the triple disaster which struck northern Japan in 2011. The earthquake, and subsequent tsunami and nuclear meltdown took over 18,000 lives and displaced over 470,000 people. Now in 2017, there are still over 127,000 people without a permanent home.


tallyrich
Level 15

Stories like that are part of the reason I'm a supporter of Compassion International.

byrona
Level 21

I agree with sqlrockstar that I totally appreciate how transparent they were with this. We operate several data centers and cloud services as well, so I can totally sympathize with what happened to Amazon. The complexity of these technologies and environments also introduces a lot of fragility. While the technology itself may not necessarily be fragile, the fact that it's so complex means that individuals typically don't understand the entirety of it, making it much easier for them to do something or make a change that has impacts beyond their understanding.

About the Author
Thomas LaRock is a Head Geek at SolarWinds and a Microsoft® Certified Master, SQL Server® MVP, VMware® vExpert, and a Microsoft Certified Trainer. He has over 20 years of experience in the IT industry in roles including programmer, developer, analyst, and database administrator.