Lessons Learned From the Delta Outage

Level 17

This past Monday morning Delta suffered a disruption to their ticketing systems. While the exact root cause has yet to be announced, I did find mention here that the issue was related to a switchgear, a piece of equipment that allows for power failover. It's not clear to me right now if Delta or Georgia Power is responsible for maintaining the switchgear, but something tells me that right now a DBA is being blamed for it anyway.

The lack of facts hasn't stopped the armchair architects from taking to the internet over the past 24 hours in an effort to point out all the ways that Delta failed. I wanted to wait until facts came out about the incident before offering my opinion, but that's not how the internet works.

So, here's my take on where we stand right now, with little to no facts at my disposal.

HA != DR

I've had to correct more than one manager in my career that there is a big difference between high availability (HA) and disaster recovery (DR). Critics yesterday mentioned that Delta should have had geo-redundancy in place to avoid this outage. But without facts it's hard to say that such redundancy would have solved the issue. Once I heard about it being power related, I thought about power surges, hardware failures, and data corruption. You know what happens to highly available data that is corrupted? It becomes corrupted data everywhere, that's what. That's why we have DR planning, for those cases when you need to restore your data to the last known good point in time.
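
To make that concrete, here's a minimal sketch (plain Python with toy in-memory "databases", an assumed example that has nothing to do with Delta's actual systems) of why a synchronous replica inherits corruption the instant it happens, while a point-in-time backup still holds the last known good copy:

```python
# Toy illustration (assumed example, not Delta's architecture): why HA
# replication alone cannot save you from corruption, but a point-in-time
# backup can.
import copy

class Database:
    """A trivial in-memory stand-in for a database."""
    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

primary = Database()
replica = Database()

def ha_write(key, value):
    """'HA': every write on the primary is synchronously mirrored to the replica."""
    primary.write(key, value)
    replica.write(key, value)

ha_write("ticket:1001", "confirmed")

# 'DR': a periodic backup captures a known-good point in time.
backup = copy.deepcopy(primary.rows)  # think "last night's full backup"

# A fault corrupts data on the primary; HA faithfully replicates the damage.
ha_write("ticket:1001", "\x00GARBAGE\x00")

assert replica.rows["ticket:1001"] == "\x00GARBAGE\x00"  # replica is corrupt too
assert backup["ticket:1001"] == "confirmed"              # backup still has good data

# Recovery means restoring to the last known good point in time (and accepting
# whatever was written after the backup as your data loss; that's the RPO).
primary.rows = copy.deepcopy(backup)
print(primary.rows["ticket:1001"])  # -> confirmed
```

The replica protects you from losing a server; only the backup protects you from losing the data.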

This Was a BCP Exercise

Delta was back online about six hours after the outage was first reported. Notice I didn't say they were "back to normal". With airlines it takes days to get everything and everyone back on schedule. But the systems were back online, in no small part due to some heroic efforts on the part of the IT staff at Delta. This was not about HA or DR; no, this was about business continuity. At some point a decision was made on how best to move forward, on how to keep the business moving despite suffering a freak power outage event involving a highly specialized piece of equipment (the switchgear). From what I can tell, without facts, it would seem the BCP planning at Delta worked rather well, especially when you consider that Southwest recently had to wait 12 hours to reboot their systems due to a bad router.

Too Big To Failover

Most recovery sites are not built to handle all of the regular workload; they are designed to handle just the minimum necessary for business to continue. Even if failover was an option, many times the issue isn't with the failover (that's the easy part), the issue is with the failback to the original primary systems. The amount of data involved may be so cumbersome that a six-hour outage is preferable to the 2-3 days it might take to fail back. It is quite possible this outage was so severe that Delta was at a point where they were too big to failover. And while it is easy to just point to the Cloud and yell "geo-redundancy" at the top of your lungs, the reality is that such a design costs money. Real money.
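
For a rough sense of that failback math (the numbers below are made up purely for illustration; I have no idea what Delta's actual data volumes or WAN links look like), here's the back-of-the-envelope calculation:

```python
# Back-of-the-envelope failback math. All numbers here are assumptions for
# the sake of illustration, not Delta's real figures.

def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps WAN link,
    assuming only `efficiency` of the raw bandwidth is usable."""
    data_bits = data_tb * 1e12 * 8             # terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return data_bits / usable_bps / 3600

# Example: 200 TB of operational data over a 10 Gbps link at 70% efficiency.
print(f"{transfer_hours(200, 10):.1f} hours")  # ~63.5 hours, i.e. 2-3 days
```

At those assumed numbers you're looking at roughly 63 hours just to move the bits, before any validation or re-synchronization, which is why a six-hour outage can end up being the cheaper option.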

Business Decisions

If you are reading this and thinking "Delta should have foreseen everything you mentioned above and built what was needed to avoid this outage" then you are probably someone who has never sat down with the business side and worked through a budget. I have no doubt that Delta has the technical aptitude to architect a 21st century design, but the reality of legacy systems, volumes of data, and near real-time response rates on a global scale puts that price tag into the hundreds of millions of dollars. While that may be chump change to a high-roller such as yourself, for a company (and industry) that has thin margins the idea of spending that much money is not appealing. That's why things get done in stages, a little bit at a time. I bet the costs for this outage, estimated in the tens of millions of dollars, are still less than the cost of the infrastructure upgrades needed to have all of their data systems rebuilt.

Stay Calm and Be Nice

If you've ever seen the Oscar-snubbed classic movie "Roadhouse", you know the phrase "be nice". I have read a lot of coverage of the outage since yesterday and one thing that has stood out to me is how professional the entire company has been throughout the ordeal. The CEO even made this video in an effort to help people understand that they are doing everything they can to get things back to normal. And, HE WASN'T EVEN DONE YET, as he followed up with THIS VIDEO. How many other CEOs put their face on an outage like this? Not many. With all the pressure on everyone at Delta, this attitude of staying calm and being nice is something that resonates with me.

The bottom line here, for me, is that everything I read about this makes me think Delta is far superior to their peers when it comes to business continuity, disaster recovery, and media relations.

Like everyone else, I am eager to get some facts about what happened to cause the outage, and would love to read the post-mortem on this event if it ever becomes available. I think the lessons that Delta learned this week would benefit everyone who has had to spend a night in a data center keeping their systems up and running.

19 Comments
MVP

Every major outage that affects a company's ability to do business is somewhat different. You can plan for and have many contingencies in place to mitigate the issue, but you will not have thought of everything... Murphy is always waiting in the wings for the opportunity to toss a monkey wrench into the mess and create havoc.

Even if you test your switchgear on a regular basis, it can still fail. The testing process itself uses up cycles in the expected life of the equipment. Granted, that should be rare, but it can happen.

So, I'll sit here working and wait to see what the final word is. I agree, there are too many armchair quarterbacks tossing out their opinions without knowing all the relevant facts.

Thanks for posting, sqlrockstar!

Level 14

Obviously the CEO was well coached in crisis management... By getting out in front... he probably helped things out to some reasonable extent.

As for the wild guesses and (should have, would have, could have) remarks by talking heads and "experts".... Jfrazier is spot on about Mr. Murphy..... It's always a case of when... not if... bad things happen.

From there it is how you deal with it.... Given the complexity and outside factors of Delta's operation... they seemed to do a fair job...

When the details finally come out... I suspect that nobody will backtrack on their "pronouncements"...

Well done sqlrockstar

Level 20

Seems to show how seriously things fall apart when the network goes away!

Great article.  Although, in my experience it is usually blamed on the "Network Shop".  It would be nice to have a Lessons Learned on this one, and I am sure in the next year or so it will be done.

Fantastic article, sqlrockstar! The moral of the story is 'Don't be a D!ck!'.

Business in general has become more and more hostile, the more money is involved. Delta's CEO deserves a gigantic pat on the back for standing up on this, and his entire company is an example to all enterprises. Be nice, be honest, and be responsible.

My take is that this outage was power-related, at the electric utility level.  Delta might not have been able to predict that.  And if this was caused by power, calling in enough 18-wheeler generator trucks, wiring them in, starting them up, and transferring the load to them--and migrating off of them later--likely couldn't have been done in the time frame given.

My organization has D.R. plans to accomplish exactly this, and I've seen the plan implemented, generator trucks quickly called up from a major metropolitan area two hours away, backed into place and connected to a hospital complex that occupies two city blocks.  It works.  It's not cheap.  And it doesn't happen in the amount of time it takes to get the trucks here.  Because this DR solution is one based on double-failures--where city power is unavailable and backup generators in a hospital complex are at risk or being replaced.

It would be interesting to learn how Delta and the power company analyze the cause of the failure and future resilience and redundancy to avoid this outage next time.

A real-time-replicating dual-data-center redundancy solution might not be practical or affordable until a lot of money and time are available to design and implement it.

Remember, when you prepare your Disaster Recovery solutions and plans, to test them regularly.  A good plan can be worthless if you haven't tested every aspect of it in real life.  Maybe your 18-wheeler generator trucks won't fit in the alley.  Maybe their connectors aren't compatible with your buildings' outlets . . .

Level 12

Well said!  HA != DR and even the best systems have holes, weaknesses, and/or contingencies that no one thought of along the way.  You plan for everything you can, and handle anything else if it happens.  We all know Mr. Murphy lives in datacenters, so the odds are ever in his ignoble favor that something virtually unforeseeable will happen.

I am forwarding this post to many here at my company.  Well said.

RT

"And while it is easy to just point to the Cloud and yell "geo-redundancy" at the top of your lungs the reality is that such a design costs money. Real money."

Probably one of the best points made in this article.

A good read/good topic. I like this one, sqlrockstar. I've completely stayed out of the Delta scenario - I had only heard about it in passing and hadn't read anything. With what you've posted, I think it makes sense. I think I had a different opinion before I read up on this, but it really was an armchair view.

As a seasoned BCP'er the first thought that popped into my head was, "They didn't commit to their BCP/DR strategy." Also, I read that the weekly scheduled generator test caused the fire. My questions as an auditor/BCP'er:

  1. Is Monday morning the best time to schedule that weekly maintenance given that Monday is a high volume travel day?

  2. What are your RTOs/RPOs?

  3. What are your plans if the building catches fire? (There are no follow-up "What if..." questions)

  4. When was the last time you failed over and ran PRD out of another datacenter?

The remaining questions would be around how to restore processes quickly, etc.

Level 17

Peter,

Thanks for the comment. I heard about the fire late last night; I'm still looking for some facts. I haven't found any reference to a weekly test; please pass along a link if you have one available. The sources I found said the fire happened as a result of the initial failover, but made no mention of it being a test.

I'm not certain that they didn't commit to their strategy; I think they did *try* to commit, but things failed. The CEO mentioned that during the failover the systems were "unstable", which I take to mean either corrupted or just out of sync. Therefore a reboot was needed.

My follow-up to your points:

1. Delta is a 24/7 shop; what time of day/week would you suggest is best? I've no data to say that Monday is busier than a Friday, for example.

2. I'd love to know these as well. Considering SWA was out for 12 hours when they had to reboot recently, I'm wondering about the industry as a whole, too.

3. Iron Mountain, I would guess. If the entire DC burns to the ground, then you'd better have tapes stored somewhere.

4. I'm guessing they would be doing this at least twice a year; that used to be the standard for finance.

Great questions and I'm hoping we can keep getting some facts from Delta, and maybe even perform our own post-mortem.

Thanks!

Level 11

Good read. Thanks!

Level 14

Great write up.  As a network security guy, my thoughts automatically turn to the dark side.  Delta isn't the first airline in 2016 with a crippling outage and there are anti-capitalist groups out there that would love to take down an airline.

Level 12

I think it's fine for outsiders to try to guess what went wrong...if they do that in a way that shows that they are clearly just guessing.  IT is the only technical profession that does not share, openly, when and why failures happen as part of their profession.  Failure analysis in engineering, medicine, architecture is built into their processes, even publishing studies and outcomes.  It's part of what defines them as professions, not just trades.

Level 17

Well, blaming the outage on outsourcing wasn't presented as a guess, it was presented as "they've outsourced, and now this happened, and Delta won't return my calls, so I'm going to present this as fact". But I suppose that's what passes for journalism these days.

I wonder why IT doesn't share such information openly? It's clear that we all have failures, why not try to learn from one another? Instead we keep everything in secrecy and then we wonder why companies are being hacked, or why the backups can't be restored.

Level 12

Mostly it's due to the fact we don't have licensing.  And we likely never will.  I think we should, but at the architect and design level, not the builder level. That's how it works with engineering and architecture.

MVP

Fascinating story, well presented and written. Thank you! It would be great to see "[Update: dd/mm/yy]"s at the bottom of the article itself to quickly catch up with the latest events (yes, I know, comments and Google are always there, but having short snippets in the article itself would make it much easier to follow the whole sequence). Thanks again for sharing!

Level 13

Definitely important to understand the difference between DR and HA... and although it is very difficult when tons of people are breathing down your neck... staying calm keeps everyone else level-headed.

About the Author
Thomas LaRock is a Head Geek at SolarWinds and a Microsoft® Certified Master, SQL Server® MVP, VMware® vExpert, and a Microsoft Certified Trainer. He has over 20 years of experience in the IT industry in roles including programmer, developer, analyst, and database administrator.