Lessons Learned From the Delta Outage

Level 17

This past Monday morning Delta suffered a disruption to their ticketing systems. While the exact root cause has yet to be announced, I did find mention here that the issue was related to a switchgear, a piece of equipment that allows for power failover. It's not clear to me right now if Delta or Georgia Power is responsible for maintaining the switchgear, but something tells me that right now a DBA is being blamed for it anyway.

The lack of facts hasn't stopped the armchair architects from taking to the internet over the past 24 hours in an effort to point out all the ways that Delta failed. I wanted to wait until facts came out about the incident before offering my opinion, but that's not how the internet works.

So, here's my take on where we stand right now, with little to no facts at my disposal.

HA != DR

I've had to correct more than one manager in my career that there is a big difference between high availability (HA) and disaster recovery (DR). Critics yesterday mentioned that Delta should have had geo-redundancy in place to avoid this outage. But without facts it's hard to say that such redundancy would have solved the issue. Once I heard about it being power related, I thought about power surges, hardware failures, and data corruption. You know what happens to highly available data that is corrupted? It becomes corrupted data everywhere, that's what. That's why we have DR planning, for those cases when you need to restore your data to the last known good point in time.
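
To make that concrete, here's a minimal sketch (plain Python with toy in-memory "databases", an assumed example that has nothing to do with Delta's actual systems) of why a synchronous replica inherits corruption the instant it happens, while a point-in-time backup still holds the last known good copy:

```python
# Toy illustration (assumed example, not Delta's architecture): why HA
# replication alone cannot save you from corruption, but a point-in-time
# backup can.
import copy

class Database:
    """A trivial in-memory stand-in for a database."""
    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

primary = Database()
replica = Database()

def ha_write(key, value):
    """'HA': every write on the primary is synchronously mirrored to the replica."""
    primary.write(key, value)
    replica.write(key, value)

ha_write("ticket:1001", "confirmed")

# 'DR': a periodic backup captures a known-good point in time.
backup = copy.deepcopy(primary.rows)  # think "last night's full backup"

# A fault corrupts data on the primary; HA faithfully replicates the damage.
ha_write("ticket:1001", "\x00GARBAGE\x00")

assert replica.rows["ticket:1001"] == "\x00GARBAGE\x00"  # replica is corrupt too
assert backup["ticket:1001"] == "confirmed"              # backup still has good data

# Recovery means restoring to the last known good point in time (and accepting
# whatever was written after the backup as your data loss; that's the RPO).
primary.rows = copy.deepcopy(backup)
print(primary.rows["ticket:1001"])  # -> confirmed
```

The replica protects you from losing a server; only the backup protects you from losing the data.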

This Was a BCP Exercise

Delta was back online about six hours after the outage was first reported. Notice I didn't say they were "back to normal". With airlines it takes days to get everything and everyone back on schedule. But the systems were back online, in no small part due to some heroic efforts on the part of the IT staff at Delta. This was not about HA or DR; no, this was about business continuity. At some point a decision was made on how best to move forward, on how to keep the business moving despite suffering a freak power outage event involving a highly specialized piece of equipment (the switchgear). From what I can tell, without facts, it would seem the BCP planning at Delta worked rather well, especially when you consider that Southwest recently had to wait 12 hours to reboot their systems due to a bad router.

Too Big To Failover

Most recovery sites are not built to handle all of the regular workload; they are designed to handle just the minimum necessary for business to continue. Even if failover was an option, many times the issue isn't with the failover (that's the easy part), the issue is with the failback to the original primary systems. The amount of data involved may be so cumbersome that a six-hour outage is preferable to the 2-3 days it might take to fail back. It is quite possible this outage was so severe that Delta was at a point where they were too big to failover. And while it is easy to just point to the Cloud and yell "geo-redundancy" at the top of your lungs, the reality is that such a design costs money. Real money.
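
For a rough sense of that failback math (the numbers below are made up purely for illustration; I have no idea what Delta's actual data volumes or WAN links look like), here's the back-of-the-envelope calculation:

```python
# Back-of-the-envelope failback math. All numbers here are assumptions for
# the sake of illustration, not Delta's real figures.

def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps WAN link,
    assuming only `efficiency` of the raw bandwidth is usable."""
    data_bits = data_tb * 1e12 * 8             # terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return data_bits / usable_bps / 3600

# Example: 200 TB of operational data over a 10 Gbps link at 70% efficiency.
print(f"{transfer_hours(200, 10):.1f} hours")  # ~63.5 hours, i.e. 2-3 days
```

At those assumed numbers you're looking at roughly 63 hours just to move the bits, before any validation or re-synchronization, which is why a six-hour outage can end up being the cheaper option.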

Business Decisions

If you are reading this and thinking "Delta should have foreseen everything you mentioned above and built what was needed to avoid this outage" then you are probably someone who has never sat down with the business side and worked through a budget. I have no doubt that Delta has the technical aptitude to architect a 21st century design, but the reality of legacy systems, volumes of data, and near real-time response rates on a global scale puts that price tag into the hundreds of millions of dollars. While that may be chump change to a high-roller such as yourself, for a company (and industry) that has thin margins the idea of spending that much money is not appealing. That's why things get done in stages, a little bit at a time. I bet the costs for this outage, estimated in the tens of millions of dollars, are still less than the cost of the infrastructure upgrades needed to have all of their data systems rebuilt.

Stay Calm and Be Nice

If you've ever seen the Oscar-snubbed classic movie "Roadhouse", you know the phrase "be nice". I have read a lot of coverage of the outage since yesterday and one thing that has stood out to me is how professional the entire company has been throughout the ordeal. The CEO even made this video in an effort to help people understand that they are doing everything they can to get things back to normal. And, HE WASN'T EVEN DONE YET, as he followed up with THIS VIDEO. How many other CEOs put their face on an outage like this? Not many. With all the pressure on everyone at Delta, this attitude of staying calm and being nice is something that resonates with me.

The bottom line here, for me, is that everything I read about this makes me think Delta is far superior to their peers when it comes to business continuity, disaster recovery, and media relations.

Like everyone else, I am eager to get some facts about what happened to cause the outage, and would love to read the post-mortem on this event if it ever becomes available. I think the lessons that Delta learned this week would benefit everyone who has had to spend a night in a data center keeping their systems up and running.

19 Comments
MVP

Every major outage that affects a company's ability to do business is somewhat different. You can plan for and have many contingencies in place to mitigate the issue, but you will not have thought of everything... Murphy is always waiting in the wings for the opportunity to toss a monkey wrench into the mess and create havoc.

Even if you test your switchgear on a regular basis, it can still fail. The testing process itself uses up cycles in the expected life of the equipment. Granted, that should be rare, but it can happen.

So, I'll sit here working and wait to see what the final word is. I agree, there are too many armchair quarterbacks tossing out their opinions without knowing all the relevant facts.

Thanks for posting, sqlrockstar!

Level 14

Obviously the CEO was well coached in crisis management... By getting out in front... he probably helped things out to some reasonable extent.

As for the wild guesses and (should have, would have, could have) remarks by talking heads and "experts".... Jfrazier is spot on about Mr. Murphy..... It's always a case of when... not if... bad things happen.

From there it is how you deal with it.... Given the complexity and outside factors of Delta's operation... they seemed to do a fair job...

When the details finally come out... I suspect that nobody will backtrack on their "pronouncements"...

Well done sqlrockstar

Level 20

Seems to show how seriously things fall apart when the network goes away!

Great article.  Although, in my experience it is usually blamed on the "Network Shop".  It would be nice to have a Lessons Learned on this one, and I am sure in the next year or so it will be done.

Fantastic article, sqlrockstar! The moral of the story is 'Don't be a D!ck!'.

Business in general has become more and more hostile, the more money is involved. Delta's CEO deserves a gigantic pat on the back for standing up on this, and his entire company is an example to all enterprises. Be nice, be honest, and be responsible.

My take is that this outage was power-related, at the electric utility level.  Delta might not have been able to predict that.  And if this was caused by power, calling in enough 18-wheeler generator trucks, wiring them in, starting them up, and transferring the load to them--and migrating off of them later--likely couldn't have been done in the time frame given.

My organization has D.R. plans to accomplish exactly this, and I've seen the plan implemented, generator trucks quickly called up from a major metropolitan area two hours away, backed into place and connected to a hospital complex that occupies two city blocks.  It works.  It's not cheap.  And it doesn't happen in the amount of time it takes to get the trucks here.  Because this DR solution is one based on double-failures--where city power is unavailable and backup generators in a hospital complex are at risk or being replaced.

It would be interesting to learn how Delta and the power company analyze the cause of the failure and future resilience and redundancy to avoid this outage next time.

A real-time-replicating dual-data-center redundancy solution might not be practical or affordable until a lot of money and time are available to design and implement it.

Remember, when you prepare your Disaster Recovery solutions and plans, to test them regularly.  A good plan can be worthless if you haven't tested every aspect of it in real life.  Maybe your 18-wheeler generator trucks won't fit in the alley.  Maybe their connectors aren't compatible with your buildings' outlets . . .

Level 12

Well said!  HA != DR and even the best systems have holes, weaknesses, and/or contingencies that no one thought of along the way.  You plan for everything you can, and handle anything else if it happens.  We all know Mr. Murphy lives in datacenters, so the odds are ever in his ignoble favor that something virtually unforeseeable will happen.

I am forwarding this post to many here at my company.  Well said.

RT

"And while it is easy to just point to the Cloud and yell "geo-redundancy" at the top of your lungs the reality is that such a design costs money. Real money."

Probably one of the best points made in this article.

A good read/good topic. I like this one, sqlrockstar. I've completely stayed out of the Delta scenario - I had only heard about it in passing and hadn't read anything. With what you've posted, I think it makes sense. I think I had a different opinion before I read up on this, but it really was an armchair view.

As a seasoned BCP'er the first thought that popped into my head was, "They didn't commit to their BCP/DR strategy." Also, I read that the weekly scheduled generator test caused the fire. My questions as an auditor/BCP'er:

  1. Is Monday morning the best time to schedule that weekly maintenance given that Monday is a high volume travel day?

  2. What are your RTOs/RPOs?

  3. What are your plans if the building catches fire? (There are no follow-up "What if..." questions)

  4. When was the last time you failed over and ran PRD out of another datacenter?

The remaining questions would be around how to restore processes quickly, etc.

Level 17

Peter,

Thanks for the comment. I heard about the fire late last night; I'm still looking for some facts. I haven't found any reference to a weekly test; please pass along a link if you have one available. The sources I found said the fire happened as a result of the initial failover, but made no mention of it being a test.

I'm not certain that they didn't commit to their strategy; I think they did *try* to commit, but things failed. The CEO mentioned that during the failover the systems were "unstable", which I take to mean either corrupted or just out of sync. Therefore a reboot was needed.

My follow-up to your points:

1. Delta is a 24/7 shop; what time of day/week would you suggest is best? I've no data to say that Monday is busier than a Friday, for example.

2. I'd love to know these as well. Considering SWA was out for 12 hours when they had to reboot recently, I'm wondering about the industry as a whole, too.

3. Iron Mountain, I would guess. If the entire DC burns to the ground, then you'd better have tapes stored somewhere.

4. I'm guessing they would be doing this at least twice a year; that used to be the standard for finance.

Great questions and I'm hoping we can keep getting some facts from Delta, and maybe even perform our own post-mortem.

Thanks!

Level 11

Good read. Thanks!

Level 14

Great write up.  As a network security guy, my thoughts automatically turn to the dark side.  Delta isn't the first airline in 2016 with a crippling outage and there are anti-capitalist groups out there that would love to take down an airline.

Level 12

I think it's fine for outsiders to try to guess what went wrong...if they do that in a way that shows that they are clearly just guessing.  IT is the only technical profession that does not share, openly, when and why failures happen as part of their profession.  Failure analysis in engineering, medicine, architecture is built into their processes, even publishing studies and outcomes.  It's part of what defines them as professions, not just trades.

Level 17

Well, blaming the outage on outsourcing wasn't presented as a guess, it was presented as "they've outsourced, and now this happened, and Delta won't return my calls, so I'm going to present this as fact". But I suppose that's what passes for journalism these days.

I wonder why IT doesn't share such information openly? It's clear that we all have failures, why not try to learn from one another? Instead we keep everything in secrecy and then we wonder why companies are being hacked, or why the backups can't be restored.

Level 12

Mostly it's due to the fact we don't have licensing.  And we likely never will.  I think we should, but at the architect and design level, not the builder level. That's how it works with engineering and architecture.

MVP

Fascinating story, well presented and written. Thank you! It would be great to see "[Update: dd/mm/yy]"s at the bottom of the article itself to quickly catch up with the latest events (yes, I know, comments and Google are always there, but having short snippets in the article itself would make it much easier to follow the whole sequence). Thanks again for sharing!

Level 13

Definitely important to understand the difference between DR and HA... and although it is very difficult when tons of people are breathing down your neck... staying calm keeps everyone else level-headed.

About the Author
Thomas LaRock is a Head Geek at SolarWinds and a Microsoft® Certified Master, SQL Server® MVP, VMware® vExpert, and a Microsoft Certified Trainer. He has over 20 years of experience in the IT industry in roles including programmer, developer, analyst, and database administrator.