This past Monday morning Delta suffered a disruption to their ticketing systems. While the exact root cause has yet to be announced, I did find mention here that the issue was related to a switchgear, a piece of equipment that allows for power failover. It's not clear to me right now if Delta or Georgia Power is responsible for maintaining the switchgear, but something tells me that right now a DBA is being blamed for it anyway.
The lack of facts hasn't stopped the armchair architects from taking to the internet over the past 24 hours in an effort to point out all the ways that Delta failed. I wanted to wait until facts came out about the incident before offering my opinion, but that's not how the internet works.
So, here's my take on where we stand right now, with little to no facts at my disposal.
HA != DR
I've had to explain to more than one manager in my career that there is a big difference between high availability (HA) and disaster recovery (DR). Critics yesterday mentioned that Delta should have had geo-redundancy in place to avoid this outage. But without facts it's hard to say that such redundancy would have solved the issue. Once I heard the outage was power related, I thought about power surges, hardware failures, and data corruption. You know what happens to highly available data that is corrupted? It becomes corrupted data everywhere, that's what. That's why we have DR planning, for those cases when you need to restore your data to the last known good point in time.
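To make that distinction concrete, here's a minimal sketch (a toy in-memory model with made-up names, nothing resembling an actual airline ticketing system) of why replication alone doesn't save you from corruption, and why you still need a point-in-time copy:

```python
from copy import deepcopy

# Toy model of HA replication vs. a point-in-time DR backup.
# All names and data here are hypothetical, for illustration only.

primary = {"ticket_1042": "confirmed"}
replica = deepcopy(primary)        # HA: the replica mirrors every committed write
backup_2am = deepcopy(primary)     # DR: last known good point-in-time copy

# A bad write (power event, bug, fat finger) corrupts the primary...
primary["ticket_1042"] = "<<corrupted page>>"
replica["ticket_1042"] = primary["ticket_1042"]   # ...and HA faithfully replicates the corruption

print(replica["ticket_1042"])      # corrupted everywhere; HA did its job, just not the job you needed
primary = deepcopy(backup_2am)     # DR: restore to the last known good point in time
print(primary["ticket_1042"])      # confirmed
```

High availability keeps the data flowing; disaster recovery is what gets you back to a state worth flowing.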
This Was a BCP Exercise
Delta was back online about six hours after the outage was first reported. Notice I didn't say they were "back to normal". With airlines it takes days to get everything and everyone back on schedule. But the systems were back online, in no small part due to some heroic efforts on the part of the IT staff at Delta. This was not about HA or DR; no, this was about business continuity. At some point a decision was made on how best to move forward, on how to keep the business moving despite suffering a freak power outage event involving a highly specialized piece of equipment (the switchgear). From what I can tell, without facts, it would seem the business continuity planning (BCP) at Delta worked rather well, especially when you consider that Southwest recently had to wait 12 hours to reboot their systems due to a bad router.
Too Big To Failover
Most recovery sites are not built to handle all of the regular workload; they are designed to handle just the minimum necessary for business to continue. Even if failover were an option, many times the issue isn't with the failover (that's the easy part), the issue is with the failback to the original primary systems. The amount of data involved may be so cumbersome that a six-hour outage is preferable to the 2-3 days it might take to fail back. It is quite possible this outage was so severe that Delta was at a point where they were too big to fail over. And while it is easy to just point to the Cloud and yell "geo-redundancy" at the top of your lungs, the reality is that such a design costs money. Real money.
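To put some rough numbers on that (and to be clear, every figure below is an assumption I'm inventing for illustration, not anything known about Delta's environment), a quick back-of-envelope calculation shows how a failback window stretches into days:

```python
# Back-of-envelope failback estimate. Every number is a hypothetical assumption
# for illustration; nothing here reflects Delta's actual systems or data volumes.

data_to_sync_tb = 100    # assumed: data changed at the recovery site that must sync back
link_gbps = 5            # assumed: effective bandwidth back to the primary site
overhead = 0.5           # assumed: consistency checks, validation, and cutover testing

transfer_hours = (data_to_sync_tb * 8 * 1000) / link_gbps / 3600
total_hours = transfer_hours * (1 + overhead)

print(f"Raw transfer: {transfer_hours:.0f} hours")
print(f"With validation and cutover: {total_hours:.0f} hours (~{total_hours / 24:.1f} days)")
```

With those assumed numbers you land right around that 2-3 day range, which is exactly why a business might choose to ride out a six-hour outage instead.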
Business Decisions
If you are reading this and thinking "Delta should have foreseen everything you mentioned above and built what was needed to avoid this outage" then you are probably someone who has never sat down with the business side and worked through a budget. I have no doubt that Delta has the technical aptitude to architect a 21st century design, but the reality of legacy systems, volumes of data, and near real-time response rates on a global scale puts that price tag into the hundreds of millions of dollars. While that may be chump change to a high-roller such as yourself, for a company (and industry) that has thin margins the idea of spending that much money is not appealing. That's why things get done in stages, a little bit at a time. I bet the costs for this outage, estimated in the tens of millions of dollars, are still less than the cost of the infrastructure upgrades needed to have all of their data systems rebuilt.
Stay Calm and Be Nice
If you've ever seen the Oscar-snubbed classic movie "Road House", you know the phrase "be nice". I have read a lot of coverage of the outage since yesterday and one thing that has stood out to me is how professional the entire company has been throughout the ordeal. The CEO even made this video in an effort to help people understand that they are doing everything they can to get things back to normal. And, HE WASN'T EVEN DONE YET, as he followed up with THIS VIDEO. How many other CEOs put their face on an outage like this? Not many. With all the pressure on everyone at Delta, this attitude of staying calm and being nice is something that resonates with me.
The bottom line here, for me, is that everything I read about this makes me think Delta is far superior to their peers when it comes to business continuity, disaster recovery, and media relations.
Like everyone else, I am eager to get some facts about what happened to cause the outage, and would love to read the post-mortem on this event if it ever becomes available. I think the lessons that Delta learned this week would benefit everyone who has ever had to spend a night in a data center keeping their systems up and running.