Delta Air Lines, widely regarded as one of the most reliable U.S. carriers, recently suffered a technical meltdown when a massive system power outage forced it to cancel thousands of flights and delay many thousands more.
What happened, and how can you prevent it from happening to your business?
First, let’s identify what caused the power outage.
Initially, it was reported that the cause was an external power failure in Georgia.
Further investigation revealed that the problem was a malfunction in Delta’s internal power control module, the equivalent of a house’s main fuse box. In such a situation, Delta’s 7,000 servers are supposed to switch over to a backup power supply.
Unfortunately, around 300 of the Delta servers were not connected to the backup power supply. When the other 6,700 servers came back up, they could not contact those 300 servers, crippling the functionality of the entire system.
Don’t be fooled by digital transformation
Digital technologies are fundamentally changing the way companies do business.
Today, consumers demand that their suppliers be agile, innovative, people-oriented and connected. They expect to be able to instantly interact with companies 24×7, and demand full transparency at all stages of the interaction.
Delta is certainly no slouch in this area, and has won awards for the quality and capabilities of its online apps.
But while digital transformation may have transformed the consumer-facing front end of a company’s computer systems, it didn’t necessarily change the data-crunching back end.
Back in the server farm, we may find incompatible systems, the result of inorganic growth, glued together with digital scotch tape and baling wire. We may find systems based on long-extinct technologies that no one in the organisation even knows how to maintain or debug.
Digital transformation is vital to a company’s survival, but it is no guarantee of the end-to-end health of the back end computer infrastructure.
Identify and reduce your technical debt now
So how do you evaluate the state of your back end infrastructure?
Technical debt is a useful concept that can help frame the discussion. If you borrow money and do not repay on time, you incur monetary debt that, due to interest, only increases over time.
Similarly, if you solve a computer problem with a shortcut rather than with the soundest long-term solution, you incur technical debt, and the cost of repaying that debt rises over time.
Every company should take a hard look at its existing computer infrastructure and evaluate the sources and costs of its technical debt.
Some examples: servers that are not connected to a backup power supply; or a lift-and-shift migration of your on-premises back end to the cloud that left you with a single database server instead of a redundant pair, so that one failure will cause data loss and significant downtime.
Or, over time, you find that the inability to get a combined bird’s-eye view of the entire business is costing you money.
In all these cases, nothing is broken today, and, with a little luck, the enterprise can probably continue running as is.
But these situations may become very costly down the road: luck does not last forever, and the cost of fixing them only goes up over time.
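Finding these weak points can start with something as simple as an inventory audit. The sketch below is purely illustrative (the inventory format, host names and checks are assumptions, not a real tool); it flags servers with no backup power and roles that have quietly lost their redundancy:

```python
from collections import Counter

# Hypothetical inventory: in practice this would come from your CMDB
# or cloud provider's API, not a hard-coded list.
inventory = [
    {"host": "app-01", "role": "app", "backup_power": True},
    {"host": "app-02", "role": "app", "backup_power": False},
    {"host": "db-01",  "role": "db",  "backup_power": True},
    # After a lift-and-shift, the second database server was never recreated.
]

# Servers that would drop offline if the primary power feed fails.
no_backup = [s["host"] for s in inventory if not s["backup_power"]]

# Any role served by a single machine is a single point of failure.
role_counts = Counter(s["role"] for s in inventory)
single_points = [role for role, count in role_counts.items() if count < 2]

print("No backup power:", no_backup)          # ['app-02']
print("Unreplicated roles:", single_points)   # ['db']
```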
Plan for periodic product modernisation efforts, in which you lower your technical debt to acceptable levels.
It may be worthwhile to bring in an external partner, who can evaluate your system with an objective eye, and can propose cost-effective remedies based on proven state-of-the-art technologies.
Architect your system so it can recover quickly
Even after the Delta system came back up, it took many hours to re-calculate the rotation schedule of airplanes, pilots and crews from scratch.
During that time, for most of the following day, the system was largely unresponsive.
This kind of problem can be avoided with the proper architecture. For example, the algorithm for calculating the rotation schedule could be designed around the map-reduce paradigm, in which the first stages of the calculation are divided in parallel across many compute processes, and the results are then combined, in a final step, into a single master schedule.
In such an architecture, higher speed can be obtained by temporarily dedicating more machines to perform the first processing stages in parallel.
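As an illustration only (the fleet data and the trivial "scheduling" function below are hypothetical stand-ins, not Delta's actual scheduler), a map-reduce style calculation might fan the per-fleet work out across a process pool and then merge the partial results into one master schedule:

```python
from multiprocessing import Pool

# Hypothetical input: flights grouped by fleet. The real scheduling problem
# is far more complex; this only shows the map-reduce shape of the solution.
fleets = {
    "A320": ["DL100", "DL204", "DL317"],
    "B737": ["DL410", "DL522"],
    "B757": ["DL603", "DL688", "DL701"],
}

def schedule_fleet(item):
    """Map step: compute a partial rotation schedule for one fleet."""
    fleet, flights = item
    # Placeholder calculation: assign each flight to a sequential rotation slot.
    return {flight: f"{fleet}-slot-{i}" for i, flight in enumerate(flights)}

def merge(partials):
    """Reduce step: combine the partial schedules into a single master schedule."""
    master = {}
    for partial in partials:
        master.update(partial)
    return master

if __name__ == "__main__":
    # Speed scales with the number of workers dedicated to the map phase.
    with Pool(processes=4) as pool:
        partial_schedules = pool.map(schedule_fleet, fleets.items())
    print(merge(partial_schedules))
```

The point of the design is that recovery time becomes a function of how many machines you can temporarily throw at the map phase, rather than a fixed, multi-hour serial calculation.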
Conduct realistic failover drills regularly
If Delta had conducted a failover drill, it would have discovered the 300 servers that were not connected to the backup power supply, and fixed the problem at relatively low cost.
Failover testing is often skipped because it is expensive and because it requires careful planning to ensure it does not harm ongoing business activity.
But if you never test your planned failover processes, you have no idea whether they will really work.
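As a minimal sketch of one automated piece of such a drill (the host names, ports and simple reachability check are assumptions; a real drill would cut over to backup power or a standby site and verify business transactions end to end), the idea is to trigger the failover and then confirm that every critical server still answers:

```python
import socket

# Hypothetical list of servers that must remain reachable after the
# primary power feed (or primary site) is deliberately taken offline.
SERVERS = [("app-01.example.com", 443), ("db-01.example.com", 5432)]

def reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_drill():
    """After failover is triggered, verify every critical server is reachable."""
    failures = [(h, p) for h, p in SERVERS if not reachable(h, p)]
    if failures:
        print("Failover drill FAILED, unreachable:", failures)
    else:
        print("Failover drill passed: all critical servers reachable.")

if __name__ == "__main__":
    run_drill()
```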
Conducting regularly scheduled failover drills also ensures that people know what they are supposed to do when a real emergency strikes.
Here again, it may be worthwhile to bring in an external partner who has experience in setting up and performing meaningful failover testing.
There’s no doubt that Delta’s misfortune was widespread and potentially preventable. But CIOs in all industries can learn from its mistakes.
The whole idea is to reduce the number of points where things can go wrong, and the steps above provide a path forward.
Sourced by Moshe Kranc, CTO, Ness Digital Engineering