The recent failure of the computer systems that run part of the UK’s air traffic control system prompted a deluge of complaints from angry travellers, but also a batch of emails from technology suppliers to the press.
The fault lay, say some, with inadequate use of performance measuring tools (the back-up system was unable to meet demand); others blame insufficient investment (parts of the software are more than two decades old), a lack of testing, insufficient use of modern systems management tools (the failure occurred during a routine restart) and, above all, a poor disaster recovery plan.
Any or all of this may be true, but the complaints really point to two problems. First, the managers at National Air Traffic Services (NATS), at some point in their calculations, deemed it acceptable that failures occasionally occur, as long as they do not affect safety.
This is because it is enormously expensive to build computer systems that give more than six-nines (99.9999%) reliability.
Such systems are used in, for example, on-board computers on aircraft, and in the City; they appear not to be used in air traffic control.
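To put such figures in perspective, an availability percentage translates directly into permitted downtime per year. The sketch below is simple arithmetic, not NATS data: at six-nines, a system may be unavailable for only about half a minute a year.

```python
# Convert an availability percentage into permitted downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignoring leap years)

def downtime_per_year(availability_pct: float) -> float:
    """Seconds of permitted downtime per year at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("three nines", 99.9),
                   ("five nines", 99.999),
                   ("six nines", 99.9999)]:
    print(f"{label} ({pct}%): {downtime_per_year(pct):.1f} seconds/year")
```

Three nines allows roughly eight and three-quarter hours of outage a year; six nines allows barely 32 seconds, which is why each extra nine costs so much more to engineer.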
The second problem is more general. Society is now almost entirely dependent on computers, yet computers are not yet entirely dependable. Earlier this year, for example, the Sasser worm caused systems across the world to crash, with air travellers once again among the victims as baggage handling systems failed.
It would be easy to preach that business needs to invest more and fix the problem.
But huge investments are not always feasible. Business continuity specialists repeatedly report that the biggest problems occur because of a lack of management awareness and attention.
It is not, ultimately, about technology.