Put a group of IT executives on a conference panel and ask them the killer question: What is your worst IT nightmare?
Most dare not highlight their real fears, but the answers that do come back reflect a certain kind of thinking: the main production or email server crashes and the back-up has not been working for months; a serious hacking incident allows a Trojan horse to be active in the system for an unknown amount of time; a creeping database corruption means that both the production and backed up versions of core data are unreliable.
Such heart-stopping moments portray a dread of the instances when technology fails, but they do little to recognise the wider implications of IT letting the business down. And with IT woven deep into the fabric of the business, the impact of any serious IT problem is far-reaching and often very visible.
"The importance of IT to the modern enterprise screams out through high investment, the pervasiveness of the technology, our reliance on its continuing operation and the pain we suffer when it doesn't work," write business school lecturer Ernie Jordan and Luke Silcock of PA Consulting in their recent book, Beating IT Risks. "IT risks are large and pervasive. They can be large enough to bring an organisation to the point of ruin, yet most do not have systematic or comprehensive approaches to dealing with IT risk."
The kind of risk they are talking about takes many more forms than technical failure: projects are badly scoped and poorly implemented; a third-party service fails; employees engage in IT-based fraud; the back-up tapes are lost; security patches are applied without due care; vendors mislead about the capabilities of their products; new regulations force a major software change. The list goes on.
What is clear is that many of these risks are predictable and avoidable. What is also clear from the 10 cases highlighted below is that they can have a high impact on both the business and the career prospects of IT management.
1. ICI
Misaligning existing business processes with a new supply chain management system led to an inability to locate raw materials and fulfil customer orders
Over a period of two years, chemical industry giant ICI demonstrated – in spectacular fashion – the far-reaching consequences of a mismanaged enterprise resource planning (ERP) software roll-out.
In May 2002, ICI Quest, the company's Netherlands-based flavours and fragrances business unit, went live at its four main sites with a supply chain management software project, Q-Star, based on SAP's applications suite. The project aimed to streamline Quest's supply chain, creating projected savings of £20 million a year by 2004.
But from the outset, it encountered major problems, particularly at the main Quest Foods division in Naarden, where flavourings for products such as ice cream and alcoholic drinks were made. Most seriously, the problems with the system resulted in an "inability to locate raw materials accurately", according to ICI's CEO at the time, Brendan O'Neill, leaving staff struggling to fulfil orders and leading to a huge backlog. In an attempt to clear that backlog manually, Quest Foods instituted a seven-day working week and switched some manufacturing to other facilities.
There was no suggestion that the SAP software was at fault. The issue lay with the way it was implemented and the disruption the changes caused to existing business processes. As O'Neill said at the time: "We have implemented SAP successfully across ICI and within the Q-Star programme. The problems we experienced were unique to this [Naarden] location." The profile of the SAP roll-out was further heightened by the fact that ICI was SAP's first ever customer, back in 1972.
Over the next nine months, though, the business processes behind the foods division of ICI Quest, which had made it the leading provider of the artificial and natural flavourings that go into thousands of food products, fell apart.
The consequences were highly visible. In August 2002, ICI was forced to issue a surprise profits warning, detailing a £10 million shortfall resulting from the botched implementation. At the time, ICI's chairman characterised the problem as a mishandling of the switch from a legacy system to the new SAP system, claiming that production volumes had been restored.
But that was not the case. Although ICI had said that the software problems with the year-long implementation had been "substantially eliminated", it revealed in the first quarter of 2003 that a proportion of Quest's largest customers had now defected to competitors as a result of poor order fulfilment – a situation that was having a particularly acute impact on its high-margin products.
In March 2003, there was more bad financial news flowing from the Q-Star roll-out. The company predicted that sales from Quest's foods business would be down by 20% compared with the same period a year earlier, mainly as a result of business lost following the "customer service problems in 2002". Profits for Quest would be £20 million lower. The company said it would also write off £20 million in fiscal 2002 as a result of the problems and a further £5 million in 2003.
It was time for ICI to cut its losses. A review of "the strategic direction of future ERP systems development" concluded that parts of the system developed during the Q-Star project in 2001 and 2002 would now not be deployed, and that ICI would defer the implementation of the overall restructuring and cost-saving programme, blaming the Quest supply chain system.
The City reacted angrily to the news. In one fell swoop, 39% was wiped off the company's stock market value, depriving ICI of its position as one of the London Stock Exchange's 100 most valuable companies and putting its cherished place in the FTSE 100 index in jeopardy.
Initially, divisional management took the fall. Paul Drechsler, chief executive of Quest – a business with revenues of £0.7 billion in 2002, some 10% of ICI's total – was held responsible for the systems implementation failures.
But the bad news kept coming. With ICI listed on the New York Stock Exchange, several groups of US investors launched class action suits against the company, saying its management had issued "a series of material misrepresentations, [by stating] that they had resolved [ICI Quest's] distribution and software problems … thereby artificially inflating the price of ICI securities."
On 9 April, with the company admitting to an ongoing loss of customer confidence – and of major customers – at Quest, the responsibility for what went wrong reached the very top of the company, and CEO Brendan O'Neill was forced to resign.
Did Q-Star ever get successfully implemented at ICI Quest's Foods division? There is no telling: as a direct consequence of the mishandled implementation, the food ingredients division was sold in March 2004 for £238 million to its Ireland-based rival Kerry Group – a company that had taken full advantage of Quest's supply chain problems and poached many of its key customers.
2. Sumitomo Mitsui Bank
Weakness in security policies enabled the introduction of spyware and the leaking of staff passwords
Although this was portrayed as a major success in fraud detection, it was an attack that should never have got that far.
In March 2005, Sumitomo Mitsui bank was forced to reveal that criminals had infiltrated the systems of its London offices with spyware. The rogue software had been placed in the bank's systems in October 2004 and, for an unspecified period, had been logging the keystrokes of a group of employees, allowing the fraudsters to monitor user names and passwords. In several incidents, the thieves then illicitly entered the bank's London systems, signed in as one or more employees and attempted to transfer an estimated £220 million into at least 10 offshore bank accounts.
Alerted by UK authorities, Israeli police arrested a 32-year-old man, Yeron Bolondi, on money-laundering charges after an attempt to transfer £13.9 million to an account in Israel.
The bank said that it had not "suffered any financial damage" and that it had "undertaken various measures in terms of security."
How did the spyware get into the system? According to one theory, contract staff working in the offices in the evening may have attached small listening devices to the keyboard sockets of some staff computers. These 'keyloggers' could then record passwords and other log-in details and be removed and their data downloaded on a nightly basis.
3. Department for Work and Pensions
Lack of change management and testing meant a local PC operating system upgrade was erroneously applied organisation-wide, disabling host access
It was not the most serious of incidents, but it will long remain the canonical example of how a local upgrade can take tens of thousands of PCs offline.
On Monday 22 November 2004, a systems engineer at the UK's Department for Work and Pensions tried to apply a Windows XP operating system patch to several local PCs as part of a "routine software upgrade". However, without the engineer realising, the upgrade spread to 80,000 of the government department's PCs across the UK (most of which were running Windows 2000) and in the process disabled access to the mainframe back-ends used for processing benefit applications, new pension claims and other core applications. A third of the DWP's computer network, including the internal email system, was affected.
The government suggested that the event was "blown out of all proportion", as regular payment systems had not been affected, and it highlighted how its disaster recovery programme had kicked in as planned. However, over the three days the systems were hit, staff were unable to process some 60,000 claims as normal. Unable to boot up their PCs and start Outlook mail, they were forced to revert to communicating by fax.
EDS, which runs the DWP systems under an outsourcing contract, called in Microsoft to help identify the problem. The rolling restore was still underway four days after the erroneous upgrade.
So how could "the biggest computer crash in government history" have been avoided? The best theory is that the engineer applied the XP patch in the form of specific Dynamic Link Libraries (DLLs), overwriting Windows 2000 system DLLs with the XP ones. A set of change management processes and accompanying software should have ensured that adequate testing was carried out in advance of the upgrade, and that it was applied only to XP machines.
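By way of illustration only, the sketch below (written in Python, with a hypothetical inventory format rather than the DWP's or EDS's actual tooling) shows the kind of pre-deployment guard that targets a patch only at machines running the operating system it was built for.

```python
# Illustrative sketch only: a pre-roll-out guard of the kind the DWP upgrade lacked.
# The Target class and the inventory below are hypothetical, not real DWP systems.
from dataclasses import dataclass


@dataclass
class Target:
    hostname: str
    os_version: str  # e.g. "Windows XP SP2", "Windows 2000 SP4"


def eligible_for_xp_patch(target: Target) -> bool:
    """A patch built from XP DLLs must only go to machines already running XP."""
    return target.os_version.startswith("Windows XP")


def plan_rollout(estate: list[Target]) -> tuple[list[Target], list[Target]]:
    """Split the estate into machines to patch and machines to leave alone."""
    to_patch = [t for t in estate if eligible_for_xp_patch(t)]
    to_skip = [t for t in estate if not eligible_for_xp_patch(t)]
    return to_patch, to_skip


if __name__ == "__main__":
    estate = [
        Target("desk-001", "Windows XP SP2"),
        Target("desk-002", "Windows 2000 SP4"),  # must never receive XP system DLLs
    ]
    to_patch, to_skip = plan_rollout(estate)
    print(f"Patching {len(to_patch)} machine(s); leaving {len(to_skip)} untouched.")
```

Filtering the estate in this way, and rehearsing the change against a small test group first, is the discipline the erroneous upgrade bypassed.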
4. National Air Traffic Services
Age of IT infrastructure meant tests were run on a live system without adequate impact analysis
It was a relatively small-scale task but one that had a nationwide impact. Software engineers from National Air Traffic Services (NATS) arrived at the West Drayton air traffic control centre, near Heathrow, before 3am on the morning of 3 June 2004. Their goal: to test an update that was due for live roll-out later that summer.
That involved bringing down the complex 30-year-old live Flight Data Processing System and running a 45-minute test. All appeared to go as planned, but when the live system was restarted at 6.03am, controllers at NATS's centre in Swanwick, near Southampton, started to report 'errors' in flight data.
Fearing mid-air collisions, air traffic management halted all take-offs as the pre-test system was restored. But the hour-long outage triggered widespread flight cancellations and delays that hit 200,000 passengers across the UK during the morning peak.
The incident was the fourth major air traffic service failure in three years, and cost airlines several million pounds as tight flying schedules were thrown into chaos and planes and crew were stuck in the wrong locations.
5. MFI
Poorly executed implementation of a supply chain package had repercussions for the whole organisation
The 'company history' page on furniture retailer MFI's web site makes interesting reading for any observer of the problems that hit the company in 2004.
"By early 2004, a major SAP implementation project for new systems throughout the group was concluded," it reads. As if cut from the SAP brochure, it goes on to list the benefits that the system will deliver: "lower stock levels, fewer failed deliveries, improved levels of customer service and increased efficiency and profitability."
When MFI delivered a trading statement in September 2004 – warning its UK retail operations would record a substantial loss for the year – its executives had changed their tune, blaming SAP supply chain software, in which the company had invested £50 million, for the abrupt reversal of fortunes.
In particular, they said, "significant issues" with the newly implemented system, which was expected to save the company some £35 million per year, had left it unable to fulfil customer orders and had actually imposed extra costs – both in fixes to the system and in compensating disappointed customers. Those costs amounted to a further £30 million. SAP was quick to deny that there was anything wrong with its software. But MFI executives paid the price of failing to oversee the implementation with due care. By September 2004, two of them – financial director Martin Clifford-King and chief operating officer Gordon MacDonald – were forced to resign.
In the end MFI paid £36 million in refunds to customers for missed deliveries as a direct result of the supply chain snafu. By February 2005, MFI was claiming that its supply chain systems were "stabilising", after "critical software" components were rewritten.
The MFI and ICI cases raise the question: are supply chain systems an implementation minefield? To some extent they are, says Alexi Sarnevitz, an analyst at IT market watcher AMR. "Supply chain failures are attributable to some key issues: a poor business case; poor planning; poor uptake by end-users; and a weak underlying infrastructure."
6. London Stock Exchange
A lack of awareness of the complexity of batch software execution caused massive data corruption
It must have been the longest day in the history of the London stock market. On the morning of 5 April 2000, at the height of the hectic dot-com trading days and amid the surge in trading that accompanies the closing days of the tax year, the core systems at the London Stock Exchange crashed.
For eight hours the exchange was paralysed, and investors, many of whom were counting on a full day's trading to buy and sell shares for capital gains purposes before the April tax deadline, were left fuming. Some even petitioned the Inland Revenue to extend the tax year by a day to take account of the exchange's systems failure.
The crash was the worst IT event experienced by the exchange since the Big Bang switch over to electronic trading in 1986. The root of the failure lay in the exchange's London Market Information Link (LMIL) that delivered real-time stock prices to the market and 80,000 terminals worldwide.
During the night of 4/5 April, one of the exchange's 400-odd batch jobs started to run slowly. The job in question removed old share prices from the reference data sent to clients at 5am on each trading day (three hours before the market opens).
As the batch overran, new and old prices were combined in the same file and then dispatched to clients. Just before the market opened, traders around the world started frantically calling the exchange saying they had spotted major inaccuracies in the data they were seeing on screens.
Realising there was no way they could unscramble the data (and with no back-up available), officials were forced to close down the exchange. It took eight hours for the exchange's IT team to reboot the system and restart all the related servers it links to. They also had to purge all the data that had been corrupted by the rogue batch job.
As a concession, the exchange extended trading till 8pm, but only half the normal volume of trades was completed by then.
What was the root cause of the problem? Investigators found that the program in question was highly inefficient in its structure and execution. Moreover, it drew on US trading information and, because of the huge volume of trades the previous day, largely inspired by dot-com fever, it had been unable to cope.
Fixing the batch software involved rewriting a couple of lines of code – "absolutely trivial," said Chris Broad, the London Stock Exchange's head of service development at the time.
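The deeper failing, though, was the absence of any safeguard between the overrunning job and the clients. As a purely illustrative sketch (the field names and dispatch call are assumptions, not the exchange's actual systems), a completeness check along the following lines would have held back a reference file that still contained the previous day's prices:

```python
# Illustrative sketch only: a guard that refuses to dispatch reference data
# containing stale prices. Field names and functions are hypothetical.
from datetime import date


def reference_file_is_clean(records: list[dict], trading_day: date) -> bool:
    """The file is clean only if every price carries the current trading date."""
    return all(record["price_date"] == trading_day for record in records)


def send_to_clients(records: list[dict]) -> None:
    # Placeholder for the real dispatch to client terminals worldwide.
    print(f"Dispatching {len(records)} records to clients.")


def dispatch_reference_data(records: list[dict], trading_day: date) -> None:
    if not reference_file_is_clean(records, trading_day):
        # Hold the file and alert operations rather than send corrupt data
        # to tens of thousands of terminals.
        raise RuntimeError("Stale prices found in reference file; dispatch aborted.")
    send_to_clients(records)


if __name__ == "__main__":
    today = date(2000, 4, 5)
    clean_file = [{"symbol": "ABC", "price_date": today}]
    dispatch_reference_data(clean_file, today)
```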
But the cost was put at millions of pounds in lost transaction fees and, for investors, in the potentially lower tax bills they missed the chance to secure before the year-end deadline.
7. Inland Revenue
Failure in capacity planning and testing of systems
The psychology of tax return filing is not a new science. A percentage of people (a large percentage) will always put the painful task off until the last minute. And online filing has only made that procrastination easier.
Over the last few years the Inland Revenue has watched the number of people filing their tax returns online grow rapidly – the cost of processing an online return is a fraction of that of manually processing a paper form. But the volume of online filings that flooded in during the build-up to the closing date for 2004 tax returns (31 January 2005) was somehow unexpected and simply too much for the Revenue's systems – even after it had spent millions of pounds on advertising to encourage people to complete their returns online.
Many trying to file online in the last days of January were met with an error message, and callers to helplines were simply told to keep trying. Initially, the Revenue said it would not extend the deadline, but when it realised how many people were being turned away by the system, it extended the deadline by two weeks, to 14 February.
Moreover, the heavy site traffic was having another negative effect: some people filing incomplete tax returns were not presented with an error message, and so unknowingly missed the deadline.
8. Cahoot
A failure to adequately test an upgrade resulted in a breakdown of the password system, exposure of customer account data and significant brand damage
On 23 October 2004, online bank Cahoot implemented a major upgrade to its service, with several key enhancements to security.
For one, to thwart spyware attacks, customers were asked to start using drop-down lists to input letters of their password rather than entering the whole password on the keyboard, so that keystrokes could not be recorded. The web site upgrade seemed to go smoothly; then, 12 days later, the bombshell hit.
A Cahoot customer contacted the BBC claiming he had discovered a flaw that allowed any site visitor to view other customers' accounts simply by guessing a customer's user name and altering the URL in the browser's address bar – no password was required. As the story became national news, Cahoot was forced to close the site for 10 hours while the problem was fixed.
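The flaw described is a classic failure of server-side authorisation: the application trusted an account identifier supplied in the URL instead of checking it against the logged-in user. A minimal sketch of the missing check, using hypothetical names rather than anything from Cahoot's systems, looks like this:

```python
# Illustrative sketch only: never trust an account identifier taken from the URL;
# verify it belongs to the authenticated session. All names are hypothetical.
def get_account_page(session_user_id: str, requested_account_id: str,
                     accounts_by_owner: dict[str, set[str]]) -> str:
    """Return account details only if the requested account belongs to the
    authenticated user, regardless of what the browser's address bar says."""
    owned_accounts = accounts_by_owner.get(session_user_id, set())
    if requested_account_id not in owned_accounts:
        return "403 Forbidden"
    return f"Account details for {requested_account_id}"


if __name__ == "__main__":
    ownership = {"alice": {"acct-100"}, "bob": {"acct-200"}}
    # A visitor authenticated as 'alice' edits the URL to request bob's account:
    print(get_account_page("alice", "acct-200", ownership))  # prints "403 Forbidden"
```

Had every account request been subject to a check of this kind, guessing a user name and editing the URL would have led nowhere.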
Cahoot said that despite the seriousness of the loophole, none of its 650,000 customers were likely to lose money as the lapse in password security was only partial and did not enable the transfer of money between accounts. However, the breach shook confidence in the bank's security system and was referred to the Data Protection registrar.
Tim Sawyer, head of Cahoot, suggested the bank was humbled by the experience. "I believe that we need to look closely at our processes because this has not been our greatest moment," he said at the time. "We did not fail as an organisation because there was no risk of financial loss, but we do need to learn lessons from this."
One lesson lay in damage limitation. Many customers were not so much outraged by the breach as upset that they heard about it in the media rather than from Cahoot directly.
"As soon as we discovered it, we did testing and sorted out a solution. It has now been fixed and thoroughly tested," the company said as it closed the stable door.
A solution, according to Tim Pickard, strategic marketing director for RSA Security, was clear: "Username and password security is totally inadequate for today's ecommerce. [They should] have committed to the next generation of secure log-on services: strong, two-factor authentication, incorporating something that the user knows and something that the user has, would dramatically improve the security of consumers in this type of environment."
And Vik Desai, CEO of web application security provider Kavado, added another view: "This security breach could easily have been prevented by installing web application firewalls which prevent applications allowing unauthorised access, even in the event of the IT department making a mistake."
"In this instance the technology would have prevented access to account details without the user name and password being supplied, and secondly would have alerted the bank to the security problem in the system upgrade," he says.
9. Barclays
The introduction of new hardware without a full understanding of its potential impact triggered a massive shut down of systems
At first it was thought that someone at Barclays had put the clocks back instead of forward. At 1am on Sunday morning, 27 March 2005, the bank's 1,500 ATMs across the south of England suddenly went blank. The timing was bad: Easter Sunday.
The failure of "a small piece of hardware" caused a chain reaction that brought down the mainframe and server infrastructure that powers the ATM network and Barclays' web and telephone banking. Moreover, business continuity systems failed to kick in.
With many key staff on holiday, it took the company 18 hours to bring the systems back up, during which time around five million customers had no access to their accounts.
10. HFC Bank
Weak customer emailing policies caused the distribution of personal data
Customers of HFC Bank were left confused and annoyed in September 2004 when an email error spammed their personal details to thousands of other account holders.
The problem occurred when HFC sent "urgent" emails to 2,600 customers: an email setting meant that each recipient's address was visible to everyone else on the list.
That was not an issue for most people, but where the email reached customers who had set up "out-of-office" auto-replies, often containing addresses and other personal details, those replies were sent to everyone on the list.
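Assuming the root cause was that the recipient addresses were placed in a visible header rather than blind-copied (the bank did not spell this out), a minimal sketch of the safer approach is to send each customer an individual message so that no address is ever exposed to another recipient:

```python
# Illustrative sketch only: send a bulk notice without disclosing any customer's
# address to any other. The sender address and SMTP host are placeholders.
import smtplib
from email.message import EmailMessage


def send_notice(recipients: list[str], subject: str, body: str,
                smtp_host: str = "localhost") -> None:
    """Send one message per recipient; nothing goes in a shared 'To' or 'Cc' field."""
    with smtplib.SMTP(smtp_host) as smtp:
        for address in recipients:
            msg = EmailMessage()
            msg["From"] = "notices@example-bank.test"  # placeholder sender
            msg["To"] = address                        # the individual recipient only
            msg["Subject"] = subject
            msg.set_content(body)
            smtp.send_message(msg)
```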
The bank credited affected people's accounts with £50 compensation.