IT Directors should be prepared… be very prepared

At the onset of summer 2003, IT directors were sleeping soundly in their beds, no longer haunted by the spectre of disaster recovery. After 9/11, the majority of large companies had revised their fail-safe back-up systems and processes. And it seemed that the IT world was more prepared than ever to keep organisations humming – under any circumstances.

Try telling that to the thousands of British Airways passengers whose flights were cancelled or delayed one chaotic morning in September, after a power failure at Heathrow shut down the computer systems responsible for check-in and baggage services. Or patients at an Ottawa cancer centre in Canada, where critical servers were waterlogged after the air conditioning system switched off and began leaking during a recent blackout. Or office workers in Sydney, Australia, shut out of their buildings for a day after a power failure deactivated computerised locking systems.

Add to these recent cases the effect of the SoBig and MSBlast computer viruses – thought to have infected up to 30,000 systems an hour during their July peak – not to mention the almost unprecedented blackout in North America the following month, itself possibly caused by a virus-damaged server, and the picture is one of failure in terms of the security, resilience and redundancy of the world’s IT infrastructure.

The problem with power

Blackout

A blackout is a total loss of utility power.

Cause: Blackouts are caused by excessive demand on the power grid, lightning storms, ice on power lines, car accidents, backhoes, earthquakes and other catastrophes.

Effect: Current work in RAM or cache is lost. The hard drive's File Allocation Table (FAT) may also be lost, which results in total loss of the data stored on the drive.

Sags

Also known as brownouts, sags are short-term decreases in voltage levels. This is the most common power problem, accounting for 87% of all power disturbances, according to a study by Bell Labs.

Cause: Sags are usually caused by the start-up power demands of many electrical devices (including motors, compressors, elevators and shop tools). Electric companies also use sags to cope with extraordinary power demands. In a procedure known as rolling brownouts, the utility will systematically lower voltage levels in certain areas for hours or days at a time. Hot summer days, when air conditioning requirements are at their peak, will often prompt rolling brownouts.

Effect: A sag can starve a computer of the power it needs to function, causing frozen keyboards and unexpected system crashes, both of which result in lost or corrupted data. Sags also reduce the efficiency and life span of electrical equipment, particularly motors.

Spike

Also referred to as an impulse, a spike is an instantaneous, dramatic increase in voltage. Akin to the force of a tidal wave, a spike can enter electronic equipment through AC, network, serial or phone lines and damage or completely destroy components.

Cause: Spikes are typically caused by a nearby lightning strike. Spikes can also occur when utility power comes back on line after having been knocked out in a storm or as the result of a car accident.

Effect: Catastrophic damage to hardware occurs. Data will be lost.

Surge

A surge is a short-term increase in voltage, typically lasting at least 1/120 of a second.

Cause: Surges result from the presence of high-powered electrical motors, such as those in air conditioners and household appliances, in the vicinity. When this equipment is switched off, the extra voltage is dissipated through the power line.

Effect: Computers and similar sensitive electronic devices are designed to receive power within a certain voltage range. Anything outside the expected peak and RMS (root mean square, considered the average voltage) levels will stress delicate components and cause premature failure.

Noise

More technically referred to as Electro-Magnetic Interference (EMI) and Radio Frequency Interference (RFI), electrical noise disrupts the smooth sine wave one expects from utility power.

Cause: Electrical noise is caused by many factors and phenomena, including lightning, load switching, generators, radio transmitters and industrial equipment. It may be intermittent or chronic.

Effect: Noise introduces glitches and errors into executable programs and data files.

Source: American Power Conversion

 

 

Wake up call

The huge blackouts in London and North America stripped away any lasting veneer of impregnability to reveal IT systems more vulnerable than even the most pessimistic had suspected.

Companies affected by blackouts are usually reluctant to disclose their experiences, in an attempt to protect their reputation with customers and partners. But some common threads have emerged from the recent crises.

An astonishing three-in-four US companies in the affected areas surveyed by Info-Tech Research said they were disrupted by the North American blackout in some form, either directly or through one or more of their suppliers’ systems going dead. Of those, the vast majority admitted they were ill prepared for a crisis on that scale. “This blackout demonstrated that most IT departments, especially those in mid-sized companies, are still flying by the seat of their pants,” says Jason Livingstone, an Info-Tech analyst. “Disaster recovery is simply not on their list of priorities.”

That view might seem unfair but, alas, it is not entirely so. There is mounting evidence that some organisations are failing to plan for business continuity events at all; and of those with plans in place, many are failing to properly test or enforce them.

A recent survey of UK businesses – carried out by Infoconomy, the publisher of Information Age, in association with American Power Conversion (APC), a vendor of batteries, back-up systems and generators – found that 65% had suffered business disruption due to a power outage. And yet 62% still did not have an overall strategy in place for addressing such events.

“We believe that at least 90% of UK companies have no form of contingency planning in place,” says Jim Simmons, CEO of SunGard Availability Services, a business continuity specialist. “Only 8% of organisations without business continuity plans can expect to survive a ‘disaster’. For those companies, the power outages [in London and North America], occurring just weeks before the second anniversary of 9/11, must have been a huge wake-up call.”

Another problem is that companies are often failing to carry out regular simulations of business continuity events. About one-in-four IT directors in the UK either do not know when their business continuity management plan was last tested or think it was probably more than a year ago, according to a recent survey by storage vendor Hitachi Data Systems (HDS).

But even having a detailed and well-rehearsed plan in place does not guarantee protection. Human error still has to be factored in. Moreover, business continuity projects are often funded off-budget – raising the possibility that failures in IT management practices will creep into the system. The HDS study found that IT directors placed human error behind fire as the most likely cause of business continuity events. (Interestingly, this does not tally with the most common causes of disaster recovery invocations – see table, ‘Causes of UK disaster recovery invocations’.)

Of course, IT processes are far from being error-free. Internet service provider Lycos’s email services recently went down for four days, after mistakes were made during a routine job to load new backup software on to a web server.

Neil Rasmussen, APC’s senior vice president and chief technical officer, says the kinds of mistakes that IT experts can make are wide and varied. “Some protect their servers [with uninterruptible power supply] but forget about the hubs. Some overlook ‘back doors’. Many do not have sufficient ‘runtime’ [back-up power capacity]. Others don’t install the management software,” he says.

Enterprises should also guard against complacency, says John Sharp, chief executive of the Business Continuity Institute (BCI), which seeks to raise awareness of business continuity matters and has developed a code of practice for organisations on how to prevent and cope with business disasters. “There is often a sense that ‘this will never happen to us’. One common view is that, if I am sitting in an office in East Grinstead, then I’m not going to be affected by a terrorist bomb in London. Well, you’d better think again,” he says.

Lessons not being learned

When the lights went out in North America on 14 August 2003, plunging more than 50 million people into darkness, the first thing that many did was try to call family and friends on their mobile phones. When that didn’t work, the next step was to try to email. But like large sections of the wireless network, many Internet service providers were also down. It did not take long for people to realise that network operators, ravaged by the telecoms downturn, had cut investment in back-up power and diesel generation facilities. “Bad engineering? No, greed and bad financial decisions,” wrote former BT chief technologist Peter Cochrane in one particularly acerbic article.

 

Causes of UK disaster recovery invocations
Hardware/software 67%
Power failure 16%
Infrastructure/communications fault 5%
Terrorist action 3%
Environmental factor 3%
Fire 3%
Theft/vandalism 2%
Site access 1%
Source: SunGard Availability Services

The cost of providing adequate business continuity infrastructure is often compared to insurance and other risk-management steps. Such thinking can have dangerous consequences, says the BCI’s Sharp, since costs can always be cut during difficult times. Preventing problems is a good idea since so many recoveries fail – some studies put the failure rate as high as 50%. Avoiding disasters may also keep a business afloat. Surveys show that small businesses often go under while waiting for an insurance payout. Other businesses suffer fatal damage to their reputation. And it is not always possible to keep events out of the public eye. In one case in Birmingham in the early 1990s, the firebombing of a law firm was reported on the local news, prompting clients to flee in droves. The firm survived – just.

There may be a certain amount of 20:20 hindsight involved here and, admittedly, not all disasters can be easily anticipated. A case in point was the initial outbreak of the SARS virus in Southeast Asia, which caused widespread disruption to businesses. Some IT workers in Hong Kong were quarantined in their homes. In Singapore, several banks were so worried about having their premises quarantined that they set up impromptu back-up IT departments within Hewlett-Packard’s local business continuity centre.

Now, with the SARS virus seemingly contained, Singapore’s government has moved swiftly to avert a repeat of the business disruption. From 2004, it will become the first country in the world to certify companies as complying with business continuity standards, based on a code drawn up by the BCI. But regulators elsewhere have not yet grasped the nettle. Although the BCI code has been translated into many languages, it seems unlikely that the UK and other countries will be adopting similar standards in the short term, says Sharp. At least new regulations governing records-retention and risk-management procedures, such as Sarbanes-Oxley and Basel II, may ultimately plug this gap.

Financial cost of downtime
Industry | Application | Average cost per hour of downtime
Financial | Brokerage operations | $7,840,000
Financial | Credit card sales | $3,160,000
Media | Pay-per-view | $183,000
Retail | Home shopping (TV) | $137,000
Retail | Catalogue sales | $109,000
Transportation | Airline reservations | $108,000
Entertainment | Tele-ticket sales | $83,000
Shipping | Package shipping | $34,000
Financial | ATM fees | $18,000
Source: Hewlett-Packard
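
To give a feel for what the hourly figures in the table above imply, the short Python sketch below multiplies them by the length of an outage. It is only a rough illustration: the function name `outage_cost` and the outage durations are hypothetical, and the assumption that cost scales linearly with time is ours, not Hewlett-Packard's.

```python
# Average cost per hour of downtime in US dollars, copied from the table above.
HOURLY_DOWNTIME_COST = {
    "Brokerage operations": 7_840_000,
    "Credit card sales": 3_160_000,
    "Pay-per-view": 183_000,
    "Home shopping (TV)": 137_000,
    "Catalogue sales": 109_000,
    "Airline reservations": 108_000,
    "Tele-ticket sales": 83_000,
    "Package shipping": 34_000,
    "ATM fees": 18_000,
}


def outage_cost(application: str, hours: float) -> float:
    """Estimate outage cost, ASSUMING cost grows linearly with duration."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return HOURLY_DOWNTIME_COST[application] * hours


if __name__ == "__main__":
    # Hypothetical outage durations, chosen only for illustration.
    for app, hours in [("Airline reservations", 4), ("ATM fees", 0.5)]:
        print(f"{app}: ${outage_cost(app, hours):,.0f} for a {hours}-hour outage")
```

In practice losses are unlikely to be linear (a short outage at a peak trading hour may cost far more than a long one overnight), so the result is best read as an order-of-magnitude guide.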
 

Best practice

As the events of the past two years have demonstrated, it is virtually impossible for a business to plan for each and every eventuality. But at least trends in technology can help. The movement in the IT industry towards redundant networks and computer systems, storage area networks and long-distance data replication should ease many of the difficulties, as should improvements in IT security, at both the application and the network level, say experts.

Brian Fowler, HP’s global director of business continuity services, is bullish about the future. “There was an upsurge in sales even before the blackouts,” he says. “The horrific events of September 11 really did show people that they need to make every effort to protect their systems.” Business continuity is now one of the leading priorities for IT directors, he says, which explains why HP has made it one of the key strands of its ‘adaptive enterprise’ strategy, underpinned by technologies such as server clustering designed to remove single points of failure.

 
 

Key objectives of effective BCM strategy

1. Ensure safety of staff

2. Maximise the defence of the organisation’s reputation and brand image

3. Minimise the impact of business continuity events (including crises) on customers/clients

4. Limit/prevent impact beyond the organisation

5. Demonstrate effective and efficient governance to the media, markets and stakeholders

6. Protect the organisation’s assets

7. Meet insurance, legal and regulatory requirements

Source: The Business Continuity Institute (www.thebci.org)

 

But even customers of companies such as HP, SunGard and APC can ill afford to sleep easy. Blackouts are usually more common in the winter, when there are greater demands on the electricity grid. Energy experts are already predicting a fresh spate of outages in the winter months of 2004, caused, it is argued, by under-investment in infrastructure since electricity markets were deregulated.

Ironically, perhaps, developments in IT and other areas of advanced electronics may be adding to the problem. Some estimates have shown that 70% of power consumption today is industrial grade, with the rest coming from sensitive electronics, such as PCs and televisions. In the next 10 years, some energy experts believe this position could be inverted – suggesting that demand for power-hungry computing devices will put even greater pressures on an already overloaded power grid.

 

Worldwide business continuity market
Year Spending
2001 $66 billion
2006 $155 billion
Source: IDC
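
The two IDC data points above imply a yearly growth rate, which the sketch below works out. It assumes steady compound growth between 2001 and 2006, which the figures themselves do not confirm; the function name `implied_annual_growth` is ours.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # $66 billion in 2001 and $155 billion in 2006, as in the table above.
    rate = implied_annual_growth(66, 155, 2006 - 2001)
    print(f"Implied growth: {rate:.1%} a year")
```

On these numbers the market would need to grow by a little under a fifth each year to reach the 2006 figure.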
 
 

The BCI says that all organisations will be affected by a business continuity event of some description one day. Some will rise to the challenge; many will fail. When the crisis threatens, the worst thing an affected company can do is stick its head in the sand, says Sharp. “It is imperative to always move as quickly as the crisis. Never let it get ahead of you,” he says. Instead, organisations should quickly draw up an action plan and let customers and suppliers know that there is a potential problem, and that it is being dealt with. “Clients will always be sympathetic if they feel that something is being done about their [sales] order. Above all, they don’t want your crisis to become their crisis.”

 

The six stages of the business continuity management (BCM) lifecycle

1. Understanding your business

  • Business impact analysis
  • Risk assessment and control

2. BCM strategies

  • Corporate BCM strategy
  • Process-level BCM strategy
  • Resource recovery BCM strategy

3. Developing and implementing a BCM response

  • Plans and planning
  • External bodies and organisations
  • Crisis/BCM event/incident management
  • Sourcing (intra-organisation and/or outsourcing providers)
  • Emergency response and operations
  • Communications, PR and the media

4. Building and embedding a BCM culture

  • An ongoing programme of education, awareness and training

5. Exercising, maintenance and audit

  • Exercising of BCM plans
  • Rehearsal of staff, BCM teams
  • Testing of technology and BCM systems
  • BCM maintenance
  • BCM audit

6. The BCM programme

  • Board commitment and proactive participation
  • Corporate BCM strategy
  • BCM policy
  • BCM framework
  • Roles, accountability, responsibility and authority
  • Finance
  • Resources
  • Assurance
  • Audit
  • Management information system: metrics/scorecard/benchmark
  • Compliance: legal/regulatory issues
  • Change management

Source: The Business Continuity Institute (www.thebci.org)


Spending on external business continuity services*
1999 $210,000
2003 $340,000
Source: IDC. *Average expenditure by large companies in Europe


Proportion of IT budget for business continuity
2003 7%
2006 10%
Source: IDC


    Ben Rossi

