In July 2024, a faulty update from security firm CrowdStrike reached around 8.5 million Microsoft Windows devices and crippled IT systems around the world. The impact was immediate, with trains and planes grounded and many organisations – including hospitals, retailers and banks – left unable to function.
The threat of a major outage is significant enough to give any IT professional sleepless nights, but this is compounded by the risk of legal action. “An IT outage that leads to service disruption might constitute a breach of service level agreements, which could result in penalties, refunds or other compensatory measures,” says Shane Maher, managing director of managed services provider Intelliworx.
“In addition, businesses in certain industries – such as professional services, healthcare or other sensitive sectors like payments processing – will have strict security standards that must be adhered to. An IT outage affecting compliance with these standards can again result in major fines and penalties.” Additional risks come from regulations such as GDPR, where organisations must meet strict guidelines around how they respond to a breach.
The reality is that, in the case of technology suppliers such as Microsoft and other large providers, organisations have relatively little control over the implementation of updates, says James Watts, managing director at Databarracks. But that doesn’t mean there’s nothing businesses can do in such situations, or lessons they can learn.
“You need to know how your organisation can continue to operate if an IT service or application fails,” he says. “In the case of systems outside your control, that will often mean manual workarounds to maintain operations.
“With many software-as-a-service products, you can take a backup of your data which serves your governance, risk and compliance purposes, but you can’t run that application anywhere else. Practically, you are waiting for the supplier to bring the service back online. For cloud services at the infrastructure-as-a-service level, you have the control to build in as much resilience as you are willing to pay for. It’s a balance of cost and risk, so you choose your solution based on uptime requirements and risk appetite.”
Not the same across the board
With other updates, though, organisations have more options. “By architecting a system correctly, you can significantly reduce the risk posed by relying on services that are outsourced, or out of your control,” says Tony Hasek, CEO and co-founder of physical network isolation cyber company Goldilock. “Network segmentation, for example, is a crucial layer of protection that IT teams can implement to ensure updates or changes made by external providers are ‘accepted’ internally before being rolled out.”
An example would be an airport that uses edge devices with a user port and a separate admin port, which is used to update systems and applications. “Any updates through the admin port will be stopped and subject to internal review before being rolled out,” he says. “This prevents forced updates from external service providers.”
Adopting a staggered approach to software and configuration updates is another option. “Updates are sent first to a small pool of devices and the effects observed via telemetry,” explains James Kretchmar, SVP and chief technology officer at Akamai. “Updates then proceed to wider stages of deployment only when it’s clear the effects have been positive. Keeping small problems from becoming big problems is the name of the game.” But a recent survey – ironically by CrowdStrike – suggests that only 54 per cent of organisations review major updates to software applications.
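The staggered approach Kretchmar describes can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's actual deployment process: the function names, the cumulative stage fractions and the one per cent failure threshold are all assumptions made for the example, and `apply_update` and `is_healthy` stand in for whatever deployment and telemetry tooling an organisation really uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StageResult:
    stage: int            # 1-based stage number
    devices: list[str]    # devices updated in this stage
    failure_rate: float   # share of this stage's devices reporting unhealthy


def staged_rollout(
    devices: Sequence[str],
    stage_fractions: Sequence[float],      # cumulative, e.g. (0.01, 0.1, 0.5, 1.0)
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one device
    is_healthy: Callable[[str], bool],     # hypothetical: telemetry check after the update
    max_failure_rate: float = 0.01,        # assumed threshold for halting the rollout
) -> tuple[bool, list[StageResult]]:
    """Update progressively larger pools of devices, halting if a stage looks unhealthy."""
    results: list[StageResult] = []
    done = 0  # number of devices updated so far
    for stage, fraction in enumerate(stage_fractions, start=1):
        # Each stage covers devices up to the cumulative fraction, and at least one device.
        target = min(len(devices), max(done + 1, round(len(devices) * fraction)))
        batch = list(devices[done:target])
        if not batch:
            break
        for device in batch:
            apply_update(device)
        failures = sum(1 for device in batch if not is_healthy(device))
        rate = failures / len(batch)
        results.append(StageResult(stage, batch, rate))
        done = target
        if rate > max_failure_rate:
            # Stop here: the problem stays confined to the devices updated so far.
            return False, results
    return done == len(devices), results
```

With 100 devices and the fractions above, the first stage touches a single device and the second only nine more, so a bad update is caught after reaching a tenth of the estate rather than all of it. In practice the health check would also need a soak period between stages, which this sketch leaves out.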
What can we learn from the CrowdStrike incident?
There are other lessons organisations can learn from the CrowdStrike incident, particularly around how they respond to any kind of outage. “Treat the risk of failure as a ‘when’ not an ‘if’ problem,” advises Dafydd Vaughan, chief technology officer at Public Digital and co-founder of the UK Government Digital Service. “This means thinking about how you can quickly restore or recover services that are affected, and how you can minimise the disruption while that recovery happens.”
This involves both understanding what the most critical IT elements are and simulating an outage to test the response, says Adam Stringer, a digital resilience consultant at PA Consulting. “In the heat of an outage, you need to understand which services to focus efforts on, including technology, process and suppliers that support those services,” he says. “Simulating an outage – be it cyber-attack, failed change or supply chain failure – will mean you’re better prepared when the unthinkable happens.”
Organisations can also take steps to reduce their exposure to particular services. “In this case, having failover systems that leveraged Apple or Linux software could have prevented the outage, or at least significantly reduced the downtime,” says Wes Loeffler, director of third-party risk management at Fusion Risk Management. “In the case of the CrowdStrike Falcon endpoint security software, there are alternatives that are frequently used, such as Cisco Endpoint Security, SentinelOne, Trellix and others.”
Yet while these may provide more control, there are often trade-offs which make them unsuitable for most organisations, says Watts. “Using less popular technologies raises different challenges with skills, support and interoperability,” he contends.
“The cloud market is an oligopoly, which introduces concentration risk challenges because of the interdependency of the supply chain. You may not use AWS directly, but inevitably – whether it’s your suppliers or SaaS providers – someone in your supply chain does. It’s now even more important to look at your supply chain and downstream dependencies.”
Vaughan points out that all major cloud providers have faced major outages in the last year, and argues the bigger picture is one of more frequent and disruptive cyber-attacks. “At the same time, our services are becoming ever more interconnected,” he says.
“This is powerful and brings extraordinary value to businesses, but it can also mean that isolated problems cascade into waves of impact far beyond their initial blast zone. An incident like CrowdStrike, whether deliberate or accidental, will happen again. Businesses now must plan how they will handle it.”
There are lessons here for cloud providers, too, says Aron Brand, CTO of CTERA. “The cloud industry needs to aim for a new benchmark: space-grade reliability,” he asserts. “Space technology, designed to operate in the most unforgiving environments with minimal opportunity for physical intervention, represents the pinnacle of reliability engineering. By aspiring to this standard, cloud providers can push themselves to implement more rigorous testing, redundancy and fail-safe mechanisms that go far beyond current practices.”