British Airways has been having a rough year. Between an evening of downtime back in April and its catastrophic bank holiday IT meltdown, the company has suffered millions of pounds in lost business.
It’s hardly alone in that respect: dozens of companies, from Skype to HSBC, have suffered from unexpected outages and extended downtime in recent months, causing untold financial losses.
For the Fortune 1000, the average cost of unplanned application downtime is between $1.25 billion and $2.5 billion per year. Despite this cost, most businesses assume that downtime is unavoidable, simply part of the cost of doing business online. That's not the case.
Monitoring solutions that quickly tell you when your service is down have been around for years, but technologies such as predictive analytics and machine learning have only recently reached the point where we can not only know when we are offline, but anticipate when we will be.
Don’t react, predict
Current methods of managing service downtime are stuck in the past – literally. Existing monitoring solutions tend to focus on anomaly detection, outlier detection and basic uptime metrics. These are certainly important things to track; however, at a fundamental level they examine past faults rather than future performance.
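To make the reactive pattern concrete, here is a minimal sketch of a traditional uptime check in Python. The endpoint URL is hypothetical, and real monitoring tools are far more sophisticated, but the underlying logic is the same: by the time the alert fires, the fault has already happened.

```python
import requests  # assumes the 'requests' library is installed

SERVICE_URL = "https://example.com/health"  # hypothetical health endpoint

def check_uptime(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service responds successfully within the timeout."""
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    if not check_uptime(SERVICE_URL):
        # By the time this fires, customers may already be affected.
        print("ALERT: service is down")
```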
The old saying applies just as well here: prevention is better than cure. When downtime can impose untold costs on businesses, preventing it from ever occurring should be the ultimate goal. This may sound unrealistic, but, thanks to new technologies, it's closer than you think.
Machine learning: behind the buzzword
Machine learning is a buzzword: there's no escaping it. However, even the most overused buzzwords can conceal real value in the underlying technology. This is true of machine learning: it is a diverse set of techniques that can be applied to myriad problems, yet it has not yet been widely used to improve service uptime.
In the past it was possible for operations engineers to become “experts” in specific technologies and use their knowledge and experience to anticipate problems within their largely static infrastructure platforms.
Today our platforms are much more complex and continuously changing – whether through continuous delivery of applications or elastically scaling infrastructure. It's no longer possible for a human to anticipate every failure scenario in these complex, continuously changing systems. The way the industry monitors and manages these platforms needs to be automated too, and that's where machine learning comes in.
Machine learning algorithms can be trained on historical platform metrics to provide insights that would be near impossible to spot through traditional dashboards and graphs. By correlating events behind the scenes, they help identify problems faster, before they become incidents and impact your customers.
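As a rough illustration – not any vendor's actual implementation – the sketch below trains an anomaly detector on historical platform metrics using scikit-learn's IsolationForest. The metrics, data and contamination rate are all hypothetical stand-ins for real telemetry.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical historical platform metrics: one row per minute,
# columns are [cpu_utilisation_pct, request_latency_ms].
rng = np.random.default_rng(42)
history = np.column_stack([
    rng.normal(40, 5, 10_000),    # CPU hovering around 40%
    rng.normal(120, 15, 10_000),  # latency hovering around 120 ms
])

# Train on normal behaviour; 'contamination' is the assumed anomaly rate.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(history)

# Score fresh metrics: a prediction of -1 flags an outlier,
# ideally before it becomes a customer-facing incident.
current = np.array([[85.0, 450.0]])  # sudden CPU and latency spike
if model.predict(current)[0] == -1:
    print("WARNING: metrics look anomalous, investigate before customers notice")
```

The design point here is that the model learns what "normal" looks like from history, rather than relying on an engineer to hand-pick static thresholds for every metric.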
In the meantime… make a plan
What's quite apparent in all these recent instances of unexpected downtime is how unprepared the affected businesses have been for their scale and severity. While machine learning solutions may be the cutting edge, they are still limited in the help they can provide, typically requiring a lot of configuration and a maths background to optimise the models and understand the output.
Prepare for the worst by identifying your backup methods, being clear about the order of priority in which they will be used, and regularly checking that the backups actually work.
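A minimal sketch of that last point follows, assuming hypothetical backup and checksum file paths. It automates three basic checks: the backup exists, it's recent, and its contents match a recorded checksum. A periodic full restore test is the stronger verification; this is merely the first line of defence.

```python
import hashlib
import time
from pathlib import Path

BACKUP_PATH = Path("/backups/db-latest.dump")       # hypothetical backup location
CHECKSUM_PATH = Path("/backups/db-latest.sha256")   # hypothetical checksum file
MAX_AGE_SECONDS = 24 * 60 * 60  # a backup older than a day counts as a failure

def backup_is_healthy() -> bool:
    """Check the backup exists, is recent, and matches its recorded checksum."""
    if not BACKUP_PATH.exists():
        return False
    if time.time() - BACKUP_PATH.stat().st_mtime > MAX_AGE_SECONDS:
        return False
    digest = hashlib.sha256(BACKUP_PATH.read_bytes()).hexdigest()
    return digest == CHECKSUM_PATH.read_text().strip()

if __name__ == "__main__":
    print("backup OK" if backup_is_healthy() else "ALERT: backup check failed")
```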
Make it clear to all the relevant people who should be contacted in case of emergency, and develop a simple checklist of issues to tackle. These kinds of measures might sound obvious, but in a crisis people can behave unpredictably, so it's important to have as many processes as possible codified and tested before the event.
Society is undergoing a moment of genuine business and social transformation: as reliance on cloud-based applications grows ever greater, backend infrastructure and uptime are becoming critical to protecting business revenue.
Each high-profile service outage causes significant brand damage, and we'll see more businesses invest serious capital in preventing these disasters from happening. The companies with the best reliability will be those that have invested in modern operational practices – a considerable competitive advantage.
Sourced by David Mytton, founder and CEO of Server Density