Internet outages are increasing in frequency and severity, as shown by the Amazon S3 outage in February 2017, Dyn’s outage in October 2016, the Amazon Web Services (AWS) outage in June 2016, and the 24-hour Salesforce outage in May 2016.
The most recent of these outages, involving Amazon’s S3 storage solution, was caused by something people in computer operations are familiar with: operator error.
The two-and-a-half-hour outage began with increased error rates for requests in the US-EAST-1 region. It eventually affected websites and applications including Slack, Trello, Netflix, Reddit and Quora.
Achieving five nines of uptime (99.999%) doesn’t happen without careful planning. There are best practices that application developers should keep in mind to greatly reduce the likelihood of a business-impacting outage.
The risk of single-provider cloud solutions
Cloud computing offers significant advantages, including cost savings and scalability versus the traditional data centre. Major industry players like AWS and Microsoft entice customers to bring an existing application stack to their bundled cloud suite of app, compute, storage, network, and analytics.
The convenience of this single-provider solution is outweighed by the risk of having a single point of failure in your application stack.
Consolidation among cloud service providers adds to this risk, as a distributed, multi-vendor solution can suddenly become a single-vendor solution through mergers and acquisitions.
Diversifying the technology stack away from a single-threaded set-up is becoming not only more difficult, but also more important, if you want to avoid a potential single point of failure.
Major outages are no longer a rare occurrence and the financial and reputational impacts of outages are well documented.
Gartner’s Andrew Lerner estimates that downtime costs enterprises, on average, $300,000 per hour, though the figure may be much higher depending on your business.
Using this estimate, the AWS S3 outage cost each affected customer roughly $750,000 for two and a half hours of downtime. Last year, Dyn’s 18-hour outage, caused by the massive Mirai botnet, likely cost the company’s customers around $5.4 million each.
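The arithmetic behind those figures is simple enough to sketch. In the snippet below, the hourly rate is the Gartner average cited above and the durations are those of the S3 and Dyn incidents; the result is an illustration, not a measurement of any particular business.

```python
# Illustrative downtime-cost arithmetic using the Gartner average above.
# The hourly figure is an industry estimate, not data for any specific company.

COST_PER_HOUR = 300_000  # average enterprise downtime cost in USD

def downtime_cost(hours: float, cost_per_hour: float = COST_PER_HOUR) -> float:
    """Estimated cost of an outage lasting `hours`."""
    return hours * cost_per_hour

print(downtime_cost(2.5))  # S3 outage:  750,000.0
print(downtime_cost(18))   # Dyn outage: 5,400,000.0
```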
Four components for high availability
The goal is to build and maintain a highly available application. There are essentially four core components involved:
Redundancy: Having secondary systems in place to take over when a primary system that performs the same function goes down.
Routing: The ability to direct traffic to the optimal endpoint, and to reroute it to another endpoint if the primary one is unavailable.
Scale: The ability to automatically provision new infrastructure based upon load.
Backup and recovery: Having the ability to restore data, configuration, and other functions to a pre-event state.
Best practices for building a high-availability cloud application
1. Carefully choose which components of the stack should be the first to migrate to high availability.
Building high availability into the application stack brings significant technical, operational and financial challenges. This makes some components better candidates than others to operate in a redundant or high-availability set-up.
Application developers should develop scoring mechanisms to determine which components to address first, based upon the factors below (a simple scoring sketch follows the list):
• Impact that a failure will have on users, the application and the business.
• Time, cost and ease to complete the project.
• Likelihood that the component will fail.
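One way to operationalise this is a simple weighted score. The sketch below is illustrative only; the 1-5 scales, the weights and the example components are assumptions, not a prescribed methodology.

```python
# Hypothetical weighted scoring for prioritising which components to move to a
# high-availability set-up first. Weights and example scores are illustrative.

WEIGHTS = {"impact": 0.5, "effort": 0.2, "failure_likelihood": 0.3}

def priority_score(impact: int, effort: int, failure_likelihood: int) -> float:
    """Each input is rated 1 (low) to 5 (high); higher scores migrate first.

    Effort is inverted so that cheaper, easier projects rank higher.
    """
    return (WEIGHTS["impact"] * impact
            + WEIGHTS["effort"] * (6 - effort)
            + WEIGHTS["failure_likelihood"] * failure_likelihood)

components = {
    "object storage": priority_score(impact=5, effort=3, failure_likelihood=4),
    "internal wiki":  priority_score(impact=2, effort=2, failure_likelihood=3),
}
for name, score in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")  # object storage ranks first in this example
```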
2. Manage third-party risk.
Cloud computing is still a relatively young technology, and it creates an interesting situation: your cloud application is very likely reliant on other cloud services in order to serve your customers.
You may architect your application with the proper routing, scale, redundancy, and backup and recovery systems, but if your application leverages a cloud service (and it is likely that it does), that cloud service also needs to follow established best practices for high availability, or your application will go down with it.
From the perspective of your customers, risks inherited from cloud service providers (CSPs) that impact your application are third-party risks. Managing third-party risk, in many ways, comes down to trust.
When your application goes offline due to your CSP, you lose trust in your provider – and your customers lose trust in you. Invest in network and application monitoring solutions that can help you evaluate your CSP’s performance.
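Even a basic availability probe, run from your own network, can surface a CSP problem before your customers report it. Below is a minimal sketch using only the Python standard library; the endpoint URL, polling interval and latency threshold are placeholders, not a recommendation of any particular tool.

```python
# Minimal availability probe for a cloud service your application depends on.
# The URL, threshold and alerting hook are placeholders for illustration.

import time
import urllib.request
import urllib.error

ENDPOINT = "https://example-csp.invalid/healthz"  # hypothetical health URL
LATENCY_THRESHOLD = 2.0                           # seconds

def probe(url: str) -> tuple[bool, float]:
    """Return (healthy, latency_seconds) for a single HTTP check."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            healthy = 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        healthy = False
    return healthy, time.monotonic() - start

if __name__ == "__main__":
    while True:
        ok, latency = probe(ENDPOINT)
        if not ok or latency > LATENCY_THRESHOLD:
            print(f"ALERT: CSP degraded (ok={ok}, latency={latency:.2f}s)")
        time.sleep(60)  # poll once a minute
```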
3. Determine what should be in the cloud, what belongs in a hybrid set-up, and what should remain on-premises.
Cloud computing offers many benefits, but not all systems and data should be deployed onto a cloud solution. Your business may have compliance or regulatory considerations, sensitive data, or simply a need for more control over your data. This may mean that parts of your application need to be set up in a more traditional or hybrid model.
4. Prepare for failures across several levels.
In the February outage, only a single AWS service in a single region experienced a disruption. However, the impact was widespread. Preparing for failures at different levels of your architecture can help you avoid an outage.
Servers, even with proper maintenance, can go down. Ensure you have auto-scaling, internal load balancing and database mirroring in place.
To prepare for zone failures, make sure you have at least two zones and that you are replicating data across them. In addition, global load balancing can help you automatically route traffic away from a failed zone to a healthy one.
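The decision logic behind that kind of zone failover is conceptually simple. The sketch below uses made-up zone names and addresses and a stubbed health check; real global load balancers implement this at the DNS or anycast layer rather than in application code.

```python
# Hypothetical zone-failover routing: prefer the primary zone, fall back to the
# next healthy zone. Zone names and addresses are invented for illustration.

ZONE_ENDPOINTS = [              # ordered by preference
    ("us-east-1a", "10.0.1.10"),
    ("us-east-1b", "10.0.2.10"),
]

def healthy(zone: str) -> bool:
    """Placeholder health check; wire this to real probes in practice."""
    return zone != "us-east-1a"  # simulate the primary zone being down

def pick_endpoint() -> str:
    for zone, address in ZONE_ENDPOINTS:
        if healthy(zone):
            return address
    raise RuntimeError("no healthy zone available")

print(pick_endpoint())  # -> 10.0.2.10 while us-east-1a is down
```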
Preparing for cloud failures can be more challenging. If your cloud solution is a single point of failure for your network and it goes down, you go down.
Where possible, implement a hybrid IT set-up for critical services and, where technically and financially feasible, have a backup cloud service provider.
5. Use a DNS provider with health checks and traffic management capabilities.
DNS uses domain names, like “google.com,” to send users and application traffic to the proper endpoint so people don’t have to remember the strings of numbers that make up IP addresses.
Intelligent DNS solutions can be used to dynamically shift traffic based on real-time user, network, application, and infrastructure telemetry – including if a component of your infrastructure, like AWS S3, goes down. Intelligent DNS will ingest the telemetry and automatically reroute the application’s traffic to a healthy IP address.
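As a rough illustration of that steering logic, the toy model below answers with the healthy endpoint that currently has the best latency. The endpoints and telemetry values are invented, and production systems do this inside the authoritative DNS layer rather than in application code.

```python
# Toy model of telemetry-driven DNS traffic steering: answer queries with the
# healthy endpoint that currently has the best latency. All values are invented.

ENDPOINTS = {
    "198.51.100.10": {"healthy": True,  "latency_ms": 45},  # primary origin
    "203.0.113.20":  {"healthy": True,  "latency_ms": 80},  # secondary origin
    "192.0.2.30":    {"healthy": False, "latency_ms": 30},  # failed endpoint
}

def resolve() -> str:
    """Return the IP a health-aware DNS service would hand back right now."""
    candidates = {ip: t for ip, t in ENDPOINTS.items() if t["healthy"]}
    if not candidates:
        raise RuntimeError("no healthy endpoints: serve last known good answer")
    return min(candidates, key=lambda ip: candidates[ip]["latency_ms"])

print(resolve())  # -> 198.51.100.10; flips to 203.0.113.20 if it goes unhealthy
```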
DNS with health checks and automated traffic management capabilities should be a component of your technology stack. Additionally, make sure your DNS does not itself become a single point of failure. To build a truly highly available cloud application, you need to architect it with a redundant DNS set-up.
6. Avoid the single point of failure with redundant DNS.
Next-generation managed DNS systems offer significant built-in redundancy and fault tolerance, but every managed DNS provider has experienced problems to some degree that affect its customers.
While it rarely happens, providers can suffer a complete loss of service, and it is often the enterprises that have experienced a loss of DNS that then decide to bring on a second source.
In short, no system is failure-proof, so from the point of view of a subscribing enterprise, their managed DNS does represent a single point of failure. The question every enterprise should address is whether bringing in a second DNS service is worth the effort and cost.
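Part of that effort is operational: once two providers serve your zone, you need to keep their answers in sync. Below is a minimal consistency check, assuming the third-party dnspython library and placeholder nameserver addresses and record names.

```python
# Compare answers from two managed DNS providers for the same record.
# Requires dnspython (pip install dnspython); the nameserver IPs and the
# record name are placeholders for illustration.

import dns.resolver

PROVIDERS = {
    "provider-a": "198.51.100.53",  # hypothetical provider A nameserver
    "provider-b": "203.0.113.53",   # hypothetical provider B nameserver
}

def answers(nameserver: str, name: str, rdtype: str = "A") -> set[str]:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    return {rr.to_text() for rr in resolver.resolve(name, rdtype)}

results = {label: answers(ip, "www.example.com") for label, ip in PROVIDERS.items()}
if len(set(map(frozenset, results.values()))) == 1:
    print("providers are in sync:", results)
else:
    print("MISMATCH between providers:", results)
```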
The industry norm for managed DNS provider availability exceeds 99.999% uptime – about five minutes of downtime per year. However, this top-line number does not provide the detail needed to properly assess the business risk of relying on a sole-source provider.
It is not clear, for example, what the probability and impact are of degraded performance in certain regions, or of a system-wide outage of varying duration. Enterprises should look at this scenario from their own perspective.
Think about what a 30-minute loss of DNS would cost your business in terms of revenue, reputation damage, support costs, and recovery. Compare that with the cost of a second-source DNS.
The cost of one outage among enterprises for whom online services are mission-critical is roughly 10 times the annual cost of a second service. That would put the break-even point at about one major DNS outage every 10 years.
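A quick worked example makes the break-even point concrete; both figures below are illustrative assumptions, not vendor quotes.

```python
# Back-of-the-envelope break-even for adding a second DNS provider.
# Both figures are illustrative assumptions.

outage_cost = 0.5 * 300_000    # one 30-minute outage at $300k/hour = $150,000
secondary_annual_fee = 15_000  # assumed yearly cost of a second DNS service

years_to_break_even = outage_cost / secondary_annual_fee
print(years_to_break_even)     # 10.0 -> one major outage per decade pays for it
```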
Summary
Avoiding single points of failure in your cloud implementation, your application stack, or your DNS service is essential to keeping your business up and running online.
There are no guarantees, but there are best practices. Keep these in mind as you move forward with a migration to cloud technology:
• Redundancy to avoid any single point of failure.
• Cloud solutions need to be high availability and include the recommended architecture for routing, scale, redundancy, and backup and recovery.
• Be aware of potential areas of third-party risk and mitigate them.
• Prepare for an outage at any level of your organisation or cloud implementation.
Sourced by Alex Vayl, co-founder and vice president of business development, NS1