Amazon S3 outage spotlights disaster recovery tradeoffs

Tuesday’s Amazon S3 outage reverberated around the Internet, but cost and complexity will likely keep many users from scrambling to change their redundancy practices.

One of the largest service disruptions to hit AWS in years would have had far less impact if customers added more redundancy safeguards, but many cloud customers are only willing to go so far to keep their workloads running seamlessly.

Companies can implement myriad contingencies to safeguard against massive cloud outages. AWS added Cross-Region Replication in 2015; IT shops also can rely on a range of disaster recovery as a service tools on the market. There also are techniques to spread workloads across regions and to back them up in other public clouds or on-premises. Netflix, which as recently as last year said it used U.S. East-1, champions several of these techniques and reported no issues Tuesday.

But the Amazon S3 outage, which lasted four hours in the U.S. East-1 region, managed to take down or slow huge chunks of the Internet Tuesday after Amazon Simple Storage Service (S3) became unresponsive because of human error and outdated debugging techniques. It also had a knock-on effect that took down multiple other services that depend on S3. The net result was 54 of the top 100 Internet retailers saw a 20% or greater decrease in performance, according to Apica, a Web monitoring provider, and it cost S&P 500 index companies $150 million, according to cyber-risk startup Cyence Inc.

And a high number of companies that didn’t fail over or couldn’t maintain services without interruption suffered. Nike, which has given talks at AWS conferences on security and redundancy, saw load times on its website increase by 642%, according to Apica. There were reports that Apple’s iCloud experienced slower performance, even though the product reportedly relies on Microsoft Azure and Google Cloud Platform, too.

There’s an inherent leap of faith that comes with passing uptime responsibility to a cloud vendor, but the far-reaching effects of this incident show many customers are willing to work without a net, to a certain degree, after weighing the cost and complexity that come with such high levels of redundancy.

“Where do you want to draw the line on redundancy and uptime?” asked Craig Loop, director of technology at Realty Data Co., a Naperville, Ill., financial services company. “It literally is a dial-your-redundancy system, and all you have to do is throw money at it.”

Realty Data previously considered using multiple regions, but ultimately decided to stick with U.S. East-1 because of the additional cost and development work required to prepare for outages that happen once every couple of years.

There certainly are companies that require that level of uptime, and doing it through AWS is considerably cheaper than more traditional methods, said Carl Brooks, an analyst with 451 Research. But many users decide living through the occasional outage is part of the cost of doing business.

“It might cost $500,000 to implement multi-region stability with high availability and AWS best practices, but a four-hour outage may cost you $60,000,” Brooks said.
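Brooks’ figures can be turned into a rough break-even calculation. The sketch below is illustrative only: it uses his $500,000 and $60,000 numbers, assumes one comparable outage every two years (borrowing Realty Data’s "once every couple years" estimate, which Brooks did not state), and ignores the ongoing cost of running a second region. The function names are hypothetical.

```python
def break_even_outages(implementation_cost: float, cost_per_outage: float) -> float:
    """Number of outages that must be avoided before the upfront spend pays off."""
    return implementation_cost / cost_per_outage


def break_even_years(implementation_cost: float,
                     cost_per_outage: float,
                     outages_per_year: float) -> float:
    """Years until avoided outage losses equal the upfront spend."""
    return break_even_outages(implementation_cost, cost_per_outage) / outages_per_year


IMPLEMENTATION_COST = 500_000  # Brooks' illustrative multi-region figure
OUTAGE_COST = 60_000           # Brooks' illustrative four-hour outage cost
OUTAGES_PER_YEAR = 0.5         # assumption: one such outage every two years

outages = break_even_outages(IMPLEMENTATION_COST, OUTAGE_COST)
years = break_even_years(IMPLEMENTATION_COST, OUTAGE_COST, OUTAGES_PER_YEAR)

print(f"Outages needed to break even: {outages:.1f}")
print(f"Years to break even at {OUTAGES_PER_YEAR} outages/year: {years:.1f}")
```

Under these assumptions, a company would need to avoid more than eight such outages, or wait well over a decade, before the investment paid for itself. A business with a much higher cost per outage would reach a very different answer.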

Companies’ responses to this latest downtime may also reflect the type of workloads that reside in the cloud. For all its exponential growth and exciting new capabilities, public cloud largely remains the domain of test and development, startups and websites — many of which may be willing to stomach an outage in a way that would be unacceptable with traditional mission-critical applications.

“I don’t think anybody died or we didn’t get to the moon,” said Jason McMunn, of Transfigure Partners, a cloud migration company in Philadelphia. “It’s real first-world problems where I couldn’t load my GIFs on Slack.”

Still, the service disruption did affect McMunn. He was in the middle of a sales demo that relies on S3 when the service cut out. It was a potential client he’d spent considerable time trying to meet.

“We finally got a chance to showcase this toolset to them and, thanks to AWS, we just looked like idiots,” McMunn said.

The company also relies on S3 for all its DevOps projects, and there was a cascading effect where developers had to revert to email to manually send updates to each other. He estimates the S3 problem will translate to 50 person-hours of work, but it would take a multi-day outage or the loss or destruction of data to get him to move off of the public cloud.

“I feel like my gold was still safe in Fort Knox,” he said. “I just couldn’t get at it.”

There’s also a psychological component to how businesses react to cloud outages. When downtime is isolated to a single company’s data center, IT pros become the bad guy. But that’s not the case when everyone else is down, too, Loop said.

“It has an effect where we’re all in this together so people aren’t so upset and animated about it,” Loop said. “Now it’s more — let me know what I can do and get it back up. It’s changed what an outage means.”

To replicate or not to replicate
These types of incidents serve as a reminder to architect environments in ways that best protect workloads from downtime, even if that doesn’t include cross-region replication, said Kevin Felichko, CTO of PropertyRoom.com. The online auction company houses the majority of its workloads in U.S. West-2 and noticed no issues with its production workloads. The biggest impact it saw was to some test and development in U.S. East-1 and to third-party support services.

PropertyRoom.com moved to AWS more than three years ago and opted to use the West Coast region, despite being based in Frederick, Md. U.S. East-1 may have had quicker access to new features and likely would have provided better performance due to proximity, but there was also a much higher rate of problems with the congested region, Felichko said.

“It validates us not putting mission-critical in U.S. East-1,” Felichko said. “[AWS] rolls out nice features there and they’ve got strong presence there, but it’s not as stable as other regions.”

Replication isn’t foolproof, either; companies could overlook certain scripts that reside in only one region, or they may have a central source for authoritative transactions in a single region. Even if a company has replication policies in place, it still may not conduct a failover. Financial Industry Regulatory Authority Inc., which AWS cited as a reference customer when multi-region capabilities were added, opted to ride out the AWS S3 outage.

“We were in communication with AWS during the outage, were confident that recovery was near and opted not to fail over to another region given the brief impact the outage had on our operations,” a FINRA spokesperson said.

ACI Information Group, an aggregator of social media and blogs, has all its workloads in U.S. East-1, but used to have servers on the West Coast for replication. The company ultimately terminated those instances.

“Duplicate news stories are a big deal for us, and split regions had duplicates all the time,” said Chris Moyer, vice president of technology at ACI Information Group and a TechTarget contributor. “That was more likely than Amazon failing, so it’s a toss-up and comes back to the whole question about security versus making it easier for users.”

At one point, ACI looked at Cross-Region Replication for failover but was dissuaded by AWS, which told them the service was better suited for getting data closer to users because the likelihood of a region-wide outage was so small, Moyer said.

“They told us don’t worry about it,” he said.
