MyCustomer.com

Cloud lessons from Amazon's data centre outage

by Stuart Lauchlan
27th Apr 2011

Is Amazon's outage the end of Cloud Computing or an opportunity to learn some hard lessons? Stuart Lauchlan explores what we have learned from this extraordinary outage.

Well, that wasn't the best of weekends for Cloud Computing, was it? The blackout at Amazon's EC2 (Elastic Compute Cloud) data centre has sent the naysayers into a spin, proclaiming that what was undoubtedly a hugely embarrassing incident for Amazon exposes the fundamental flaws in the Cloud model.

In the early morning of April 21 (Pacific Daylight Time), Amazon's EC2 data centre in Virginia crashed, taking down with it several popular websites and small businesses, including social networking providers such as Evite, Quora, Reddit and Foursquare. As of Sunday, most of its cloud customers and services were back on track, according to Amazon's AWS Service Health Dashboard.

After a weekend of being "all hands on deck", Amazon noted: "As we posted last night, EBS is now operating normally for all APIs and recovered EBS volumes. The vast majority of affected volumes have now been recovered. We're in the process of contacting a limited number of affected customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes... We are digging deeply into the root causes of this event and will post a detailed post mortem."

Reddit was among the sites crippled, posting on its homepage: "Reddit is in 'emergency read-only mode' right now because Amazon is experiencing a degradation. They are working on it but we are still waiting for them to get to our volumes. There is no ETA at this time, but we are trying to work some magic and will very slowly be bringing the site back up. Please stand by."
Left in the dark
A seemingly common complaint from customers was that Amazon failed to address the transparency issues surrounding the outage, effectively leaving them in the dark. One complained: "A note to the Amazon customer service team. If you are going to post something, posting something meaningful or some sort of useful bit of information not just regurgitated and vague automated replies that just makes us even more confused and hopeless. Understand that this is a major outage and lots of us are losing money because of the downtime and VERY LIKELY to switch Cloud providers once this is resolved!"
But what the outage really illustrates again is that Cloud is not a silver bullet. You don't just hand over your money and forget about your systems. It just doesn't work like that. What should now come to the forefront of everyone's minds is the ongoing need to address issues such as Service Level Agreements, availability and back-up.
This shouldn't really come as a surprise, of course. Indeed Amazon's own Web Hosting Best Practices guidelines are very hot on this: "Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. As can be seen in the AWS web hosting architecture, it is recommended to spread EC2 hosts across multiple Availability Zones since this provides for an easy solution to making your web application fault tolerant. Care should be taken to make sure that there are provisions for migrating single points of access across Availability Zones in the case of failure."
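To make that advice concrete, here is a minimal Python sketch of what "spreading hosts across multiple Availability Zones" means in practice. It does not call any AWS API; the host names, zone names and both helper functions are hypothetical and exist only to illustrate the idea that losing one zone should never take out every host.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(hosts, zones):
    """Assign hosts to zones round-robin so no zone holds more than its share."""
    if not zones:
        raise ValueError("at least one zone is required")
    return dict(zip(hosts, cycle(zones)))


def surviving_hosts(placement, failed_zone):
    """Return the hosts still running if one whole zone goes dark."""
    return [host for host, zone in placement.items() if zone != failed_zone]


# Hypothetical host and zone names, for illustration only.
hosts = ["web-1", "web-2", "web-3", "web-4", "web-5", "web-6"]
zones = ["zone-a", "zone-b", "zone-c"]

placement = spread_across_zones(hosts, zones)
per_zone = Counter(placement.values())          # how many hosts sit in each zone
survivors = surviving_hosts(placement, "zone-a")  # what is left if zone-a fails
```

With everything in a single zone, `surviving_hosts` returns an empty list for that zone, which is the "all our eggs in one basket" situation.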
And while the outage clearly crippled and inconvenienced many customers, others were able to weather the storm perfectly well. Netflix, for example, was quoted in the US press as coming through unscathed. A spokesman commented: "That’s because Netflix has taken full advantage of Amazon Web Services’ redundant Cloud architecture."
Lessons to be learned
Others who were knocked out acknowledged that there were lessons to be learned from this incident. Paul Smith of the technical services team at hyperlocal news site EveryBlock admitted: "While the acute problem originated with AWS, EveryBlock is not without blame for this downtime. Frankly, we screwed up. AWS explicitly advises that developers should design a site’s architecture so that it is resilient to occasional failures and outages such as what occurred yesterday, and we did not follow that advice...We put all our eggs in one basket, and that basket got knocked over. All of our servers and related resources were running in the same...availability zone (AZ)...had we deployed our various servers across multiple AZs and taken into account the fact that individual servers and other services that AWS provide can and do go down from time to time, we would likely have remained available during this disruption."
It's commendable honesty from EveryBlock and the lessons that this firm has clearly taken on board will serve it well in the future.
"We will be setting up servers in multiple AZs, and designing how they interconnect in such a way that, if one or more servers goes down, or even an entire AZ, the other servers will be able to pick up the slack and continue serving EveryBlock to users. We’re also challenging our assumptions about the various bits of our site and how they work together. Often times during development of a site like EveryBlock, you make choices that in the moment seem expedient, but actually introduce too strong a dependency between different parts of the site, making it hard to stay up when one of them goes down. Software people refer to this as tight coupling, and it’s better, in a distributed server environment like AWS as in software, to be loosely coupled. That requires us revisiting some design decisions and will take a little time to roll out."
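The "pick up the slack" behaviour described in that statement can be sketched in a few lines of Python. This is an illustration under stated assumptions, not EveryBlock's actual design: the `ZoneDown` exception, the `serve_with_failover` function and the two stand-in handlers are all hypothetical. The point is the loose coupling: the caller depends only on "something that can answer a request", not on any one zone being up.

```python
class ZoneDown(Exception):
    """Raised by a backend when its zone cannot serve the request."""


def serve_with_failover(request, backends):
    """Try each (zone, handler) pair in order; return the first success.

    Raises RuntimeError only if every zone fails.
    """
    errors = []
    for zone, handler in backends:
        try:
            return zone, handler(request)
        except ZoneDown as exc:
            # Record the failure and move on to the next zone.
            errors.append(f"{zone}: {exc}")
    raise RuntimeError("all zones failed: " + "; ".join(errors))


def healthy(request):
    """Stand-in for a server that is up."""
    return f"served {request}"


def broken(request):
    """Stand-in for a server whose zone is suffering an outage."""
    raise ZoneDown("volume unavailable")


# Hypothetical zone names; zone-a is down, zone-b takes over.
backends = [("zone-a", broken), ("zone-b", healthy)]
zone, response = serve_with_failover("/home", backends)
```

If every backend in the list lives in the same zone, the fallback loop has nothing to fall back to, however neatly the code is written.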
Nor has EveryBlock's experience led it to doubt the validity of Cloud Computing or Amazon's offering; the events of the weekend have not diminished the firm's faith in the Cloud: "We’d like to give an unequivocal endorsement of AWS. It is a terrific service, and we love being able to set up new servers as needed, and with great flexibility. Web site hosting is an art, and sites and hosting providers do go down from time to time. Overall, AWS has provided EveryBlock consistent, reliable service."
There will be a lot of recrimination in the days and weeks to come, and the naysayers will be honing their anti-Cloud spin for some time yet. The words of firms such as EveryBlock will hopefully counter some of this, but for some good to come from this debacle it's to be hoped that organisations everywhere take a long hard look at their perceptions of the Cloud. If you think Cloud is a silver bullet, if you think Cloud means you can cut out your ICT department, if you think that Cloud is a way out of having to manage your own systems, then think again.
Three main lessons among the many to be learned:
  • You need to look long and hard at your SLAs.
  • You need to address issues such as availability and back-up and for that you're going to need skilled ICT people.
  • And if you don't follow your Cloud service provider's advice, then don't complain when you get your fingers burned.
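On the first of those lessons, it helps to translate an SLA percentage into hours. The short Python sketch below does that arithmetic; the availability targets in it are illustrative figures only and are not taken from Amazon's or any other provider's actual SLA.

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime a given availability target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


# Illustrative targets only; check the figures in your own contract.
for target in (99.0, 99.9, 99.95):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.1f} hours down per year")
```

Seen this way, the gap between two targets that look almost identical on paper can be the difference between hours and days of permitted downtime.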

Replies (1)


Salil Rajhans
By salilrajhans
28th Apr 2011 13:29

I am fairly new to the cloud, so pardon me if I sound naive here. My impression of the cloud is that it is a solution for mid-tier businesses to deploy applications with the aim of reducing their capex and opex. If they have to build redundancy into the cloud to be robust, will these solutions continue to be lucrative?
