Monday, July 21, 2008

S3 Outage Highlights Fragility of Web Services


Amazon’s S3 cloud storage service went offline this morning for an extended period of time — the second big outage at the service this year. In February, Amazon suffered a major outage that knocked many of its customers offline.

It was no different this time around. I first learned about today’s outage when avatars and photos (stored on S3) used by Twinkle, a Twitter client for the iPhone, vanished.

My big hope was that it would come back soon, but popular S3 clients such as SmugMug were offline for more than eight hours — an awfully long time for Amazon’s Web Services division to bring back the service. As our sister blog, WebWorkerDaily, points out:

With two relatively serious outages in the space of 6 months, some will be asking the question of why depend on S3? The answer is simple: the rates are hard to beat, especially for service that doesn't require any sysadmin budget.

That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.

Update: Antonio Rodriguez, founder of Tabblo (now part of HP), asks the pertinent $64,000 question on his blog:

…if AWS is using [Amazon]’s excess capacity, why has S3 been down for most of the day, rendering most of the profile images and other assets of Web 2.0 tapestry completely inaccessible while at the same time I can’t manage to find even a single 404 on [Amazon.com]? Wouldn’t they be using the same infrastructure for their store that they sell to the rest of us?

Update #2: Building offline redundancy for Amazon S3 could be a big opportunity, Dave Winer says.
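
As a rough illustration of the kind of redundancy Winer is talking about, a client could try S3 first and fall back to a second copy kept somewhere else. The Python sketch below uses entirely hypothetical names (fetch_with_fallback, plus the stand-in s3_fetch and mirror_fetch functions) and calls no real Amazon API; it only shows the idea, not how any actual service does it.

```python
from typing import Callable, Sequence

# A "fetcher" is any function that takes an object key and returns its bytes.
Fetcher = Callable[[str], bytes]


def fetch_with_fallback(key: str, backends: Sequence[Fetcher]) -> bytes:
    """Try each backend in order and return the first successful result."""
    errors = []
    for fetch in backends:
        try:
            return fetch(key)
        except Exception as exc:  # any failure means "try the next copy"
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed for {key!r}: {errors}")


# Hypothetical stand-ins: a primary store that is down and a working mirror.
def s3_fetch(key: str) -> bytes:
    raise ConnectionError("simulated S3 outage")


MIRROR = {"avatar.png": b"mirror copy of avatar"}


def mirror_fetch(key: str) -> bytes:
    return MIRROR[key]


data = fetch_with_fallback("avatar.png", [s3_fetch, mirror_fetch])
print(data)
```

The hard part in practice is keeping the second copy in sync, which this sketch leaves out entirely.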

Update #3: A reader sent me an email and asked these two questions:

  • Is the system designed to be fault tolerant? If yes, then how did it go down? After all they must have massive arrays and mirrors of their storage infrastructure.
  • Is this a hardware failure or a software/design problem?

Random Thought: The S3 outage points to a bigger issue: the cloud has many points of failure, from routers crashing and cables getting accidentally cut to load balancers getting misconfigured or simply bad code.
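
To put a rough number on that thought: when a request has to pass through several components in a row, the chain is only up when all of them are up. The Python sketch below multiplies the individual availabilities together. The 99.9% figures are made-up placeholders, not measurements of Amazon or anyone else, and the arithmetic assumes the components fail independently of one another. With four such links the chain comes out at about 99.6%, or roughly 35 hours of downtime a year.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up at once."""
    return prod(availabilities)


def expected_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures: each link in the chain is up 99.9% of the time.
chain = {
    "router": 0.999,
    "network cable": 0.999,
    "load balancer": 0.999,
    "application code": 0.999,
}

overall = serial_availability(chain.values())
print(f"overall availability: {overall:.4%}")
print(f"expected downtime: {expected_downtime_hours(overall):.1f} hours/year")
```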