Amazon S3 services down today — ugh

July 20, 2008

It seems every few days or weeks we have a reminder that cloud services are experiencing growing pains. I wrote about this topic a few weeks ago, and also about an incident with Amazon. Today, Amazon S3 services are down.

In fact, a nice reminder from the folks at WordPress tells me that this impacts me directly on my blog:

Notice: Because of an issue with Amazon’s S3 service, some images maybe unavailable for viewing after upload. We are working on resolving this issue and will update the following forum thread with more information as it is available.

The AWS Service Health Dashboard shows a lot of useful information for people following the outage. I give Amazon a lot of credit for trying to be as transparent as possible, but it is troubling that a multi-hour outage takes so much investigation just to figure out “the why.” In the meantime, users are left without a reliable backup service and must scramble to fix things locally. Perhaps someone needs to publish a best practice for keeping a local copy of your data that is regularly synchronized with the cloud, so that you at least control your own backup. Otherwise, many services and websites that store content and data in S3 are simply broken or offline until it comes back to normal.
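
To make the local-copy idea a little more concrete, here is a rough sketch in Python. It is only an illustration under some assumptions of mine, not an established best practice and not an Amazon API: the names `sync_local_mirror` and `read_with_fallback` are hypothetical, the cloud store is represented by a plain `fetch` function that you would supply yourself, and `remote_index` is assumed to be a mapping from object key to a SHA-256 digest that you maintain on your own.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def sync_local_mirror(
    remote_index: Dict[str, str],
    fetch: Callable[[str], bytes],
    mirror_dir: Path,
) -> List[str]:
    """Download every object whose local copy is missing or out of date.

    remote_index maps an object key to its expected SHA-256 hex digest
    (an assumption of this sketch). fetch(key) is a caller-supplied
    function that returns the object's bytes from the cloud store.
    Keys are assumed to be trusted relative paths.
    Returns the keys that were (re)downloaded.
    """
    updated = []
    for key, digest in remote_index.items():
        target = mirror_dir / key
        # Skip objects whose local copy already matches the expected digest.
        if target.is_file() and sha256_hex(target.read_bytes()) == digest:
            continue
        data = fetch(key)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        updated.append(key)
    return updated


def read_with_fallback(
    key: str,
    fetch: Callable[[str], bytes],
    mirror_dir: Path,
) -> bytes:
    """Try the cloud store first; on an I/O or connection error,
    serve the (possibly stale) copy from the local mirror instead."""
    try:
        return fetch(key)
    except OSError:
        return (mirror_dir / key).read_bytes()
```

Run on a schedule (from cron, say), `sync_local_mirror` would keep the mirror reasonably fresh, and `read_with_fallback` would let a site keep serving the last synchronized copy during an outage like today's. The obvious trade-off is that the local copy can be stale, and uploads made during the outage still have to be queued somewhere.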

The S3 details from the AWS Service Health Dashboard

9:05 AM PDT We are currently experiencing elevated error rates with S3. We are investigating.
9:26 AM PDT We’re investigating an issue affecting requests. We’ll continue to post updates here.
9:48 AM PDT Just wanted to provide an update that we are currently pursuing several paths of corrective action.
10:12 AM PDT We are continuing to pursue corrective action.
10:32 AM PDT A quick update that we believe this is an issue with the communication between several Amazon S3 internal components. We do not have an ETA at this time but will continue to keep you updated.
11:01 AM PDT We’re currently in the process of testing a potential solution.
11:22 AM PDT Testing is still in progress. We’re working very hard to restore service to our customers.
11:45 AM PDT We are still in the process of testing a series of configuration changes aimed at bringing the service back online.
12:05 PM PDT We have now restored communication between a small subset of hosts. We are working on restoring internal communication across the rest of the fleet. Once communication is fully restored, then we will work to restore request processing.
12:25 PM PDT We have restored communication between additional hosts and are continuing this work across the rest of the fleet. Thank you for your continued patience.
12:51 PM PDT The restored hosts are stable and we are moving forward in restoring communication between additional hosts.
1:17 PM PDT We continue to make incremental progress and communication between additional hosts has been restored. We are continuing with the plan to restore communication across Amazon S3’s large fleet of hosts.
1:38 PM PDT At this point, we are accelerating progress on restoring internal communication as all signs continue to look good.
2:03 PM PDT We have restored all internal communication between hosts in the EU and we are continuing to make progress in the US. Once all internal communication has been restored, we will start a multi-step process to begin accepting requests across Amazon S3 locations.
2:19 PM PDT A quick update to let you know that we have now also restored all internal communication between hosts in our West Coast facilities in the US.
2:36 PM PDT We have restored all internal communication across Amazon S3 hosts. We have started the multi-step process to begin accepting requests across Amazon S3 locations.