In the spirit of the year-end listing ritual, here are the Top 5 web services outages of the year, in order of magnitude.


  1. GoDaddy
  2. Amazon Web Services (December 24)
  3. Amazon Web Services (October 22)
  4. Amazon Web Services (June 29)
  5. Google App Engine (October 26)


Yes, the list raises the obvious question: what is going on with the AWS infrastructure?


All three of the AWS outages on the list involved what Amazon calls its “US-East-1” region of 10 datacenters. Implicitly, given the duration of the outages, AWS as currently deployed in this region does not allow for agile site switchovers. My guess is that it is not so much the server roles that cannot be quickly switched over between datacenters as the datasets, which are probably often some hours behind in being replicated.
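
If replication lag really is the constraint, the go/no-go decision for a switchover comes down to asking the standby copy of the data how stale it is. The sketch below shows one generic way to do that, assuming a PostgreSQL streaming-replication standby and the psycopg2 driver; the connection string and the five-minute tolerance are hypothetical, and Amazon's actual replication machinery is of course not public.

    # A minimal sketch, not Amazon's method: ask a PostgreSQL streaming-
    # replication standby how far behind the primary its dataset is, and
    # decide whether a switchover would serve acceptably fresh data.
    import psycopg2  # assumes the psycopg2 driver is installed

    STANDBY_DSN = "host=standby.example.internal dbname=app user=monitor"  # hypothetical
    MAX_ACCEPTABLE_LAG_S = 300  # assumed tolerance before the standby is "too stale"

    def replication_lag_seconds(dsn):
        """Return how long ago (in seconds) the standby replayed its last transaction."""
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
                )
                row = cur.fetchone()
                if row is None or row[0] is None:
                    raise RuntimeError("server is not a standby or has replayed nothing yet")
                return float(row[0])

    if __name__ == "__main__":
        lag = replication_lag_seconds(STANDBY_DSN)
        verdict = "safe to switch over" if lag <= MAX_ACCEPTABLE_LAG_S else "too stale"
        print("standby is %.0f seconds behind: %s" % (lag, verdict))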


The main point seems to be that the infrastructure for AWS is so large that any outage casts a large shadow.  As Amazon states in their incident report: “Though the resources in this datacenter, including Elastic Compute Cloud (EC2) instances, Elastic Block Store (EBS) storage volumes, Relational Database Service (RDS) instances, and Elastic Load Balancer (ELB) instances, represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers.”


To give an idea of the overall scale involved, the Elastic Compute Cloud (EC2) at the heart of AWS uses at least half a million servers to handle its clients’ web traffic and transactions.


Sharpening Your Emergency Responses

The quicker you know about trouble in your datacenters, the better your chances of minimizing the associated downtime. A good user simulation agent will tell you when your web services are degrading, and system, storage, and network monitoring should help you quickly isolate the source.
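
As a crude illustration of the idea, the sketch below polls a single URL the way a user-facing check might and flags the service as degraded when the response is slow or not a 200. The URL, latency threshold, and polling interval are hypothetical placeholders; a real user simulation agent would exercise complete user flows from multiple locations and feed an alerting system rather than printing to a console.

    # A minimal sketch of a synthetic "user simulation" check, not a real
    # monitoring agent.  The URL, latency threshold, and polling interval
    # below are hypothetical placeholders.
    import time
    import urllib.request

    CHECK_URL = "https://www.example.com/health"  # hypothetical endpoint
    LATENCY_THRESHOLD_S = 2.0                     # assumed acceptable response time
    CHECK_INTERVAL_S = 60

    def check_once(url):
        """Fetch the URL once; return (healthy, elapsed_seconds)."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                elapsed = time.monotonic() - start
                healthy = resp.status == 200 and elapsed < LATENCY_THRESHOLD_S
                return healthy, elapsed
        except Exception:
            return False, time.monotonic() - start

    if __name__ == "__main__":
        while True:
            healthy, elapsed = check_once(CHECK_URL)
            state = "OK" if healthy else "DEGRADED"
            print(time.strftime("%Y-%m-%d %H:%M:%S"), state, "%.2fs" % elapsed)
            time.sleep(CHECK_INTERVAL_S)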