A Summary of the Amazon Web Services June 29 Outage
Summary of the AWS Service Event in the US East Region
...from Amazon <https://aws.amazon.com/message/67457/>
The event was triggered during a large-scale electrical storm that swept through the Northern Virginia area.
Though the resources in this datacenter, including Elastic Compute Cloud (EC2) instances, Elastic Block Store (EBS) storage volumes, Relational Database Service (RDS) instances, and Elastic Load Balancer (ELB) instances, represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers. The impact manifested in two forms. The first was the unavailability of instances and volumes running in the affected datacenter. This kind of impact was limited to the affected Availability Zone. Other Availability Zones in the US East-1 Region continued functioning normally. The second form of impact was degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region. While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another.
Systems Affected
- Elastic Compute Cloud (EC2)
- Elastic Block Store (EBS)
- Relational Database Service (RDS)
- Elastic Load Balancer (ELB)
- ElastiCache
- Elastic MapReduce
- Elastic Beanstalk
Timeline - June 29-30, 2012
Time (PDT) | System | Event
8:04 PM | All | Servers began losing power
8:21 PM | All | Amazon status update: "We are investigating connectivity issues for a number of instances in the US-EAST-1 Region"
9:10 PM | Control plane | Control plane functionality was restored for the Region
10:00 PM | RDS | A large number of the affected Single-AZ RDS instances had been brought online
11:00 PM | RDS | The remaining Multi-AZ RDS instances were processed as EBS volume recovery completed for their storage volumes
11:15 PM to just after midnight | EC2 | Instances came back online
2:45 AM (June 30) | EBS | 90% of outstanding volumes had been turned over to customers
Note: Amazon seems strangely ambiguous about the timing of the ELB outage.
Summary of Amazon control plane issues
Why is this important?
- Applications deployed to AWS are expected to be designed for failure if they are to remain resilient in the face of an outage. Such fault-tolerant designs rely on several capabilities provided by the control plane, so when the control plane fails, even the plans to mitigate failure can fail (see the sketch after the details below).
Details:
- Degradation of the service “control planes,” which allow customers to take action and create, remove, or change resources across the Region
- Customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region
- A bug caused the ELB control plane to attempt to scale the recovering ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests that began to backlog the control plane.
- The ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones
- From GigaOm: the AWS outage resulted in a control plane backlog that prohibited customers from failing over into Availability Zones not affected by the generator failure
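To make the point about failed mitigation plans concrete, here is a minimal sketch of the kind of recovery script a customer might run after losing an Availability Zone. It is written against the current boto3 SDK purely for readability (the 2012-era tooling differed, but the underlying EC2 control plane operations are the same), and the AMI ID, volume size, device name, and zone name are hypothetical placeholders.

```python
# Illustrative sketch (not from the Amazon post-mortem): every step of a typical
# "re-launch in another Availability Zone" recovery path is a control plane call.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def relaunch_in_healthy_az(ami_id, instance_type, healthy_az):
    # Control plane: request replacement capacity in an unaffected AZ.
    run = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": healthy_az},
    )
    instance_id = run["Instances"][0]["InstanceId"]

    # Control plane: create and attach a replacement EBS volume.
    volume = ec2.create_volume(AvailabilityZone=healthy_az, Size=100)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.attach_volume(
        VolumeId=volume["VolumeId"], InstanceId=instance_id, Device="/dev/sdf"
    )
    return instance_id

# Placeholder arguments for illustration only:
# relaunch_in_healthy_az("ami-12345678", "m1.large", "us-east-1b")
```

Every call in this sketch (RunInstances, CreateVolume, AttachVolume) goes through the Region-wide control plane that was backlogged during the event, so a script like this would have stalled even though the target Availability Zone itself was healthy.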
Some AWS-Hosted Companies Affected
- Netflix
- Heroku
More on Netflix
Why did Netflix go down?
- Amazon control plane issues (see above)
- Problems with [Netflix's] load-balancing architecture that ended up compounding the problem by “essentially caus[ing] gridlock inside most of our services as they tried to traverse our middle-tier.”
- Chaos Gorilla, the Simian Army member tasked with simulating the loss of an Availability Zone, was built for exactly this purpose. This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army (a rough sketch of the idea follows this list).
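Netflix describes Chaos Gorilla only at a high level, so the following is a hypothetical sketch of the core idea rather than Netflix's implementation: terminate every instance a given application runs in one Availability Zone and verify that traffic shifts cleanly to the surviving zones. The region, zone name, and `app` tag below are assumptions for illustration.

```python
# Hypothetical Chaos Gorilla-style drill: kill one application's instances in a
# single Availability Zone to approximate the June 29 power loss in one zone.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def simulate_az_loss(target_az, app_tag):
    # Find the application's running instances in the targeted zone.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "availability-zone", "Values": [target_az]},
            {"Name": "tag:app", "Values": [app_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [
        inst["InstanceId"] for r in reservations for inst in r["Instances"]
    ]
    if instance_ids:
        # Terminating them simulates the sudden loss of the zone's capacity.
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids

# e.g. simulate_az_loss("us-east-1a", "api")  -- zone and tag are placeholders
```

Note that a drill like this exercises instance loss but not control plane loss; the describe and terminate calls themselves depend on the same Region-wide control plane that was degraded on June 29, which is one reason the outage highlighted the need for additional tools and use cases.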
What went right for Netflix?
- Regional isolation contained the problem to users being served out of the US-EAST region. Our European members were unaffected.
- Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability. (See the quorum sketch below.)
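A brief sketch of why losing one third of the regional nodes was survivable, assuming the commonly described layout of one replica per Availability Zone with a replication factor of 3. The keyspace, table, datacenter name, and contact point are placeholders, not Netflix's actual schema.

```python
# Hypothetical sketch: a keyspace replicated three times within one datacenter.
# With a zone-aware snitch such as Ec2Snitch, the three replicas land in three
# distinct Availability Zones.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.10"])   # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.profiles (user_id text PRIMARY KEY, name text)
""")

# LOCAL_QUORUM needs 2 of the 3 replicas to answer. Losing one Availability Zone
# removes only one replica of each row, so this read still succeeds.
query = SimpleStatement(
    "SELECT name FROM demo.profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(query, ("some-user",)).one()
```

With three replicas and LOCAL_QUORUM requiring two of them, removing one Availability Zone removes exactly one replica of each row, so both reads and writes continue without data loss.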
Sources used
- https://aws.amazon.com/message/67457/
- https://gigaom.com/cloud/some-of-amazon-web-services-are-down-again
- Amazon Services Dashboard