3 Jan 2013

Amazon apologises and explains its Christmas Eve outage

Posted by iwgcr

Last week, a Netflix outage on Christmas Eve was reported, and Netflix blamed it on Amazon Web Services. To reassure customers, Amazon has published a postmortem detailing the December 24 failure, which followed a human error that caused an accidental deletion of data.

The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted. […]  The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment.

[…]

The team attempted to restore the ELB state data to a point-in-time just before 12:24 PM PST on December 24th (just before the event began). By restoring the data to this time, we would be able to merge in events that happened after this point to create an accurate state for each ELB load balancer. Unfortunately, the initial method used by the team to restore the ELB state data consumed several hours and failed to provide a usable snapshot of the data.
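The recovery idea described in this passage, restoring a snapshot taken just before the incident and then merging in the events that happened afterwards, can be illustrated with a small sketch. This is not Amazon's implementation: the `Event` type, the `restore_and_replay` function and the dictionary-based state are hypothetical names chosen only for illustration, and it assumes every change after the snapshot was recorded as a timestamped event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical record of one change made to a load balancer."""
    timestamp: float          # when the change happened
    balancer_id: str          # which load balancer it affects
    config: Optional[dict]    # new configuration, or None for a deletion


def restore_and_replay(snapshot, snapshot_time, events):
    """Rebuild the current state from a snapshot plus the events recorded after it."""
    # Copy the snapshot so the restored backup itself is never modified.
    state = {lb: dict(cfg) for lb, cfg in snapshot.items()}
    # Keep only events newer than the snapshot, applied in chronological order.
    later = sorted(
        (e for e in events if e.timestamp > snapshot_time),
        key=lambda e: e.timestamp,
    )
    for event in later:
        if event.config is None:
            state.pop(event.balancer_id, None)   # balancer was deleted
        else:
            state[event.balancer_id] = dict(event.config)
    return state
```

The sketch only works if the snapshot itself is usable, which is exactly what the first restore attempt failed to deliver according to the postmortem.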

[…]

We have made a number of changes to protect the ELB service from this sort of disruption in the future. First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval. We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval. This would have prevented the ELB state data from being deleted in this event. This is a protection that we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data.
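The difference between persistent access and per-access approval can be shown with a toy model. Everything here is an assumption made for illustration (the `ProductionStore` class, the `approve` and `delete` methods and the change identifiers are hypothetical), and it says nothing about how Amazon's Change Management tooling actually works. The point is only that each approval is consumed by a single operation instead of staying valid indefinitely.

```python
class ApprovalRequired(Exception):
    """Raised when a destructive operation has no valid approval."""


class ProductionStore:
    """Toy data store where every delete needs its own one-time approval."""

    def __init__(self, data):
        self._data = dict(data)
        self._approved = set()   # change IDs approved but not yet used

    def approve(self, change_id):
        # In a real system this step would be performed by a reviewer.
        self._approved.add(change_id)

    def delete(self, key, change_id):
        if change_id not in self._approved:
            raise ApprovalRequired(f"no approval for change {change_id!r}")
        # Consume the approval so it cannot be reused for a later operation.
        self._approved.discard(change_id)
        self._data.pop(key, None)

    def keys(self):
        return sorted(self._data)
```

With a persistent grant, the check in `delete` would pass every time, so a maintenance script pointed at the wrong environment would run unchallenged.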

 

Date        Service  Duration  Critical Data Lost
2012-12-11  Amazon   24 hours  no