23 May 2013

3 difficult days for Rackspace Cloud Load Balancers

Posted by iwgcr

Rackspace Cloud Load Balancers suffered several issues early this week.

On May 19, an incident, “Cloud Load Balancers is currently experiencing a degradation of service,” was reported at 04:32 PM EDT and closed 10 minutes later. The next day, the same report was opened at 09:47 AM EDT and was not closed until 12:34 PM EDT.

Rackspace Cloud Load Balancer engineers have given more information about these issues:

Load Balancer Nodes ztn-n09 and ztn-n10 in our ORD1 data center […] experienced a rare capacity issue from a combination of active load balancers, new provisioning requests, and overall traffic. This caused both of the affected nodes (ztn-n09 and ztn-n10) to attempt to shift their traffic to the failover node (ztn-n12) simultaneously, which in turn affected network connectivity for the instances supported by ztn-n09 and ztn-n10. After several attempts to restart services for ztn-n09 and ztn-n10, engineers determined that a reboot of all four nodes in the cluster was required to restore services.

After rebooting one of the nodes, engineers discovered that the previous node failures had corrupted the global configuration files. This contributed to the inability to add the ztn-n09 and ztn-n10 back into the cluster. The corruption was corrected and the original two problem nodes were restarted and began to take traffic from ztn-n12.

On May 21, two more incidents concerning Cloud Load Balancers were reported: the first at 07:29 AM EDT, on node ztn-n05, for 30 minutes, and the second at 02:51 PM EDT, on global Cloud Load Balancers nodes, for 15 minutes.

Date        Service    Duration    Critical Data Lost
2013-05-21  Rackspace  15 minutes  no
2013-05-21  Rackspace  0.5 hours   no
2013-05-20  Rackspace  3 hours     no
2013-05-19  Rackspace  10 minutes  no