5 Mar 2013

Juniper Routers Take Down CloudFlare

Posted by iwgcr

On Sunday, March 3, 2013, at 09:47 UTC, CloudFlare dropped off the Internet. The outage affected all of CloudFlare’s services, including DNS and any services that rely on its web proxy. During the outage, anyone accessing CloudFlare.com or any site on CloudFlare’s network would have received a DNS error. Pings and traceroutes to CloudFlare’s network resulted in a “No Route to Host” error. CloudFlare currently runs 23 data centers worldwide, connected to the rest of the Internet using routers.


Matthew Prince, cofounder and CEO of CloudFlare, said:

The cause of the outage was a system-wide failure of our edge routers. CloudFlare currently runs 23 data centers worldwide. These data centers are connected to the rest of the Internet using routers. These routers announce the path that, from any point on the Internet, packets should use to reach our network. When a router goes down, the routes to the network that sits behind the router are withdrawn from the rest of the Internet.
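The announce-and-withdraw behaviour described in the quote can be illustrated with a toy model. The sketch below is not BGP and not CloudFlare's setup; the class `ToyRouteTable`, the router names `edge-a` and `edge-b`, and the documentation prefix `198.51.100.0/24` are all hypothetical, chosen only to show why a prefix becomes unreachable once every router announcing it has gone down.

```python
import ipaddress


class ToyRouteTable:
    """A deliberately simplified view of 'who announces which prefix'."""

    def __init__(self):
        # Maps a network prefix to the set of routers currently announcing it.
        self._routes = {}

    def announce(self, router, prefix):
        net = ipaddress.ip_network(prefix)
        self._routes.setdefault(net, set()).add(router)

    def router_down(self, router):
        # A failed router's announcements are withdrawn; a prefix with no
        # remaining announcer disappears from the table entirely.
        for net in list(self._routes):
            self._routes[net].discard(router)
            if not self._routes[net]:
                del self._routes[net]

    def lookup(self, address):
        # Return the routers announcing the most specific matching prefix,
        # or None when no route exists (the "No Route to Host" case).
        addr = ipaddress.ip_address(address)
        candidates = [net for net in self._routes if addr in net]
        if not candidates:
            return None
        best = max(candidates, key=lambda net: net.prefixlen)
        return sorted(self._routes[best])


table = ToyRouteTable()
table.announce("edge-a", "198.51.100.0/24")
table.announce("edge-b", "198.51.100.0/24")

table.router_down("edge-a")
print(table.lookup("198.51.100.7"))  # ['edge-b']

table.router_down("edge-b")
print(table.lookup("198.51.100.7"))  # None
```

With one router down the prefix is still reachable through the other; only when every announcing router fails, as in a system-wide crash, does the lookup come back empty.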

CloudFlare uses Juniper routers and propagates its router rules using a protocol called Flowspec. In response to an attack on the DNS of one of CloudFlare's customers, someone on the CloudFlare operations team pushed out a discard rule matching the attack profile exactly: packets between 99,971 and 99,985 bytes long.
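To see why such a rule is unusual, recall that the IPv4 Total Length field is 16 bits wide, so an ordinary IPv4 packet cannot exceed 65,535 bytes. The sketch below is a hypothetical pre-deployment sanity check, not something Flowspec or Juniper routers are known to perform; the function name `length_range_can_match` and the constant are illustrative, and the check assumes plain IPv4 traffic.

```python
# Largest value the 16-bit IPv4 Total Length field can hold.
MAX_IPV4_PACKET_BYTES = 0xFFFF  # 65,535


def length_range_can_match(min_len, max_len, max_packet=MAX_IPV4_PACKET_BYTES):
    """Return True if at least one packet of 0..max_packet bytes
    could fall inside the inclusive range [min_len, max_len]."""
    if min_len > max_len:
        return False  # empty range, nothing can match
    # The rule range and the range of possible packet sizes must overlap.
    return min_len <= max_packet and max_len >= 0


# The packet-length range from the discard rule.
print(length_range_can_match(99_971, 99_985))  # False

# A range covering common packet sizes, for comparison.
print(length_range_can_match(500, 600))  # True
```

A check like this would flag the rule as one that no IPv4 packet can ever satisfy, which is a hint that the attack profile was misreported or mistyped before the rule was propagated.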

Matthew Prince elaborates:

Flowspec accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed.

[…]

We have already reached out to Juniper to see if this is a known bug or something unique to our setup and the kind of traffic we were seeing at the time.

Source: CloudFlare post-mortem explanation

Date | Service | Duration | Critical Data Lost
2013-03-03 | CloudFlare | 1 hour | no