8 Jul 2013

Disaster at Online/Iliad DC2

Posted by iwgcr

On July 4, the DC2 datacenter of Online, a company held by Iliad, went down for three hours due to an electrical outage.

This datacenter is composed of seven independent power channels: 2 for standard hosting (A and B), 3 for critical hosting (C, D and E) and 2 for air conditioning (F1 and F2).

Online gave a full report of the incident; here are some excerpts:

10:21:14 am The EDF substation “Vitry-North” suffers a serious malfunction (explosion of the transformer), affecting Ivry sur Seine, Vitry sur Seine, Charenton and Maison Alfort. Our four high-voltage cables supplying the data center (2 active, 2 spare) are cut simultaneously.

10:21:33 am The 7 generators take over successfully, without interruption.

10:21:34 am UPS A4 and A5 of chain A go into fault, without a power cut and without consequence given the N+2 redundancy of the power chain.

10:22:45 am A first generator, dedicated to air conditioning (GE-F1), suffers an engine issue and stops with the fault “frequency out of tolerance.” The associated power chain automatically switches to the standby generator (GE-S) at 10:23:08.
During the switchover, given two cuts close together, room temperatures rise slightly, by 3°C, without consequence.

10:26:30 am A second generator, that of chain A, stops with an “electronic fault”.
The associated power chain automatically switches to the standby generator (GE-S), without consequence.

11:18:11 am The standby generator (GE-S) stops due to a major mechanical failure.
Chain A now has neither its main EDF feed (composed of 4 independent cables), nor its generator GE-A, nor the standby generator GE-S. All six possible power sources are unavailable, and the rooms are fed from the UPS batteries.

11:29:18 am The batteries of the chain A UPS units are depleted.

11:41:23 am Generator GE-A is manually restarted in override mode (faults are suppressed and ignored) to re-energize chain A. Power returns to all rooms.

11:54:20 am Power returns on our four high-voltage cables. The chains switch back to EDF without service interruption, except for chain A, which stays on generator to charge the batteries and switch over safely.

4:28:20 p.m. The UPS batteries finish charging. Chain A is manually switched back to EDF successfully.

From 11:45 a.m. to 7:30 p.m. UPS A4 and A5 are repaired, as are generators GE-F1, GE-A and GE-S. Several tests are performed to ensure proper operation of the infrastructure.
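The timestamps in the timeline make it possible to work out how long each phase lasted on chain A. The following is a minimal Python sketch of that arithmetic. The event names are labels invented here for illustration, and it assumes all events fall on the same day, as the report implies.

```python
from datetime import datetime

# Timestamps copied from the timeline; the keys are illustrative labels,
# not names used in Online's report.
EVENTS = {
    "grid_lost": "10:21:14",
    "ge_s_stopped": "11:18:11",
    "batteries_depleted": "11:29:18",
    "ge_a_restarted": "11:41:23",
    "grid_restored": "11:54:20",
}

def elapsed(start: str, end: str) -> str:
    """Return the time between two same-day events as 'Xh YYmin ZZs'."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes:02d}min {seconds:02d}s"

print("Chain A on batteries: ", elapsed("ge_s_stopped", "batteries_depleted"))
print("Chain A without power:", elapsed("batteries_depleted", "ge_a_restarted"))
print("EDF feed unavailable: ", elapsed("grid_lost", "grid_restored"))
```

Under these assumptions, the batteries carried chain A for about 11 minutes, and the rooms on chain A were without power for about 12 minutes before GE-A was restarted.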

Impact of the outage:

  • 3,000 servers and the management console were impacted by the power failure (less than 7% of the fleet).
  • At 12:20 pm, 85% of the fleet was back in service. 13 switches that had failed were replaced.
  • At 14:00, 250 servers were still unavailable, requiring hardware intervention.
  • At 14:30, only one hundred servers were still unavailable. A bug with the failover IP and IPMI was corrected.
  • At 11:00 p.m., only ten servers still required further action, particularly on their RAID cards.
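The figures above can be put in proportion with a little arithmetic. The sketch below is only an illustration: it assumes the counts of unavailable servers (250, 100 and 10) are all measured against the 3,000 impacted servers, which the report does not state explicitly. The 85% figure is left out because its baseline is ambiguous.

```python
import math

IMPACTED = 3000   # servers hit by the power failure, from the report
MAX_SHARE = 0.07  # "less than 7%" of the whole fleet

# If 3,000 servers are less than 7% of the fleet, the fleet must hold
# more than 3000 / 0.07 servers.
min_fleet = math.floor(IMPACTED / MAX_SHARE) + 1

# (time, servers still unavailable), taken from the bullets above
STILL_DOWN = [("14:00", 250), ("14:30", 100), ("23:00", 10)]

def recovered_share(still_down: int, impacted: int = IMPACTED) -> float:
    """Percentage of impacted servers back in service (assumed baseline)."""
    return round(100 * (impacted - still_down) / impacted, 1)

print(f"Fleet size implied by the 7% bound: at least {min_fleet} servers")
for time, down in STILL_DOWN:
    print(f"{time}: {recovered_share(down)}% of impacted servers recovered")
```

Read this way, more than nine out of ten impacted servers were back by 14:00, and all but a handful by 11:00 p.m.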