Colo facilities: don’t trust those generators, and back up your backup plans

OK, I should be in bed by now. But as I was browsing my blogroll, I noticed Dave Sifry’s post relating his not-so-cool weekend spent fixing Technorati’s infrastructure after a fire at his colo. Unusual? Unique? Hardly so…

This must be the tenth story I have heard about a colo facility that had (supposedly) all the required redundancy to "ensure" reliability, including the (infamous) diesel generators that kick in to take over from the short-term UPSs. The issue is that those generators never seem to kick in (I hope that hospitals use a different brand than office buildings and datacenters).

One of my former portfolio companies, an ASP, faced exactly the same issue; it did not go public about it, but wrote to its rather unhappy clients, and the note from the CEO contained statements very similar to Dave’s. Here is a brief excerpt, from which I have only removed named references:

On Monday morning at 9.45am there was a complete power outage at …, our datacenter provider. Although our Uninterruptible Power Supplies (UPSs) were triggered and ran, …’s diesel generators failed to start before the batteries in the UPSs ran out. As a result, all our servers abruptly lost power. The power was restored by 10.45am, but some severe damage had been done to our infrastructure by the abrupt power failure, and the surge which took place when power was restored.
[Pages of detailed explanations deleted]
On behalf of the whole … team, I would like to apologize most sincerely to our clients for this severe lapse in our service. Rest assured that we will be working extremely hard in the coming days and weeks to review every aspect of the resilience of our service, and to ensure that an incident like this cannot happen again.

Interestingly similar, ain’t it?

As to the underlying issue, the disaster recovery plan, it seems to be a common mistake to believe that some level of redundancy automatically yields reliability. As a result, a lot of time is generally spent designing a technical infrastructure (comms, power, servers, disks, backups,…) that will "always" work, as opposed to defining the processes and procedures to apply when the s..t hits the fan, the database is completely corrupted, and the servers won’t restart. In other words, when Murphy’s law kicks in (aka the "buttered slice of bread theorem", or "Théorème de la Tartine Beurrée", which states that if you put some spread on a slice of bread and the slice falls, it has a %$@^%*$ tendency to repeatedly land on the wrong side).

I am no pro in datacenter operations, but here are a few simple rules that seem useful to apply:
- Assume the worst: don’t count on those diesel generators to kick in. Expect that a surge will fry a portion of your hardware when power comes back. Expect that your backups will be damaged or unusable. Why? Because this is what happens all the time.
- Establish a complete checklist, and procure diagnostic tools: if, after restarting, the system sort of works but not fully, do you know where to start your investigation? Do you know which tools will help you validate each layer of your infrastructure? What if one of the routers has lost its configuration and is sending bogus routing information?
- Have spare parts, or spare equipment, available: disks, power supplies, network cards,… don’t cost much, but not having the required model handy for a swap might put you out of service for hours or days.
- Know how to deal with those backups: how many times have you been told that, for "unknown/mysterious/fortuitous" reasons, the backup that contained your precious file was corrupted, and that the only version that could be recovered was… 3 weeks old? Daahhh? Unfortunately that happens all the time. So think through what it takes to get those backups to actually restore, and implement those procedures.
- Two of everything: two different sites, hooked to different power distribution centers and comms providers, running a mirrored infrastructure (load balancing, master/slave, or manual switchover), beat a single site. It costs more and it is a pain to manage, but at the end of the day that is the level of service clients may expect from you.
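The backup rule above is the easiest one to automate: don’t wait for a disaster to find out a backup is corrupted, restore it on a schedule and check it. Here is a minimal sketch in Python of such a restore test, assuming tar archives and a checksum manifest recorded at backup time (both are assumptions for illustration, not what any of the companies mentioned here actually used):

```python
import hashlib
import os
import tarfile
import tempfile

def sha256_of(path):
    """Checksum a file in chunks so large backups don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(archive_path, expected_checksums):
    """Restore the archive into a scratch directory and compare each
    file against the checksum recorded at backup time.
    Returns the list of files that are missing or corrupted;
    an empty list means the backup actually restores."""
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path) as tar:
            tar.extractall(scratch)
        for rel_path, expected in expected_checksums.items():
            restored = os.path.join(scratch, rel_path)
            if not os.path.exists(restored) or sha256_of(restored) != expected:
                failures.append(rel_path)
    return failures
```

Run something like this nightly against the most recent archive and alert on any non-empty result; a restore test that only runs during an outage is exactly the "3 weeks old" surprise described above.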

And oh, one last thing: just because it has happened to you once does not mean it won’t happen again. The company I mentioned… had exactly the same problem a few months later, as they were installing their new "redundant infrastructure". So when you have identified a flaw in your disaster recovery plan, patch it immediately, and plan for a proper long-term fix.