Power-loss post-mortem

Jan 20, 2005 02:48

The post you've all been waiting for! Why we lost all our power, and why it took us so long to come back up afterwards....

(Warning: it's late and I'm tired/rambly, so this post might be incoherent... go ahead and ask questions...)

Why we lost all our power...
Another customer in the facility accidentally pressed the EPO button, then reset it, replaced the protective case, and left the building. Internap at first thought it was their UPS systems failing, but then they logged into them, saw EPO shutdown notifications, and couldn't find any EPO cases open or buttons pressed, so they probably freaked out for a bit thinking a short in the walls had triggered the EPO, only to get a confession a day or so later.

EPO, by the way, stands for Emergency Power Off. National fire/electrical code requires these big red buttons near all the exits so firefighters can cut all power to the entire data center. This is the second time this has happened to us in the years I've been there. The first time, the button was unlabeled and unprotected and some dude thought it opened the door. This time we have no clue why it was pressed... maybe that dude tripped and fell onto it... mystery.

Internap will be putting alarms and tamper-proof indicators on the plastic cages surrounding the EPO buttons now, though, so at least if this happens again in the future they'll know why.

Anyway, moving on....

Why it took us so long to come back up...
Ton of reasons:

Faulty mobos/NICs: We have 9 machines with faulty motherboards whose embedded NICs don't do auto-negotiation properly. They only work with certain switches, so they reboot fine, but then their gigabit network comes up at 100 Mbps half duplex or something else that doesn't work. To get them back up, somebody at the NOC has to plug them into a compatible switch, let them auto-negotiate, then move them back to their real switch. Forcing the speed/duplex settings on the host and/or the switch doesn't work either.... most annoying. We're getting working dual-port gigabit PCI NICs for those machines rather than replacing their motherboards.
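
For the curious, spotting a bad link is easy enough to script. Here's a rough Python sketch (not our actual tooling) that shells out to Linux's ethtool and flags anything that didn't come up gigabit/full duplex; the interface name and "good" values are placeholders:

```python
#!/usr/bin/env python
# Sketch: detect a NIC that came up at the wrong speed/duplex, assuming
# Linux and the ethtool CLI. Needs root to read link settings.
import subprocess
import sys

def link_status(iface):
    """Parse Speed/Duplex out of `ethtool <iface>` output."""
    out = subprocess.run(["ethtool", iface], capture_output=True,
                         text=True, check=True).stdout
    speed = duplex = None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Speed:"):
            speed = line.split(":", 1)[1].strip()    # e.g. "1000Mb/s"
        elif line.startswith("Duplex:"):
            duplex = line.split(":", 1)[1].strip()   # e.g. "Full"
    return speed, duplex

if __name__ == "__main__":
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"   # placeholder default
    speed, duplex = link_status(iface)
    if speed != "1000Mb/s" or duplex != "Full":
        print("BAD LINK on %s: %s %s (needs the switch shuffle)" % (iface, speed, duplex))
        sys.exit(1)
    print("link OK on %s: %s %s" % (iface, speed, duplex))
```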

Database start-up: All but a couple of our machines came back up when power was restored less than an hour later, but on our database machines we intentionally don't start the database back up on boot. In normal circumstances, if a single machine dies, it died for a reason and we want to investigate before it rejoins. Normally that doesn't matter either, because we have 2+ of everything. But when every single database machine reboots at once, that leaves us with no live databases, and we have to manually start them all.
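
The "manually start them all" part isn't fancy, just tedious. Something along these lines (a sketch only, assuming passwordless ssh and a stock init script; host names and paths are placeholders, and in reality the validation steps below happen before anything gets started):

```python
#!/usr/bin/env python
# Sketch: bring mysqld up on every DB host, leaving failures for a human.
import subprocess

DB_HOSTS = ["db1", "db2", "db3"]   # placeholder host list

for host in DB_HOSTS:
    print("starting mysqld on %s" % host)
    rc = subprocess.call(["ssh", host, "/etc/init.d/mysql start"])
    if rc != 0:
        # a DB that won't start (or shouldn't start yet) needs a human
        print("  FAILED on %s (exit %d), leaving it down for investigation" % (host, rc))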

Data validation: We could've just blindly started all the databases and trusted that they worked, but we didn't trust them. We ran InnoDB tablespace checksum checks on everything, and also backed up a lot of the databases before we even tried to bring them back up. (The act of bringing them back up modifies the tablespace, and we didn't want them messing themselves up any worse, so the pre-backup was pure paranoia...)
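
Conceptually, the pre-start routine looked something like this sketch (assuming MySQL's offline innochecksum utility and shared tablespace files; the paths and backup location are placeholders, not our real layout):

```python
#!/usr/bin/env python
# Sketch: copy raw tablespaces aside, then verify InnoDB page checksums
# offline, all before mysqld is allowed to touch anything.
import glob
import subprocess

BACKUP_DIR = "/backups/pre-start"                      # placeholder
TABLESPACES = glob.glob("/var/lib/mysql/ibdata*")      # placeholder layout

# 1. copy the raw tablespace files aside BEFORE mysqld touches them
subprocess.check_call(["mkdir", "-p", BACKUP_DIR])
for ts in TABLESPACES:
    subprocess.check_call(["cp", "-a", ts, BACKUP_DIR])

# 2. verify InnoDB page checksums offline; a non-zero exit means a bad page
for ts in TABLESPACES:
    rc = subprocess.call(["innochecksum", ts])
    print("%s: %s" % (ts, "OK" if rc == 0 else "CHECKSUM FAILURE"))
```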

MyISAM vs InnoDB: When you lose power to a MySQL database with MyISAM tables, the indexes generally get messed up and need to be rebuilt. Fortunately, almost all our databases are purely InnoDB nowadays, so that wasn't a huge problem. Unfortunately, the global DB (which is required even to get the site up in partial mode, where some users are available and others aren't) is still about 5% MyISAM, and we just hadn't gotten around to converting those few remaining tables to InnoDB yet. So every machine in the global cluster required index rebuilds and data checks. That was annoying. The Chef user cluster was also MyISAM (our last MyISAM user cluster), so rather than trust Chef, we restored from an old Chef backup and replayed binlogs to catch it up. That took some time.
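
Both of those chores are doable with standard MySQL tools; roughly like this sketch (not literally our commands -- the datadir, database name, and binlog names are placeholders):

```python
#!/usr/bin/env python
# Sketch: rebuild MyISAM indexes, then replay binlogs into a restored backup.
import glob
import subprocess

# 1. rebuild MyISAM indexes after the unclean shutdown (mysqld stopped);
#    "global" here is just a placeholder database/datadir name
for myi in glob.glob("/var/lib/mysql/global/*.MYI"):
    subprocess.check_call(["myisamchk", "--recover", myi])

# 2. catch a restored backup up to the present by replaying the binlogs
#    written after the backup was taken (in practice you start from the
#    binlog position recorded with the backup; names are placeholders)
for log in sorted(glob.glob("/var/lib/mysql/binlog.0*")):
    dump = subprocess.Popen(["mysqlbinlog", log], stdout=subprocess.PIPE)
    subprocess.check_call(["mysql"], stdin=dump.stdout)
    dump.stdout.close()
    dump.wait()
```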

Disk cache issues: We have battery-backed RAID cards with write-back caches. That means the RAID card acknowledges writes and tells the OS (and thus the DB) they're done immediately, before they're actually on disk. This speeds up the DB. Normally those writes would be lost if you lose power, which is why we have battery backups on all the cards, and we even monitor battery health with our automated checks. But unknown to us, the RAID cards didn't disable the write caching on the drives themselves.... which is frickin' useless! If the controller is already lying to the OS (but doing it safely!), why should the disks behind the controller also lie, but unsafely, for minimal benefit? Our bad there. We should've had that right. So a couple of machines were just gibberish afterwards and had to be restored from backup, with binlogs replayed to catch the backups up to the present.
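
For drives the OS can address directly, checking and killing the on-drive write cache is a one-liner with hdparm; behind a hardware RAID controller you generally have to use the vendor's own CLI instead, so treat this as a sketch of the idea rather than what we actually ran (device names are placeholders):

```python
#!/usr/bin/env python
# Sketch: report each drive's write-cache setting and turn it off if enabled.
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb"]    # placeholders

for dev in DRIVES:
    # `hdparm -W <dev>` with no value just reports, e.g. "write-caching =  1 (on)"
    out = subprocess.run(["hdparm", "-W", dev], capture_output=True, text=True).stdout
    print(out.strip())
    if "(on)" in out:
        # -W0 turns the drive's volatile write cache off
        subprocess.check_call(["hdparm", "-W0", dev])
```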

Binlog syncing: We weren't using the option to sync binlogs to disk, so for the clusters we had to restore from backup, we lost a small number of transactions from right before the power loss. Regrettably, we won't be able to get those posts or comments back.
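
The option in question is MySQL's sync_binlog (and its InnoDB cousin, innodb_flush_log_at_trx_commit). The settings themselves live in my.cnf; here's a trivial sketch of verifying what a running server is actually using, via the mysql CLI:

```python
#!/usr/bin/env python
# Sketch: check the durability-related settings on a running MySQL server.
# You want both of these at 1 for fsync-per-transaction behavior.
import subprocess

for var in ("sync_binlog", "innodb_flush_log_at_trx_commit"):
    out = subprocess.run(["mysql", "-e", "SHOW VARIABLES LIKE '%s'" % var],
                         capture_output=True, text=True).stdout
    print(out.strip())
```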

Slaves tuned for speed: A lot of slave servers (mostly in the global cluster, since the user clusters are almost all master-master now) were tuned for speed, with unsafe filesystem/sync options that favored speed over reliability. Which is normally okay, since you'd lose one machine, not all of them. But we lost all of them, so restoring them all from good slaves was time-consuming.

Things we're doing to avoid this crap in the future...
We're:

-- getting working NICs in those 9 machines

-- taking advantage of the redundant power supplies all our DBs have: we'll be plugging one side into Internap's power and the other side into our own UPS, which is itself plugged into Internap's other power grid. that way if the EPO is pressed, we'll have 1-4 minutes to do a clean shutdown. (but if we do the rest of this stuff right, this step isn't even required, UPSes included... in theory... but the UPSes would be comforting)

-- disabling the disk caches behind all our RAID controllers. (bleh... wanna kick ourselves for this, but also the RAID vendors for defaulting to and/or even allowing it in a BBU write-back setup) we're also testing all existing and new hardware to make sure data really makes it to disk: pulling power in the middle of write-heavy operations, then comparing the resulting disk image with the expected result (a sketch of that test is at the end of this list)

-- finishing our MyISAM-to-InnoDB migration, so we don't have to deal with MyISAM index rebuilds

-- enabling binlog sync options

-- no longer tuning slaves for speed. this used to matter, but we don't really do the slave thing as much as we used to, so the gain isn't worth it.

-- building a user-level backup harness. we already have a tool to back up a single user's data to a GDBM file, incrementally (so if we run it a day later against the same file, it only writes the changes). we plan to wrap a major harness around that tool and be backing up all users, all the time. this means that in the event of a future major corruption/outage, we'll be able to restore user-at-a-time, even to a different database cluster than the one they came from. it also means we can prioritize recovery based on account status, popularity, attempted activity, etc. (a rough sketch of the harness is also at the end of this list.) (and yes, we'll continue doing system-level backups as well, but it's good to have data in different formats just to be paranoid...)

-- also, we already bought a bunch more disk space, which we installed today, so we have more room to do backups for bizarre non-typical reasons and don't have to compress/shuffle so much stuff around just to make room.
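
Since two of the items above reference sketches: here's roughly what the pull-the-plug test looks like. Run the writer on the array under test, yank the power cord mid-run, then after reboot verify that every record the writer claimed was durable (the last number it printed) actually made it to disk. This is a sketch; the file path and record format are arbitrary:

```python
#!/usr/bin/env python
# Sketch: plug-pull durability test. "write" mode appends fsync'd numbered
# records forever; "verify <last_printed>" checks the surviving file.
import os
import struct
import sys

PATH = "/mnt/testdisk/plugpull.dat"   # placeholder: a file on the array under test
RECORD = struct.Struct("<Q")          # one 8-byte little-endian counter per record

def write_forever():
    # start from a fresh file each run, otherwise the counters won't line up
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    n = 0
    while True:
        os.write(fd, RECORD.pack(n))
        os.fsync(fd)          # the whole storage stack now claims record n is on disk
        print(n)              # last number printed = last record guaranteed durable
        n += 1

def verify(claimed_durable):
    with open(PATH, "rb") as f:
        data = f.read()
    records = len(data) // RECORD.size
    for i in range(records):
        (val,) = RECORD.unpack_from(data, i * RECORD.size)
        assert val == i, "record %d holds %d: corrupted or reordered write" % (i, val)
    assert records > claimed_durable, \
        "only %d records survived but %d were acknowledged: something lied" \
        % (records, claimed_durable)
    print("OK: %d records intact" % records)

if __name__ == "__main__":
    if sys.argv[1] == "write":
        write_forever()
    else:
        verify(int(sys.argv[2]))
```

And a very rough sketch of the user-level backup harness idea. The per-user incremental backup tool exists; its name and flags here ("user-backup", --userid, --out) are made up purely for illustration:

```python
#!/usr/bin/env python
# Sketch: loop over all users, most important first, calling the per-user
# incremental backup tool (hypothetical name/flags) for each one.
import subprocess

def backup_all(user_ids, priority):
    """user_ids: iterable of numeric ids; priority: id -> weight (higher = sooner)."""
    for uid in sorted(user_ids, key=priority, reverse=True):
        # re-running against the same GDBM file only writes the changes,
        # so this loop can just run continuously
        subprocess.check_call(["user-backup",
                               "--userid", str(uid),
                               "--out", "/backups/users/%d.gdbm" % uid])
```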