Josh-D. S. Davis

Grumble about LJ powerloss again
updated Monday, 2005-01-17
OK, so LJ was down because of a power failure. Parts of the site were down from Friday until Monday.
The current powerloss doc says, "We're going to be buying a bunch of rack-mount UPS units on Monday so this doesn't happen again."

The same thing was said in October 2001, when LJ had a similar power failure.


Here is a copy of the text from 2001:
Evan Martin (evan) wrote in lj_maintenance,
@ 2001-10-30 18:21:00

As some of you may have noticed, the site was down for much of today.

What happened? A power failure, apparently.

This surprised us: Internap, our host, has redundancy everywhere: multiple network connections, power grids, backup generators... even the building has some certifiable safety against earthquakes. Our servers were connected to both power grids. We didn't anticipate a power loss. Internap is really a great company, and we've had no problems with them for the past year.

There's one weakness in Internap's system, though, and that was discovered today. In the case of a fire, they need to be able to turn off all of the power running into the building completely so there's no possibility of firefighters being electrocuted.

There's a big red button in a glass box in some room somewhere that does this. Somebody took a visitor near that box. The visitor mistook it for the door-release button...

...pow.

(The button is now labelled.)


The site took a long time to come back because bradfitz, our dedicated leader, ran a full integrity check on the database. A database is split into two separate files: the actual data, and the indexes into the data.
The indexes were corrupted, but the data is fine. Do not worry. :)
In the worst possible case, we can recover lost data from a backup, but any weirdness you may encounter is more likely related to the broken indexes.
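
[Editorial aside: for the curious, here is roughly what that index check looks like on a MyISAM-era MySQL setup, where each table keeps its data in a .MYD file and its indexes in a separate .MYI file. This is a minimal sketch with made-up connection details and no error handling, not LJ's actual tooling.]

    # Sketch: check every MyISAM table and rebuild its indexes if corrupted.
    # Host, credentials, and database name below are hypothetical.
    import pymysql

    conn = pymysql.connect(host="localhost", user="admin",
                           password="secret", database="livejournal")
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        for (table,) in cur.fetchall():
            # CHECK TABLE validates the .MYI index file; the row data
            # itself lives in the separate .MYD file.
            cur.execute("CHECK TABLE `%s`" % table)
            msg = cur.fetchall()[-1][3]          # Msg_text of the last row
            if msg != "OK":
                # REPAIR TABLE rebuilds the indexes from the data file.
                cur.execute("REPAIR TABLE `%s`" % table)
                print(table, "repaired:", cur.fetchall()[-1][3])
            else:
                print(table, "OK")
    conn.close()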


How can we handle this in the future?
- We're buying an Uninterruptible Power Supply, so our machines can handle power failures gracefully [see the sketch after this list]. We thought one wouldn't be necessary, but it seems it would be best to stay on the safe side.
- (Technical digression:) A couple file systems needed fsck'ing. We may move to ext3 in the future.
- A few of the machines weren't configuring themselves properly when they booted. dormando has been fixing this.
- Once this is in place, we can test by unplugging our machines from the power.
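
[Editorial aside: "handle power failures gracefully" in the first bullet basically means noticing you're on battery and shutting down cleanly before the battery runs out. A bare-bones sketch of that kind of monitor is below; it assumes Network UPS Tools' upsc client, and the UPS name and polling interval are made up.]

    # Sketch: poll a UPS via NUT's upsc and shut the box down cleanly once
    # it is running on battery. UPS name and interval are hypothetical.
    import subprocess
    import time

    UPS = "rackups@localhost"   # hypothetical NUT UPS name

    def on_battery():
        # "upsc <ups> ups.status" prints e.g. "OL" (online) or "OB" (on battery)
        out = subprocess.run(["upsc", UPS, "ups.status"],
                             capture_output=True, text=True, check=True)
        return "OB" in out.stdout

    while True:
        if on_battery():
            print("UPS on battery -- shutting down cleanly")
            subprocess.run(["shutdown", "-h", "now"], check=False)
            break
        time.sleep(10)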

Thanks for your patience. For once, it's not our fault! :)

Google Cache copy

Here is a copy of the text from 2005:
Brad Fitzpatrick (bradfitz) wrote in lj_maintenance,
@ 2005-01-16 14:31:00

A public power loss post-mortem is postponed pending the pensive completion of my pitiful weekend.

But yeah, we'll have a lot to say tomorrow. For now, enjoy your LJ. I apologize for the downtime. Please report any problems to support, not here in comments.

I need to get outdoors.


Here is a copy of the powerloss status text from 2005:
Our data center (Internap, the same one we've been at for many years) lost all its power, including redundant backup power, for some unknown reason. (unknown to us, at least) We're currently dealing with verifying the correct operation of our 100+ servers. Not fun. We're not happy about this. Sorry... :-/ More details later.

Update #1, 7:35 pm PST: we have power again, and we're working to assess the state of the databases. The worst thing we could do right now is rush the site up in an unreliable state. We're checking all the hardware and data, making sure everything's consistent. Where it's not, we'll be restoring from recent backups and replaying all the changes since that time, to get to the current point in time, but in good shape. We'll be providing more technical details later, for those curious, on the power failure (when we learn more), the database details, and the recovery process. For now, please be patient. We'll be working all weekend on this if we have to.
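
[Editorial aside: "restoring from recent backups and replaying all the changes since that time" is standard MySQL point-in-time recovery: load the last full dump, then feed the binary logs from the backup point onward back into the server. A minimal sketch follows; the file names, start timestamp, and credentials are all made up, and LJ's real process was certainly more involved.]

    # Sketch: MySQL point-in-time recovery -- restore the last full backup,
    # then replay binary logs from the backup timestamp to the present.
    # File names, timestamp, and credentials below are hypothetical.
    import glob
    import subprocess

    MYSQL = ["mysql", "-u", "root", "-psecret"]

    # 1. Load the most recent full dump.
    with open("backup-2005-01-14.sql", "rb") as dump:
        subprocess.run(MYSQL, stdin=dump, check=True)

    # 2. Replay every change logged since that dump was taken.
    binlogs = sorted(glob.glob("mysql-bin.0*"))
    replay = subprocess.Popen(
        ["mysqlbinlog", "--start-datetime=2005-01-14 04:00:00"] + binlogs,
        stdout=subprocess.PIPE)
    subprocess.run(MYSQL, stdin=replay.stdout, check=True)
    replay.stdout.close()
    replay.wait()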

Update #2, 10:11 pm: So far so good. Things are checking out, but we're being paranoid. A few annoying issues, but nothing that's not fixable. We're going to be buying a bunch of rack-mount UPS units on Monday so this doesn't happen again. In the past we've always trusted Internap's insanely redundant power and UPS systems, but now that this has happened to us twice, we realize the first time wasn't a total freak coincidence. C'est la vie.

Update #3: 2:42 am: We're starting to get tired, but all the hard stuff is done at least. Unfortunately a couple machines had lying hardware that didn't commit to disk when asked, so InnoDB's durability wasn't so durable (through no fault of InnoDB). We restored those machines from a recent backup and are replaying the binlogs (database changes) from the point of backup to present. That will take a couple hours to run. We'll also be replacing that hardware very shortly, or at least seeing if we can find/fix the reason it misbehaved. The four of us have been at this almost 12 hours, so we're going to take a bit of a break while the binlogs replay... Again, our apologies for the downtime. This has definitely been an experience.
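
[Editorial aside: the "lying hardware" bit is worth unpacking. A drive or RAID controller with a volatile write-back cache can acknowledge a flush while the data is still only in cache, so a database that believes a transaction is committed can lose it the instant power drops. The durability contract looks like the sketch below; the only honest way to catch a liar is to cut power mid-test and compare afterward.]

    # Sketch: the durability contract InnoDB depends on. Once fsync() returns,
    # the bytes are supposed to survive a power cut; "lying" hardware acks the
    # flush while the data still sits in a volatile write cache.
    import os

    def durable_append(path, record):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, record)
            os.fsync(fd)   # ask the OS and the drive to flush to stable media
        finally:
            os.close(fd)

    # Callers may assume this record survives a crash -- unless the disk lied.
    durable_append("/var/tmp/commit.log", b"txn 42 committed\n")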

Update #4: 9:12 am: We're back at it. We'll have the site up soon in some sort of crippled state while the clusters with the oldest backups continue to catch up.

Update #5: 1:58 pm: approaching 24 hours of downtime... *sigh* We're still at it. We'll be doing a full write-up when we're done, including what we'll be changing to make sure verify/restore operations don't take so long if this is ever necessary again. The good news is the databases already migrated to InnoDB did fine. The bad news (obviously) is that our verify/restore plan isn't fast enough. And also that some of our machines' storage subsystems lie. Anyway, we're still at it... it's taking so long because we're making sure to back up even the partially out-of-sync databases that we're restoring, so that if we encounter any problems down the road with the restored copy, we'll be able to merge them. And unfortunately backups and networks are too slow.

Update #6: We're up again, but only partially. Some database clusters are still reconstructing/syncing. See status.livejournal.com.


And finally, a local copy of the related status page from 2005:
At 7:48 pm GMT on Sunday, January 16th, Admin matthew writes:

As you're probably aware, our colocation facility suffered a massive power outage last night. For more detailed information about the incident, please see our power loss status page.

We have all database clusters back online at this time. We are monitoring the situation to catch any issues that might crop up.

Please contact LiveJournal Support in the event there is anything missing from your journal from before the downtime. We have tools to recover missing data from backups, if necessary. Unfortunately, if you are on the Chef cluster, any updates you made between approximately 5PM PST and 8PM PST were not recorded in the database and have therefore been lost and cannot be recovered.