I hate technology.

Oct 19, 2007 23:14



My day has not gone well.



So, the prelude to a bad day started with this innocent message from
velius, sometime early this morning:

Oct 19 01:12:10 velius kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 19 01:12:10 velius kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=194142885, high=11, low=9593509, sector=194142885
Oct 19 01:12:10 velius kernel: ide: failed opcode was: unknown
Oct 19 01:12:10 velius kernel: end_request: I/O error, dev hde, sector 194142885
Oct 19 01:12:19 velius kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 19 01:12:19 velius kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=194142885, high=11, low=9593509, sector=194142885
Oct 19 01:12:19 velius kernel: ide: failed opcode was: unknown
Oct 19 01:12:19 velius kernel: end_request: I/O error, dev hde, sector 194142885

That's all that velius bothered to say; it sent mail reporting the
problem, then went back to sleep for a short while. But at 2:57 PDT, all hell
broke loose:

Oct 19 02:57:40 velius kernel: hde: dma_timer_expiry: dma status == 0x21
Oct 19 02:57:50 velius kernel: hde: DMA timeout error
Oct 19 02:57:50 velius kernel: hde: dma timeout error: status=0xd0 { Busy }
Oct 19 02:57:50 velius kernel: ide: failed opcode was: unknown
Oct 19 02:57:50 velius kernel: hde: DMA disabled
Oct 19 02:58:25 velius kernel: ide2: reset timed-out, status=0xd0
Oct 19 02:58:25 velius kernel: hde: status timeout: status=0xd0 { Busy }
Oct 19 02:58:25 velius kernel: ide: failed opcode was: unknown
Oct 19 02:58:25 velius kernel: hde: drive not ready for command
[...]
Oct 19 02:58:55 velius kernel: end_request: I/O error, dev hde, sector 121667077
Oct 19 02:58:55 velius kernel: Buffer I/O error on device hde7, logical block 30968
Oct 19 02:58:55 velius kernel: lost page write due to I/O error on hde7
Oct 19 02:58:55 velius kernel: end_request: I/O error, dev hde, sector 212383301
Oct 19 02:58:55 velius kernel: Buffer I/O error on device hde7, logical block 11370496
Oct 19 02:58:55 velius kernel: lost page write due to I/O error on hde7
[...]

That's my 500GB drive, good old hde, annihilating
itself. And those messages are from the local backup archive being
destroyed beyond recognition. I'm crossing my fingers that the offsite
backups are okay. By the time I woke up the drive was completely
unreadable. Backups, gone. Video archive, gone. Audio archive, gone.
Fortunately this wasn't the boot drive, and the archival data on this drive
exists in either DVD or CD form. I was relatively unconcerned as I walked
out the door this morning to begin my day of breaking hardware designs,
somewhat bitter that my hardware had gotten a head start on me.
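
As an aside, the numbers in those kernel messages are internally consistent, and a few lines of Python make that visible. This is only a sketch under stated assumptions: the bit names beyond the three that appear in the log are what the old IDE driver is believed to print, the helper names (decode_status, combine_lba, infer_partition_layout) are made up for illustration, and the last step assumes each "Buffer I/O error" line refers to the end_request sector logged just before it.

```python
# Status-register bits, spelled the way the log lines above spell them.
# Names other than DriveReady, SeekComplete, Error and Busy are assumptions.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of all bits set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS if value & bit]


def combine_lba(high, low):
    """Join the 'high' and 'low' log fields, treating 'low' as the bottom 24 bits."""
    return (high << 24) | low


def infer_partition_layout(sample_a, sample_b):
    """Given two (disk_sector, logical_block) pairs from the same partition,
    return (sectors_per_block, partition_start_sector)."""
    (sector_a, block_a), (sector_b, block_b) = sample_a, sample_b
    sectors_per_block, remainder = divmod(sector_b - sector_a, block_b - block_a)
    if remainder:
        raise ValueError("samples do not fit a whole number of sectors per block")
    return sectors_per_block, sector_a - block_a * sectors_per_block


status = decode_status(0x51)
lba = combine_lba(high=11, low=9593509)
layout = infer_partition_layout((121667077, 30968), (212383301, 11370496))
print(status, lba, layout)
```

The status byte 0x51 breaks down into exactly the three flags the kernel printed, and high=11, low=9593509 recombine to sector 194142885. If the pairing assumption holds, the two hde7 errors work out to 8 sectors per filesystem block, with the partition starting at disk sector 121419333.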

One ReadyNAS NV+ in a 2x500GB Seagate configuration was shipped FedEx
next-day from Florida, approximately four hours after I first discovered the
problem. So much for keeping this month's credit card bill low.

I came home this evening to begin data recovery efforts, with the aid of
an old add-in SATA card in quistis. The drive was spinning up just
fine, no abnormal sounds, and I was hopeful that the failure was actually
the onboard SATA controller in velius. Alas, quistis
quickly reproduced the access problem. I was able to mount partitions 1, 2
and 5 (all of which are empty), but attempts to mount any higher partition
consistently failed. Not so great news. The first error was always a
device fault, only recoverable via a hard system reset.

While the mount attempts failed miserably, I did find that I was able to
dd data directly off of the partitions. Not sure if it's the
linear seek or the complete lack of write activity, but no errors have
surfaced in any of my dd attempts. This may be good news for data
recovery, because as soon as I have someplace to put a 250 GB ext3
filesystem image, I should be able to mount it with a loopback and start
restoring data. Hopefully when the NAS arrives on Monday I'll be able to
dd filesystem images to a RAID array and pull data off from there.
Great news!
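
For illustration, here is roughly what that kind of sequential, read-only imaging pass looks like when written out in Python instead of dd. It is a minimal sketch, not what was actually run: the function name image_device is invented here, zero-filling unreadable blocks is a design choice borrowed from the usual dd rescue idiom, and the paths in the trailing comment are hypothetical.

```python
import os


def image_device(src_path, dst_path, block_size=1024 * 1024):
    """Sequentially copy src_path to dst_path, zero-filling unreadable blocks.

    Returns (bytes_written, offsets_of_blocks_that_failed_to_read).
    """
    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)  # read-only: never write to the sick drive
    try:
        size = os.lseek(src, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        with open(dst_path, "wb") as dst:
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    chunk = os.pread(src, want, offset)
                except OSError:
                    # Unreadable region: note it and pad so later offsets line up.
                    bad_offsets.append(offset)
                    chunk = b"\0" * want
                if not chunk:
                    break  # source ended earlier than its reported size
                dst.write(chunk)
                offset += len(chunk)
    finally:
        os.close(src)
    return offset, bad_offsets


# Hypothetical usage; both paths are placeholders:
# copied, bad = image_device("/dev/hde7", "/mnt/nas/hde7.img")
```

Once an image like that exists, mounting it read-only through a loop device (something like mount -o loop,ro on the image file) is the step the plan above depends on. An empty list of bad offsets would match the error-free dd runs described here.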

My upbeat mood lasted right up until I tried to power velius up
again, sans the now-removed hde. Go ahead, guess what happened.

AddrMarkNotFound errors on the boot drive, hda. Kernel panic
when it utterly failed to mount the root filesystem. Reproducible error.
Guaranteed data loss if this drive fails. You see, this drive has
all of the data that's been waiting to go out to DVD. Stuff that would suck
to lose. It also has all of the data that is most likely to have changed
since the last successful nightly backup. Pacolyn, the fridge, webserver
files, photos... and of course, the root filesystem.

Unlike hde, this drive is clicking noisily now. Great.

Fortunately, after five tries I managed to get velius to
reliably read from hda again, but the drive is most definitely compromised
now. I now have to find a home for 80GB of data, much of which was waiting
to be burned to DVD (and is thus not backed up). I don't have that much
space left locally. Great, just... great.

The frustrating thing?

I was going to do the DVD burns this weekend.

*sigh*

Previously.

technology, bugs
