Jun 07, 2005 21:56
Yesterday, our domain controller (which is also our main file server) took a dump. I walked into the server room and found a big ugly RAID BIOS warning message, indicating one of the drives in the array could no longer be found. I selected the Continue Booting option, and the OS booted up just fine. Everything seemed to be working, so we ordered new drives next-day air and planned some downtime to put them in. To facilitate this, I was going to come in at 11am and work until 7:30pm.
At 9am, I got a call from my fellow IS worker. The remaining drive had started spitting corruption errors, and some files on the shares were unavailable. The worst had happened -- the second drive was failing. Our plan for scheduled downtime after work hours was out the window. We had to get the new drives in, NOW. I got ready very quickly, jumped in my car, and made it to work relatively soon thereafter (light traffic, huzzah).
We decided to try putting one of the new drives in, adding it to the RAID array, and letting the controller mirror the data to it. Then we could take out the old drive, put in the second new one, and mirror from new1 to new2. It seemed like a sane enough plan. The RAID controller could grab as much data as the failing disk would give up, then mirror it to the two new disks. Only problem was, the controller failed to recognize the first new disk. Furthermore, the OS decided it no longer recognized the controller at all, and wanted a new driver for it. Around this time, I began to suspect we had a controller failure on our hands.
To test the theory, we put the old, "failing" drive into my desktop computer. The filesystems showed up fine. The data showed up fine. The data copied fine. THE DISK WAS PERFECTLY FINE. It was only a controller failure! The only problem remaining was to determine whether the day of running on a dying controller had caused any file corruption. We were almost relieved that the first drive had dropped out of the array completely the previous day -- that meant we had a pristine snapshot of the filesystem, just as it was the day before yesterday.
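One way to check for that kind of corruption is to hash every file on the pristine snapshot drive and compare it against the same file on the drive that kept running. The following is only a sketch under assumptions: both drives are mounted as ordinary directory trees, and the names file_digest and compare_trees are made up for illustration. Files that users legitimately changed during the day will also show up as differing, so the output is a list of things to look at, not a list of damaged files.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(snapshot_root, suspect_root):
    """Compare each file under snapshot_root with its twin under suspect_root.

    Returns (missing, differing): relative paths that are absent from the
    suspect tree, and relative paths whose contents do not match.
    Both root paths are hypothetical; point them at wherever the drives mount.
    """
    snapshot_root, suspect_root = Path(snapshot_root), Path(suspect_root)
    missing, differing = [], []
    for src in sorted(snapshot_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(snapshot_root)
        other = suspect_root / rel
        if not other.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(other):
            differing.append(rel.as_posix())
    return missing, differing
```

Calling compare_trees with the two mount points returns the two lists. Anything in the differing list that nobody touched during the bad day deserves a closer look.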
After buying a new RAID controller, we created a new mirrored array with the new disks. Then we plugged one of the old disks into the machine's onboard controller, and began to copy the files off it. It seemed to go well, but took long enough that we had to leave it running when we left. Hopefully it will mostly finish, but there will undoubtedly be some fallout to deal with tomorrow.
All in all, I think we dodged a gigantic fucking bullet!
Update: failed to mention -- the old drives were 120 GB, the new ones are 400 GB. Unfortunately, the new controller only sees them as 128 GB. ARGH! We couldn't spend time figuring it out, though; we just had to get the box back up.
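The 128 figure is suspicious in itself. A plausible culprit (an assumption, nothing here confirms it) is the old 28-bit LBA addressing limit: a controller, BIOS, or driver without 48-bit LBA support can address only 2^28 sectors of 512 bytes each, which works out to exactly 128 GiB. The arithmetic, with made-up helper names:

```python
SECTOR_BYTES = 512   # classic ATA sector size
LBA28_BITS = 28      # address width before 48-bit LBA was introduced


def lba28_limit_bytes():
    """Largest capacity addressable with 28-bit LBA and 512-byte sectors."""
    return (2 ** LBA28_BITS) * SECTOR_BYTES


def to_gib(n_bytes):
    """Convert bytes to binary gigabytes (GiB, 2**30 bytes)."""
    return n_bytes / 2 ** 30


limit = lba28_limit_bytes()
print(limit)            # 137438953472 bytes, about 137.4 decimal GB
print(to_gib(limit))    # 128.0

# A drive sold as "400 GB" holds 400 * 10**9 bytes, roughly 372.5 GiB,
# so a 28-bit-limited controller would expose only about a third of it.
print(round(to_gib(400 * 10 ** 9), 1))   # 372.5
```

If that is what is going on, the usual remedies are a controller firmware or driver update that adds 48-bit LBA support. Whether this particular controller has one is not something this post can say.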
Update, next day: so, the Users directory didn't get copied all the way. The copy died in the middle of the night. I got most of it copied by hand today, and will finish it up tomorrow. Everything seems to be working. I am still waiting for the other shoe to drop.
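A copy that dies overnight is less painful when it can be rerun and pick up where it left off. The following is a minimal sketch of that idea, not what was actually used here. The function name resumable_copy is hypothetical, "already copied" is assumed to mean "destination exists with the same size", and errors on individual files are collected instead of stopping the run.

```python
import shutil
from pathlib import Path


def resumable_copy(src_root, dst_root):
    """Copy files from src_root into dst_root, skipping ones already done.

    A file counts as done when the destination exists with the same size
    (an assumption; a partially written file of equal size would be missed).
    Returns (copied, skipped, failed) lists of relative paths; each failed
    entry is a (path, error message) pair.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied, skipped, failed = [], [], []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        try:
            if dst.is_file() and dst.stat().st_size == src.stat().st_size:
                skipped.append(rel.as_posix())
                continue
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copies contents plus timestamps
            copied.append(rel.as_posix())
        except OSError as exc:
            # Record the problem and keep going with the next file.
            failed.append((rel.as_posix(), str(exc)))
    return copied, skipped, failed
```

Two caveats for a Users share on a Windows file server: this sketch does not recreate empty directories, and shutil.copy2 does not carry over file owners or ACLs on Windows, so permissions would still need to be reapplied separately.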