Oct 20, 2007 02:24
Well, today didn't go very well. The only thing I actually had to achieve today was to install a motherboard into a new case, get an operating system and a database server on it, and have it act as a backup for the main server until Monday, when their roles would be switched.
I got the hardware part out of the way without any problems, but got a little stuck trying to install Windows 2000 since it didn't recognise our SATA hard disk. I tried another CD that Graham had put together which had SP4 built in, which managed to install, but not to actually work. It crashed every few minutes, completely randomly, and it was clear that although the installation had 'succeeded', it wasn't any good.
So, we went back to the original Windows 2000 disk, but tried to put the SATA drivers on a floppy disk for the installation to load. Except that I didn't have a spare floppy drive around, apart from one in a very old machine that I couldn't seem to remove. In the end I just lifted the entire case up beside the new one, and connected the drive in place. Then I went back to my own computer (the only other one with a floppy drive) to make the disk, and found my drive couldn't read or write to it.
So, I gave up on that and installed XP instead. It seemed to go alright from there, until we tried to get an up-to-date copy of the database onto the newly prepared machine. My idea was to take the MySQL data directories and my.ini file from the machine it was to be replacing, switch replication back on, and let it catch up. That would've worked, except that the reason we were replacing the backup server in the first place was that its drive controller had failed, and it had damaged the main InnoDB data file. It still almost worked, apart from one damaged table. I decided that for now (until we could bring the other server down to get a copy of its data) we could just drop the table and recreate its structure, so that in an emergency, at least the client software would be able to connect to it.
So, I made sure I was looking at the correct server, and dropped the table. A few seconds later, one of the girls came in from the next office and told me every client machine had just failed. I was sure I'd only dropped it on the one they weren't using... then I realised: replication. The two servers were still replicating each other, and so I'd just dropped the orders table on both.
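With hindsight, the drop could probably have been kept local by turning off binary logging for that one session first, so the statement never reached the other server. This is only a rough sketch of the idea: `local_only` is a made-up helper name, it just builds the statements rather than running them, and it assumes the account has the SUPER privilege that MySQL requires for changing `sql_log_bin`.

```python
from typing import List


def local_only(statement: str) -> List[str]:
    """Wrap a statement so this session does not write it to the binary log.

    Hypothetical helper: it only returns the SQL to run, in order.
    Changing sql_log_bin affects the current session only and needs SUPER.
    """
    return [
        "SET sql_log_bin = 0",   # stop logging for this session
        statement.rstrip(";"),   # the change that must stay on this server
        "SET sql_log_bin = 1",   # turn logging back on afterwards
    ]


# Print the statements that would be typed into the mysql client.
for sql in local_only("DROP TABLE orders;"):
    print(sql + ";")
```

Stopping the slave threads on both machines before touching anything would have been another way to do it, but either way the point is the same: with two servers replicating each other, nothing destructive is ever "just on this one".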
Needless to say, the rest of the night didn't go well. I did it sometime around 8pm, and it's now almost 3am the next morning, and we're still here clearing up the mess. We restored a backup snapshot created that morning, but everything between then and when I dropped the table is gone. I did get a copy of it before we restored the backup, but it's just binary logs, and I really have no idea what to do with them - I don't have a known state to go back to and replay them from. Getting the system back online took about an hour in all, but the aftermath of the chaos the downtime caused went on for much longer. Friday night is very busy here, and all the orders were just being scribbled down on paper. Orders were lost, mistakes were made, customers got angry... because I made a stupid mistake.
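For what it's worth, the usual way to use binary logs like these is to replay them on top of a backup with `mysqlbinlog`, stopping just before the bad statement. The sketch below only builds that command line; it assumes the morning snapshot lines up with a known time, which is exactly the part I'm unsure about. The file name and both timestamps are placeholders, and `replay_command` is a made-up helper.

```python
import shlex
from typing import List


def replay_command(binlogs: List[str], start: str, stop: str) -> List[str]:
    """Build a mysqlbinlog command covering the window between start and stop.

    start should match the moment the backup snapshot was taken;
    stop should fall just before the DROP TABLE was issued.
    """
    return [
        "mysqlbinlog",
        f"--start-datetime={start}",
        f"--stop-datetime={stop}",
        *binlogs,
    ]


# Placeholder values: the real log file names and snapshot time are unknown.
cmd = replay_command(
    ["binlog.000001"],
    "2007-10-19 09:00:00",
    "2007-10-19 19:59:00",
)

# The output of mysqlbinlog is SQL, which would be piped into the mysql client.
print(shlex.join(cmd) + " | mysql")
```

It would be safest to send the output to a file first and read through it before feeding anything back into the live server.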
Now I'm here on my own time to get everything sorted out so that normal operation can resume tomorrow. Graham is here too, though he doesn't have to be; at least he's getting paid for it, 'cause it wasn't his fault.
cali,
mistakes,
stupidity,
work