I thought people might be interested in what I'd been up to lately.
Plus I need to rant about it a little. If you are upset by ranting or techspeak, now would be the time to turn away. *grin*
So. Last week, about Tuesday or so, whilst attempting to resolve an issue with the log management[0], I noted that one of the two home-drive servers had a failing disk.
Since these are redundant only in that there are two servers, they're running out of disk space as a result of the growing logfile size, and there's a big fat bug in the drbd code in the version we were running, I got a couple of extra disks and started building a pair of replacements. Note: started. That means I spent until ten pm or so on each of Wednesday, Thursday, and Friday trying to get these into a state where I could put them in ASAP.
On Saturday, it being the only fine day we had this weekend, at least according to the weather forecasters[1], we trundled off as a family to the fair, and spent the day wandering around the fair and having fun. If I remember correctly, we went in, watched Tornado have a go on a real steam train (albeit only about 3ft high, including the rails it was sitting on) and then asked the nice gentleman driving it what all the levers and knobs did. Both Tornado and I were, candidly, fascinated. I should note here that prior to him going on the train, the guys running it were adjusting the rails to get the train to run flat. When on the train, Tornado leaned over the side to look at the rails. At the right spot.
I was much amused. His mother, on the other hand, seemed to think it was unsurprising; "He's one of you lot - I thought it was obvious by now" was the gist of her slightly paraphrased comment on the matter. *grin*
After that, we wandered around, past the archery stall (Tornado had a go, and missed with every shot; he did get the idea of how to fire it, though, with almost no help, which was good) and a bunch of other stalls; $wife sent me and himself up the tower to come down the slide on doormats; Tornado got left behind at the first corner, where he got stuck, but he enjoyed it nonetheless. After that, we sat down and watched a clown and acrobat act in a smallish big top[2], which was entertaining; got himself on the suspended trampoline thing (they put you in a harness and winch you up so you're light, and then you get to bounce... lots of fun, except he's so light he wasn't reaching the trampoline on the downstroke, as it were, without assistance...) and got a few photos taken of him by the local newspaper guy, who happened to be going past[3] and let me wander off to get a hat[4]... And we went on the ferris wheel and spent some time looking at how the wheel itself was put together, and finished the day off with smoothies.
All in all, a great day out.
So much for the fun.
After I got home, I had a look later that evening at the servers. The server with the failing disk.. had failed totally. Yay. Remote reboot didn't bring it back, so we're left hoping like hell it can survive until I can get in the next day. By this time I've spent two and a half hours trying to recover things, without success. Bedtime, and get up on Sunday and head into the colo to fix issues.
Get into the colo, get the system working on the redundant pair, so we can then remotely log in and sort issues, and head on back to the office to finish the server build. Get there around 12:15 or so. Log into the servers from there, fix the current fault, and start the build. About 9pm, finish the build (heartbeat, drbd, NIS, NFS, all with redundant failover. Yay for complex systems) and take the disks down to the colo. Get things set up, check people can log in, and head home. Get home, and discover the check didn't work; I can log into the blades, but nothing else. The remote serial console for the Sun boxes is down; my cow-orker is supposed to have fixed that, but either he failed to do so, or it broke again. My money is on the former, but he's on holiday this week, so I can't bollock him in person.
Fix what I can, and head on back to bed; again, after 1am. By this time, $wife is not happy with the time she's been stuck with Tornado[5] so I offer to take him with me the next day, since it's a trivial fix; it just requires physical presence, basically.
Next day, check permission for Tornado in the colo[6] and head on over there. Sort out the issues, bounce one of the objectionable servers, and call up my other cow-orker[7] to confirm things are fixed, then head on home.
Get there to discover things aren't fixed. After some poking, discover yet another problem on the server I restarted, fix it, and all the problems go away.
So... the rest of my week has been either cleaning up problems that are mostly resolved, or sitting there whilst people go over and over the same details trying to make it look as good as possible for the clients - and then go over them again when, due to removing half the details, the client gets the wrong end of the stick and promptly wants to know why one singular disk failure can take down the entire system for the weekend...
Arguably, they have a point. On the other hand, as noted above, it wasn't one failure, but six: the disk, the failover to the other server, the ongoing failure of one of the main servers, the serial console device not being fixed, the other main server also failing in the same way, and, finally, the startup scripts not being set up on one of the new sharefeed database servers.
After all that, I need a holiday.
[0] The logfiles had reached 3Gb/day[8], which is large enough that we need ~11Gb free space to merge them, and they were taking ~10 hours to resolve the IP addresses, and then another 9 hours to process into reports...
[1] For reasons that will become obvious, I had little chance to compare the forecast with reality...
[2] Yes, I realise that sounds contradictory.
[3] Copies of these may be forthcoming, depending on how much we like you. ;-]
[4] To go with my sunglasses, of course.
[5] I can't blame her, and I wasn't too happy with it myself, but I digress...
[6] Technically, they "don't have insurance for children, but as long as he's in and out, it's ok. Just... don't say anything, k?" more or less. IOW, don't ask, don't tell...
[7] The windows one. At least he's around...
[8] I resolved another issue on Tuesday, where someone put some broken JS code on a homepage, which caused a 500k hits/month site to receive 6.5m requests/day for the homepage. Logfiles are now back down to 1.5Gb or so...