From the desk of... DEE DEE DEE!: guinnessduck

Sep 28, 2006 19:00

WARNING: Rant and stupidity ahead.

I was gonna post about this yesterday - in the middle of the shit storm - but now that it's through, I can tell the whole story.

Monday we got an email saying that an entry in MDB (our database of machines - servers and workstations - all unix/linux, plus data-center windows hosts) was corrupted. We ignored it till Tuesday (the email came in close to end-of-day), then we found quite a few in disarray - 20% of them. Due to the nasty looking output, it looked like the DB was eating things, randomly.

No one really owns this service, but because my side of things (Unix/Linux) relies on it so much (to build login access, elevated privs access, etc), most groups just assume it's something I handle. The boss of the only guy who actually does know more than I do about MDB came in and tried to just hand it off to me. Since I was more concerned about getting things squared away, I just sort of waved my hand at him and said I was working on it. Then it occurred to me I'd been dumped a broken pile of shit. GAH. Quick email to my boss fixed that, "I ain't takin' the bullet for this!"

I emergency restored a backup from Friday, pulled records for the now corrupted entries and perl-fu'd a cleaner to overwrite only those against the live DB. I left for the day saying that the nightly update should bring those older records current.
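
For anyone wondering what "overwrite only those" looks like, here's a rough sketch of the idea. It's Python rather than Perl, and it is not the actual script: the dict-of-dicts layout and the names (`restore_corrupted`, `live`, `backup`, `corrupted_ids`) are made up for illustration and say nothing about how MDB really stores things.

```python
def restore_corrupted(live, backup, corrupted_ids):
    """Overwrite only the corrupted entries in `live` with their backup copies.

    `live` and `backup` are hypothetical stand-ins for the two databases:
    plain dicts mapping a record id to a dict of fields.
    Returns the ids that were actually restored.
    """
    restored = []
    for rec_id in corrupted_ids:
        if rec_id in backup:
            # Copy the backup record so later edits to `live` can't touch `backup`.
            live[rec_id] = dict(backup[rec_id])
            restored.append(rec_id)
        # Ids missing from the backup are left alone rather than guessed at.
    return restored
```

The point is that healthy records never get touched, so anything updated since Friday stays current and only the restored entries are waiting on the nightly update.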

Today I strolled in and... back to all dorked. Everyone's flipping out. I'm point on this damn thing and I have a half dozen people in my office drawing flowcharts and shit. Someone remembers that some service just came online a week ago and it directly talked to the DB, "Oh, but that can't be it - it only updates one field!" "Yes, well, everything else has been ruled out, and the database has been wiped then rebuilt from flat files. It's the only thing that matches the 'about 200 entries last week to the number of entries today' pattern. Test it."

Lo and behold, we verified our culprit. Their method of doing that *one* field update is to read the whole record in, change the field and write the whole record back. In the process they're stripping new-lines and neglecting to clear out the old data before reading the next record. So fields are bleeding into the next record whenever that next record's field is blank, and multi-line values are getting concatenated into a single line. Awesome sauce. That's two text-book coding mistakes in one!
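
Here's a minimal Python sketch of how those two mistakes combine, next to a version that avoids them. I never saw their code, so this is a guess at the shape of the bug and not their implementation; the function names and the records-as-dicts layout are assumptions for illustration only.

```python
def buggy_update(records, field, value):
    """Reproduce both mistakes: a reused buffer and stripped newlines."""
    buf = {}  # shared across records and never cleared
    out = []
    for rec in records:
        for name, text in rec.items():
            if text:  # blank fields are skipped, so the previous record's value survives
                buf[name] = text.replace("\n", "")  # multi-line values get mashed together
        buf[field] = value
        out.append(dict(buf))  # "write the whole record back"
    return out


def safe_update(records, field, value):
    """Start from a fresh copy of each record and change only the one field."""
    out = []
    for rec in records:
        fresh = dict(rec)  # nothing carries over from the previous record
        fresh[field] = value
        out.append(fresh)
    return out
```

Feed `buggy_update` a record with multi-line notes followed by a record with blank notes, and the second record comes back wearing the first one's notes with the line breaks gone. `safe_update` leaves every other field exactly as it was.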

At least they caved when I told them to test it. Oh, and the reason why this didn't get any testing before going live? They didn't think there was a development instance of the service. Well, there wasn't per se, but we weren't even asked. If we had been we'd have set them up with their own personal dev instance since it's so easy to do.

Oh yeah. It's a Blarney Stone night, alright. Time to get my Guinness on.

stupidity, bad code, work