Just a typical day for a Dagard

Aug 11, 2011 20:21

Customer decides "Hey, while we're upgrading our back-end switch, why not take the time to make our cables all pretty?"

Ok, moron, sure, whatever. Why are you talking to me again?

"So, we shut down all 24 nodes"

That's .... an interesting choice. Go on.....

"And then we spent like 2 hours getting all the power, IB and Ethernet cables pretty. PRETTY"

Ok.....

"Now, when we boot up, things don't work!"

Oh, well, yes, I can see how you might have a concern. How many have you brought up?

"Three, only one has come up right."

Gimme a second here..... ok, so you're running really, I must admit, old hardware. Ok. Repeat after me: 'The worst thing I can do to a computer is turn it on'

"Why?"

Just is. Okay. First off. No panicking. We're going to iterate through the nodes, you're going to plug Mr Happy Serial cable into each of them, and let me watch each of them boot.

"That'll take toooooo loooooong!!"

Well, you probably should have told us you were doing this, first, so we could disable the 200ish alerts you autogenerated, advise you on the best course of action, etc etc etc. But, whatever, we're here now. So go, monkey, do it. DO IT.

After, oh, 2 hours and change, 19 of the 24 are back up (our boot loader is SLOOOOOOOOOOOW). Kick ass. We have quorum, we're not completely out of the woods yet, but at least we're in the young growth trees.

--- later ---

After poking, inside-googling, outside-googling, it's the damnedest fucking thing that gets it past where the 5 boxen were hanging.

JUST KEEP REBOOTING (in verbose mode)

It's, near as I can figure, some initialization race condition between ACPI and multiple CPUs. Oh, and their CMOS batteries are shot, so every node thought it was, like, 1384 or some shit, but that's secondary. But, no hardware died, no data integrity issues (yay), customer eventually happy, everybody wins. Even if I'm now having to work to get back into SLA on my actual issues, since I fucked away a good hunk of the day doing that.

Oh, and writing a Perl script to do some inode inventorying. And mentoring CK (new hire). And dealing with a certain company who fails at.... a LOT of things. And glaring hard at ESX 4.1, trying to figure out why it hates our customers. And discussing work things with Helix. And trying to learn what proper SMBv2 data flow looks like, since we've now really got nobody who's an expert on it.

At some point I had macaroni, too.

So hi, how was your day?