
Aug 19, 2006 18:27

I think I have fallen into one of the great traps of release engineering and/or site administration. It'd be almost comical if it wasn't... No, I take that back. I think it's pretty funny.
A little context (and since we're still in stealth-mode, it'll be very little context): We're a small startup. We're building a service. Naturally, we hope it will be very popular, so one of our watchwords is "scalability." Some of us have seen what happens when your popularity outstrips your ability to scale fast enough. For those of you who haven't, it's a nightmare. Part of my job is to make sure we can bring up new servers just as fast as we need them without working 60-hour work weeks, and part is to push software updates to the servers as they become available.
Because we hope the number of servers will someday soon be quite large, because we want the updates to be delivered consistently, and because I'm lazy 1, I'm all about the automation. As small as we are, there are already too many systems for me to maintain by hand if I want to get any other work done, like, for example, making sure I have automation in place to allow me to take on management of even more servers. When this thing takes off, we will be automated or we will require vast armies of specially-trained monkeys to keep us from drowning. While trained monkeys can be cool, I don't see the VCs going for it. Besides, our building seems to have rules against animals. But I digress.
Because in my experience most engineers don't grok automation or, more to the point, automatability 2, I started early with some fairly stubborn guidelines about what I would and would not do to roll software out, so the engineers would get used to it from the beginning. This means I regularly have conversations with various members of engineering about hard-coding practices, configurability and environmental expectations. And that's OK. Occasionally, these conversations are somewhat heated. That's less OK but not unexpected, and we generally come out the other side relatively unscathed. We're all on the same team, and we're all working toward the same goal even if we disagree on some of the details.

Now here's how the trap plays out.
As will happen from time to time in any environment, we have a component that is acting up. Of course, as Murphy and our environment would have it, this component is only failing on the production servers. Of course, I continue to deny engineering access to the production servers, As God Intended. Ok, that's not entirely true. I do currently allow our Chief Architect access for debugging purposes, but we have An Agreement: Directly changing anything outside of the product's directories is detrimental to progress so don't do it, and anything changed inside the product's directories will be overwritten by the next update, so make sure those critical changes get into the repository. She's on board with my reasons, and so I allow her access. But only her.
As will happen from time to time, we had a couple of important demos approaching. And yet this component was acting up. Of course, this component was important to one of the pieces that is regularly shown off during demos, so stability mattered. So our Chief Architect put together a watcher. Since the only place the watcher could be tested was on the production system, she did her building and testing there, all the while assuring me that we would have a chance to check it into the repository before the next update, which was a while off.
As will happen from time to time, she was pulled away to another very important task before she had a chance to move the watcher into the repository. But that was ok because the next update was a while off.
As will happen from time to time, marketing and the CEO told one of the Team Leads that one of their changes to a different component needed to be rolled into production. Of course, this update was very important for the upcoming demos so it had to happen right away. Of course, this information made its way to me the day before the demo. But that was ok because we've gone through this update process dozens of times by now.
Now here's the conflict: The product is composed of a number of separate but interdependent components. Because they are so interdependent, my automation performs a number of operations on all the components as a whole. For example, branching. For another example, updating the servers from the repository. Also, since one doesn't want little turds from the previous version stinking up the place and potentially interfering with the new version, my automation cleans the slate before installing an update. This operation was agreed to long ago, so we have done it this way for quite some time.
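(For the terminally curious, the clean-slate part of the update boils down to something like the sketch below. The component names, paths and svn export call are all made up for illustration; the real scripts are longer and uglier, but the shape is the same: wipe each component's directory, then reinstall every component from the same repository tag as one unit.)

```python
#!/usr/bin/env python
# Hypothetical sketch only -- not our actual update scripts.
# Assumed layout: every component installs under INSTALL_ROOT/<component>,
# and all of them are exported from the same repository tag in one pass.

import shutil
import subprocess
from pathlib import Path

INSTALL_ROOT = Path("/opt/product")          # assumed install root
REPO_URL = "svn://repo.example.com/product"  # assumed repository location
COMPONENTS = ["frontend", "middleware", "watcher", "reports"]  # illustrative names


def clean_slate_update(tag):
    """Wipe every component directory, then reinstall all of them from `tag`.

    The product is treated as one unit: no component is skipped, which is
    exactly why anything installed by hand gets clobbered on the next update.
    """
    for component in COMPONENTS:
        target = INSTALL_ROOT / component

        # 1. Remove whatever the previous version (or a helpful human) left behind.
        if target.exists():
            shutil.rmtree(target)

        # 2. Export a pristine copy of the component from the repository tag.
        subprocess.run(
            ["svn", "export", f"{REPO_URL}/tags/{tag}/{component}", str(target)],
            check=True,
        )


if __name__ == "__main__":
    clean_slate_update("demo-2006-08")  # illustrative tag name
```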
So we have a component that must be updated in support of the demo, and automation that will wipe out the manually-installed thing currently propping up another component that is also important to the demo. Well, this would probably be a learning experience for someone. I hoped it wouldn't be me.
TL said we needed to update. I said to check with the CA because I couldn't update without clobbering her hack. TL asked if we could update everything but the CA's component. No, because the automation assumes it's ok to clobber the install directory. TL suggested I temporarily kluge the automation to update around the CA's component. But I was already juggling two versions of the automation because we were halfway through migrating people through the effects of a major code merge. I really didn't want to throw a third version into the mix. TL suggested I perform the update by hand. No! Are you kidding? Aside from being a pain in the ass, the automation is there so I can get consistent results and eliminate One More Thing That Can Go Wrong.
There are a few other arguments on this point that flashed through my head, but as our language is the rather inefficient thing that it is, I didn't have time to voice them. One such argument is that I believe it is during crunch times when people are overloaded and rushed that we need process and procedure most. People make mistakes and bad judgement calls under the best of circumstances. We are only more prone to make them when we're under stress. Generate and review your procedures when you're calm and rational. Stick to those procedures as closely as you can when in the thick of things. Wait to review the procedures and figure out how to adapt them until after the crisis has passed and you've had time to calm down (and sleep, if necessary). If you throw the whole book out the window the first time it gets in your way, you are bound to get burned sooner or later no matter how well you think on your feet.
Another such argument is that I don't want anyone to get used to the idea of any production servers running in a state that cannot be automatically replicated. There are two parts to this. 1) Any step that is performed by hand and not the automation is a step that can be forgotten or skipped accidentally. In the case of a system being brought back up after a calamity (server catches fire or falls victim to random gunfire 3), those steps are far more likely to be forgotten. 2) Any step that must be performed by hand is a step that will not be performed when the server farm grows above 20, and it cannot be performed by hand when the server farm grows above 100 without the aforementioned highly-trained monkeys. So any time we're running in such a dirty state, I am highly motivated to get us back to a properly-updated state and to ensure that all previously manual steps have been incorporated into the automation.
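(If I were to reduce "dirty state" to code, it would be something like a drift check: compare what's actually sitting on a server against a manifest written by the update automation, and complain about anything that got there by hand. Again, a hypothetical sketch with made-up paths and manifest format, not our actual tooling.)

```python
# Hypothetical drift check -- flag anything the automation didn't put there.
# Assumes the update automation writes a manifest of "relative/path<TAB>sha1" lines.

import hashlib
from pathlib import Path

INSTALL_ROOT = Path("/opt/product")  # assumed install root


def file_digest(path):
    return hashlib.sha1(path.read_bytes()).hexdigest()


def load_manifest(manifest_path):
    """Read the 'relative/path<TAB>sha1' lines produced by the last update."""
    manifest = {}
    for line in manifest_path.read_text().splitlines():
        rel_path, digest = line.split("\t")
        manifest[rel_path] = digest
    return manifest


def find_drift(manifest):
    """Return files that were added or changed by hand since the last update."""
    drift = []
    for path in INSTALL_ROOT.rglob("*"):
        if not path.is_file():
            continue
        rel = str(path.relative_to(INSTALL_ROOT))
        if rel == ".manifest":
            continue  # the manifest itself isn't drift
        expected = manifest.get(rel)
        if expected is None:
            drift.append("unexpected file: " + rel)
        elif file_digest(path) != expected:
            drift.append("modified file: " + rel)
    return drift


if __name__ == "__main__":
    for problem in find_drift(load_manifest(INSTALL_ROOT / ".manifest")):
        print(problem)
```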
Instead of bringing up all that, I simply argued that it would take around 15-30 minutes to set up for such a manual update (instead of the usual 1-2 minutes to perform the update), but, more than that, we were getting into larger and larger violations of process and it was time for me to dig in my heels. The right thing is for the CA to get her hack into the repository and let me push it back out properly. Now go away.
A few minutes later, the TL returned with the CA, who, after listening to my protestations, explained rather loudly that she didn't have time to get her code into the repository. It was imperative that we get the update out, but it was equally imperative that her hack not be impacted by it. She pointed out that I was hired because I'm smart enough to figure out how to do things like this, so figure it out and do it. Then she stormed off.
The issue was never about ability. The problem is... Well, we'll get to that in a bit. But since I report to engineering, I had my orders. I told the TL it'd take a while and I'd let him know when I was done. After I wrapped that up, I went back to what I was working on before. A few hours later, the CA came back and calmly explained that she didn't like the rigged state we were in either, and she assured me that as soon as the demos were done in a couple days, we could do a clean re-update. Shortly before leaving for the evening, I was asked to do one more update. It took about fifteen minutes to perform the manual update, and another ten for various parties to give it a sniff-test so I could go home. It all looked good. I left.
The demos went fine.
The day after the demos was pretty full, as usual. A couple of medium-level crises, 2-3 time-critical administration tasks, a few urgent bug fixes brought on by one of those, a few ad-hoc conversations about OS-level stuff, etc. With about an hour left in the day, I started looking at what it would take to get the CA's hack into the repository so we could do a clean update. After a few moments, I realized that it wasn't a 15-minute thing, and since the CA and I were both a bit fried (long day, long week), doing an update on a production system after everyone had gone home for the weekend was probably not the brightest idea anyway. The proper course of action would depend on when the next demo was going to be. So I asked the acting VP of Engineering for some insight on the next week's schedule, explaining that the goal here was to get back into a state that was once again easily updatable and replicable and laying out our options for getting there:
  • Re-update as things stand. The CA's hack would go away and her component would go back to being unstable, but at least we'd be in a replicable/updatable state.
  • Stay late and integrate the CA's hack, then update. We'd be in a replicable/updatable state before any demos that might take place early in the week, but that state might not work properly, and I'd have to bug people after hours to test it. If it broke production, we'd have to declare an after-hours emergency. (Not as dramatic as it sounds at our stage, but still widely annoying.)
  • Integrate and update over the weekend. Better than the second option, but only because the interrupts would happen over the weekend rather than on friday night.
  • Relax over the weekend, then integrate/update on monday. Only really a problem if we want to demo on monday or tuesday.
VPoE: "So why don't you want to update tonight?"
me: "Because there's always a possibility that things may go Horribly Wrong during the update, and if they do, I don't want to drag everyone back in tonight to get it fixed."
VPoE: "Why is monday better?"
me: "Because if something does happen to go Horribly Wrong, everyone will already be here to help fix it."
VPoE: "Well, given that we have a demo coming up on thursday, you're concerned that things might go Horribly Wrong during an update and things are stable now..."

(Anyone? Anyone?)
VPoE: "...I think we should leave it the way it is until next friday."

It keeps reminding me of a scene from The West Wing:
CJ: Duchamp was the father of Dadaism.
Toby: I know.
CJ: The da-da of Dada.
Toby: It's like there's nothing you can do about that joke. It's coming, and you just have to stand there.

It also makes me think I should either be more careful about voicing my cynicism or stop telling people how we make the sausage. I think we'll get it sorted out to my satisfaction on monday. Until then, I just have to laugh.

1 I consider cynicism and laziness to be important features for anyone wanting to be effective in my line of work. I think I'm pretty good at my job.
2 Additionally, I'm somewhat fascinated at how quickly a software project can grow to the point where very few engineers know what it takes to install it. I suppose this is just another artifact of standard engineering hyperfocus, but at a gut level I still find it strange when the people involved in building something don't know how to make the whole thing work.
3 In truth, we're fortunate to be located in a good area of the colo where gunplay is pretty rare.