I think I have fallen into one of the great traps of release engineering
and/or site administration. It'd be almost comical if it wasn't... No,
I take that back. I think it's pretty funny.
A little context (and since we're still in stealth-mode, it'll be very
little context): We're a small startup. We're building a service.
Naturally, we hope it will be very popular, so one of our watchwords is
"scalability." Some of us have seen what happens when your popularity
outstrips your ability to scale fast enough. For those of you who
haven't, it's a nightmare. Part of my job is to make sure we can bring
up new servers just as fast as we need them without working 60-hour
weeks, and part is to push software updates to the servers as they
become available.
Because we hope the number of servers will someday soon be quite large,
because we want the updates to be delivered consistently, and because I'm
lazy[1], I'm all
about the automation. As small as we are, there are already too many
systems for me to maintain by hand if I want to get any other work done,
like making sure I have automation in place to allow me to
take on management of even more servers. When this thing takes off, we
will be automated or we will require vast armies of specially-trained
monkeys to keep us from drowning. While trained monkeys can be cool,
I don't see the VCs going for it. Besides, our building seems to have
rules against animals. But I digress.
Because in my experience most engineers don't grok automation or, more
to the point, automatability[2], I started early with some fairly
stubborn guidelines about what I would and would not do to roll
software out, so the engineers would get used to them. This means
I regularly have conversations with various members of engineering about
hard-coding practices, configurability and environmental expectations.
And That's OK.
Occasionally, these conversations are somewhat heated. That's less ok
but not unexpected, and we generally come out the other side relatively
unscathed. We're all on the same team, and we're all working toward the
same goal even if we disagree on some of the details.
Now here's how the trap plays out.
As will happen from time to time in any environment, we have a
component that is acting up. Of course, as Murphy and our environment
would have it, this component is only failing on the production servers.
Of course, I continue to deny engineering access to the production
servers, As God Intended. Ok, that's not entirely true. I do currently
allow our Chief Architect access for debugging purposes, but we have An
Agreement: Directly changing anything outside of the product's directories
is detrimental to progress so don't do it, and anything changed inside
the product's directories will be overwritten by the next update, so
make sure those critical changes get into the repository. She's on
board with my reasons, and so I allow her access. But only her.
As will happen from time to time, we had a couple of important demos
approaching. And yet this component was acting up. Of course, this
component was important to one of the pieces that is regularly shown off
during demos, so stability mattered. So our Chief Architect put
together a watcher. Since the only place the watcher could be tested
was on the production system, she did her building and testing there,
all the while assuring me that we would have a chance to check it into
the repository before the next update, which was a while off.
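For the non-admins in the audience: a watcher like that is conceptually
just a small loop that probes the component and restarts it when it stops
answering. A minimal sketch of the idea follows; the paths, the health
probe and the restart command are placeholders I made up for illustration,
not what she actually built.

    #!/usr/bin/env python3
    # Sketch of a component watcher. Everything here (paths, probe,
    # restart command, interval) is a hypothetical placeholder.
    import subprocess
    import time

    CHECK_CMD = ["/opt/product/bin/component", "--ping"]      # assumed health probe
    RESTART_CMD = ["/opt/product/bin/component", "--restart"]  # assumed restart hook
    INTERVAL = 30  # seconds between probes

    def healthy():
        # The probe exits 0 when the component is answering.
        return subprocess.call(CHECK_CMD) == 0

    while True:
        if not healthy():
            # The component stopped responding: kick it and note the event.
            subprocess.call(RESTART_CMD)
            print(time.ctime(), "restarted component")
        time.sleep(INTERVAL)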
As will happen from time to time, she was pulled away to another very
important task before she had a chance to move the watcher into the
repository. But that was ok because the next update was a while off.
As will happen from time to time, marketing and the CEO told one of the
Team Leads that one of their changes to a different component needed to
be rolled into production. Of course, this update was very important
for the upcoming demos so it had to happen right away. Of course, this
information made its way to me the day before the demo. But that was ok
because we've gone through this update process dozens of times by now.
Now here's the conflict: The product is composed of a number of separate
but interdependent components. Because they are so interdependent,
my automation performs a number of operations on all the components
as a whole. For example, branching. For another example, updating
the servers from the repository. Also, since one doesn't want little
turds from the previous version stinking up the place and potentially
interfering with the new version, my automation cleans the slate before
installing an update. This operation was agreed to long ago, so we have
done it this way for quite some time.
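To make the clean-slate step concrete, it boils down to "remove the
install tree, then lay the new version down fresh from the repository."
Here's a rough sketch of the idea; the install path, the repository URL
and the choice of svn are placeholders for illustration, not our actual
setup.

    #!/usr/bin/env python3
    # Rough sketch of a clean-slate update. The install path, repository
    # URL and use of svn are hypothetical placeholders.
    import shutil
    import subprocess

    INSTALL_DIR = "/opt/product"                        # assumed install tree
    REPO_URL = "svn://repo.example.com/product/trunk"   # assumed repository

    def update():
        # Wipe everything left over from the previous version so nothing
        # stale can interfere with the new one...
        shutil.rmtree(INSTALL_DIR, ignore_errors=True)
        # ...then export the current version fresh from the repository.
        subprocess.check_call(["svn", "export", REPO_URL, INSTALL_DIR])

    if __name__ == "__main__":
        update()

Which is, of course, exactly why anything hand-built inside the install
tree, like the CA's watcher, disappears on the next update.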
So we have a component that must be updated in support of the demo
and automation that will wipe out the manually-installed thing that is
currently keeping up another component, which is also important to the
demo. Well, this would probably be a learning experience for someone.
I hoped it wouldn't be me.
TL said we needed to update. I said to check with the CA because I
couldn't update without clobbering her hack. TL asked if we could update
everything but the CA's component. No, because the automation assumes
it's ok to clobber the install directory. TL suggested I temporarily
kluge the automation to update around the CA's component. But I was
already juggling two versions of the automation because we were halfway
through migrating people through the effects of a major code merge.
I really didn't want to throw a third version into the mix. TL suggested
I perform the update by hand. No! Are you kidding? Aside from being a
pain in the ass, the automation is there so I can get consistent results
and eliminate One More Thing That Can Go Wrong.
There are a few other arguments on this point that flashed through my
head, but as spoken language is the rather inefficient thing that it is,
I didn't have time to voice them. One such argument is that I believe
it is during crunch times when people are overloaded and rushed that
we need process and procedure most. People make mistakes and
bad judgement calls under the best of circumstances. We are only more
prone to make them when we're under stress. Generate and review your
procedures when you're calm and rational. Stick to those procedures as
closely as you can when in the thick of things. Wait to review
the procedures and figure out how to adapt them until after the crisis
has passed and you've had time to calm down (and sleep, if necessary).
If you throw the whole book out the window the first time it gets in
your way, you are bound to get burned sooner or later no matter how well
you think on your feet.
Another such argument is that I don't want anyone to get used to the idea
of any production servers running in a state that cannot be automatically
replicated. There are two parts to this. 1) Any step that is performed
by hand and not the automation is a step that can be forgotten or skipped
accidentally. In the case of a system being brought back up after a
calamity (server catches fire or falls victim to random gunfire[3]), those
steps are far more likely to be forgotten. 2) Any step that must
be performed by hand is a step that will not be performed when the server
farm grows above 20, and it cannot be performed by hand when the server
farm grows above 100 without the aforementioned highly-trained monkeys.
So any time we're running in such a dirty state, I am highly motivated to
get us back to a properly-updated state and to ensure that all previously
manual steps have been incorporated into the automation.
Instead of bringing up all that, I simply argued that it would take
around 15-30 minutes to set up for such a manual update (instead of
the usual 1-2 minutes to perform the update), but more than that, we were
getting into larger and larger violations of process and it was time for
me to dig in my heels. The right thing is to get the CA to get her hack
into the repository and let me push it back out properly. Now go away.
A few minutes later, the TL returned with the CA, who, after listening
to my protestations, explained rather loudly that she didn't have time
to get her code into the repository. It was imperative that we get the
update out, but it was equally imperative that her hack not be impacted
by it. She pointed out that I was hired because I'm smart enough to
figure out how to do things like this, so figure it out and do it.
Then she stormed off.
The issue was never about ability. The problem is... Well, we'll get
to that in a bit. But since I report to engineering, I had my orders.
I told the TL it'd take a while and I'd let him know when I was done.
After I wrapped that up, I went back to what I was working on before.
A few hours later, the CA came back and calmly explained that she didn't
like the rigged state we were in either, and she assured me that as soon
as the demos were done in a couple days, we could do a clean re-update.
Shortly before leaving for the evening, I was asked to do one more update.
It took about fifteen minutes to perform the manual update, and another
ten for various parties to give it a sniff-test so I could go home.
It all looked good. I left.
The demos went fine.
The day after the demos was pretty full, as usual. A couple of
medium-level crises, 2-3 time-critical administration tasks, a few urgent
bug fixes brought on by one of those, a few ad-hoc conversations about
OS-level stuff, etc. With about an hour left in the day, I started
looking at what it would take to get the CA's hack into the repository
so we could do a clean update. After a few moments, I realized that it
wasn't a 15-minute thing, and since the CA and I were both a bit fried
(long day, long week), doing an update on a production system after
everyone had gone home for the weekend was probably not the brightest
idea anyway. The proper course of action would depend on when the next
demo was going to be. So I asked the acting VP of Engineering for some
insight on the next week's schedule, explaining that the goal here was to
get back into a state that was once again easily updatable and replicable,
and laying out our options for getting there:
- Re-update as things stand. The CA's hack would go away and her
component would go back to being unstable, but at least we'd be in a
replicable/updatable state.
- Stay late and integrate the CA's hack, then update. We'd be in
a replicable/updatable state before any demos that might take place
early in the week, but that state might not work properly, and I'd have
to bug people after hours to test it. If it broke production, we'd have
to declare an after-hours emergency. (Not as hugely dramatic as that
at our stage, but still widely annoying.)
- Integrate and update over the weekend. Better than the second
option, but only because the interrupts would happen over the weekend
rather than on friday night.
- Relax over the weekend, then integrate/update on monday. Only
really a problem if we want to demo on monday or tuesday.
VPoE: "So why don't you want to update tonight?"
me: "Because there's always a possibility that things may go
Horribly Wrong during the update, and if they do, I don't want to drag
everyone back in tonight to get it fixed."
VPoE: "Why is monday better?"
me: "Because if something does happen to go Horribly Wrong,
everyone will already be here to help fix it."
VPoE: "Well, given that we have a demo coming up on thursday,
you're concerned that things might go Horribly Wrong during an update
and things are stable now..."
(Anyone? Anyone?)
VPoE: "...I think we should leave it the way it is until next
friday."
It keeps reminding me of a scene from
The West Wing:
CJ: Duchamp was the father of Dadaism.
Toby: I know.
CJ: The da-da of Dada.
Toby: It's like there's nothing you can do about that joke.
It's coming, and you just have to stand there.
It also makes me think I should either be more careful about voicing
my cynicism or stop telling people how we make the sausage. I think
we'll get it sorted out to my satisfaction on monday. Until then, I
just have to laugh.
[1] I consider cynicism and laziness to be important features for anyone
wanting to be effective in my line of work. I think I'm pretty good at
my job.
[2] Additionally, I'm somewhat fascinated by how quickly a software
project can grow to the point where very few engineers know what it
takes to install it. I suppose this is just another artifact of standard
engineering hyperfocus, but at a gut level I still find it strange when
the people involved in building something don't know how to make the
whole thing work.
[3] In truth, we're fortunate to be located in a good area of the colo
where gunplay is pretty rare.