LJ Idol **non-fiction** Failure on a High Level

Jan 31, 2020 05:25


I don’t talk about my job very often. This is on purpose. I have always felt that a person isn’t what they do for a living. To quote from one of my favorite movies, Fight Club, “You're not your job. You're not how much money you have in the bank. You're not the car you drive. You're not the contents of your wallet. You're not your f***ing khakis. You're the all-singing, all-dancing crap of the world.”

Today I’m going to part the curtain a little, to show one small aspect of what I do for a living. All identifying information has been removed, to protect the guilty parties. I can only hope that when I’m done it still tells a readable story.

I work for a state government agency in Texas, as the overnight supervisor for the information technology department. This puts me in a tough place sometimes, with a ton of responsibility, and the unfortunate byproduct of the world sometimes falling on my head. When it comes to technology in my agency after 7pm, the buck stops with me.

The agency I work for has, as one of its many functions, a data pass-through from a three-letter federal agency down to local government agencies, counties, cities and towns across the state. This data link is required to be active 24 hours per day for reasons of “public safety”. Let me say that again: the data is required to be available 24 hours per day. The other night, that data connection failed. My bosses would say that failure here is not an option, but it failed. As a result of that failure, we began to build up a backlog of messages destined for our federal partner.

I immediately called our federal partner, because they manage the connection itself, only to be told, “Oh, yeah, we have a known issue in Texas. The vendor is working on it, but we don’t know when it will be fixed.”

This firmly established that the problem wasn’t under my control. It was on the federal side, and they had ownership of it. About that time we began to get calls from many of our city and county counterparts across the state, asking what was going on and why the federal agency wasn’t responding to their messages. So I sent out notifications that said, “Yes, the data is not available, but the federal agency is aware of it and they’re working to resolve it.”

After a couple of hours, I called them back, looking for an update. I was told they had escalated with the vendor, but still had no estimate for the time of resolution. Fine. Until they called me back. “Oh, we think it’s the router on your end. Can you reset the receive side?”

No. No, I can’t. That piece of equipment doesn’t belong to my agency; it belongs to the federal partner. I don’t have access to it to reset anything. I made a couple of phone calls (in the middle of the night) and got access, eventually resetting it before calling the federal partner back. They then realized that yes, it was a problem with the vendor, and they took responsibility for the issue back.

At that point I handed the issue off to my boss and the daytime supervisor and went home for the day. The backlog of messages for the federal partner was now approaching a panic state.

Of course, when I went back in to work I had to look up what the eventual resolution of the issue was.

The federal agency didn’t get the communications circuit working again until late in the afternoon. I’m sure our upper management had an issue with that, but since it was on the federal side of things, who are we at the state to question them? Unfortunately, that didn’t completely resolve the issue. They could confirm in testing that things should work, but none of our communications were going through. This led to a comedy of failures.

First they thought it was our mainframe, so they got our mainframe programmers into the game and had them reset a bunch of processes in an attempt to get things working again. This, of course, didn’t work.

Then they thought it might be the communications software itself that wasn’t talking properly from our end at the state to the federal side. Our vendor went through everything in their software and reset everything they could without affecting our ability to communicate with other Texas state agencies or the local cities across the state. After that, they all realized that this wasn’t it either.

Having eliminated all of the obvious things, we were back down to a potential hardware issue. One of our engineers took a look at a piece of security equipment that sat between the federal system and our state system, since that seemed to be where communications were stopping, and found that a tiny line of programming there was blocking all of our communication. After that programming was reconfigured, suddenly everything worked, and our now immense backlog of messages went through.

In total, this communications link, which was never supposed to be down, was out of service for nearly 23 hours. Ouch.

Through this whole process there were failures. Up front, the federal agency we partner with blamed it on a vendor without checking anything. It turns out they were looking at a connection to another federal agency that just happens to have an office in Texas. That connection was nowhere near us and had no effect on us at all. Had they thought to check the correct equipment up front, it might have saved several hours in figuring it out. That failure led to a data delay that had a potential impact on thousands of citizens across the state. I’m certain that this isn’t the last I’ll hear about it…
