I don't usually post about work, because I'd rather not bore you to tears, but this story is too good and I have to share it. It's about what happens when everyone does their job well.
R added code to check for errors from the motion controllers, retry up to 3 times, and throw an exception if retrying doesn't succeed.
J noticed that the number of microns per motor step was not exactly as specified, and added code so it is measured during calibration, and used when moving, so moves are more precise.
D added code to move the stage to the eject position when the operator scans the cartridge barcode. This way the operator could easily insert the cartridge, without having to manually move the eject lever. The eject position is a block at one corner of the stage (X = 200,000 µm, Y = 0 µm). Moving the stage to that position pushes the lever against the block, opening the place where the cartridge goes.
All three developers carefully tested their code, using unit tests and testing on the instrument. All the code was reviewed, the pull requests were approved, and everything was merged to the master branch. Let's test it one more time on the instrument.
Launch the software. Scan the cartridge barcode. Get an error "Enter barcode failed because: one or more errors occurred." It is not possible to run the instrument. Did I mention that we were hoping to release this version of the software into the lab?
What happened?
When I scanned the cartridge barcode, D's new code moved to Y = 0 µm, then tried to move to X = 200,000 µm so it could push the eject lever against the block. J's code updated the number of microns per motor step to be more accurate. On this instrument the number was slightly smaller, so more motor steps are required to get to a position in microns. But the eject position is at the limit of motion, so if we tell it to go any farther than that, it's an error. And R's code made sure the error was detected and an exception was thrown, stopping everything. The not very helpful "one or more errors occurred" message appeared because I used asynchronous tasks, and their default behavior is to wrap up any exceptions that occur in an aggregate exception with that message.
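The collision can be reproduced in miniature. This is a Python sketch, not the instrument's actual code: the step sizes, the travel limit in steps, and the names `controller_move`, `move_to`, and `MoveError` are all assumptions made up for illustration. It only shows how a slightly smaller microns-per-step value turns a move to the travel limit into a rejected command, and how the retry logic then turns that into an exception.

```python
NOMINAL_UM_PER_STEP = 1.0         # assumed spec value (hypothetical)
MEASURED_UM_PER_STEP = 0.999575   # assumed calibration result, slightly smaller
EJECT_X_UM = 200_000              # eject position, right at the travel limit
# Assumed: the controller's limit is the eject position in nominal steps.
MAX_STEPS = round(EJECT_X_UM / NOMINAL_UM_PER_STEP)


class MoveError(Exception):
    """Raised when the controller still rejects a move after all retries."""


def controller_move(steps):
    """Stand-in for the motion controller: reject targets past the limit."""
    return "OK" if 0 <= steps <= MAX_STEPS else "BADDATA"


def move_to(x_um, um_per_step, retries=3):
    """Convert microns to steps, send the move, retry on error, then raise."""
    steps = round(x_um / um_per_step)
    for _ in range(1 + retries):  # one initial attempt plus up to 3 retries
        if controller_move(steps) == "OK":
            return steps
    raise MoveError(
        f"move command failed BADDATA (steps={steps}, limit={MAX_STEPS})"
    )


if __name__ == "__main__":
    # With the nominal step size the eject move lands exactly on the limit.
    print(move_to(EJECT_X_UM, NOMINAL_UM_PER_STEP))
    # With the measured step size the same position needs more steps.
    try:
        move_to(EJECT_X_UM, MEASURED_UM_PER_STEP)
    except MoveError as err:
        print(err)
```

Retrying cannot help here, because the same out-of-range step count is sent every time. Each piece behaves correctly on its own; the failure only exists in the combination.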
How did we fix it?
J changed the config file so the eject position was 199,990 µm instead of 200,000. Good idea. That is a difference of only 0.01 mm, so the lever should still work fine. J goes home for the weekend. I test it on the instrument, and still get the error. It's a good change, but it's not enough.
I ended up writing some code to apply the calibration factor to the eject position in reverse. This gives a target position in microns that we can be sure will be within the max limit in motor steps. For example, given a target position of X = 199,990 µm, it would compute that we should tell the motion controller to move to X = 199,905 µm. This seemed to be the simplest solution. It actually worked when I tested it. We'll see what everyone thinks about it on Monday.
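The arithmetic behind that example can be sketched in Python. The step sizes below are assumptions: a nominal 1.0 µm per step and a measured 0.999575 µm per step, picked only because they reproduce the 199,990 to 199,905 numbers above. The function names are hypothetical too.

```python
NOMINAL_UM_PER_STEP = 1.0         # assumed spec value (hypothetical)
MEASURED_UM_PER_STEP = 0.999575   # assumed, chosen to match the example above


def to_steps(x_um, um_per_step):
    """Convert a position in microns to a whole number of motor steps."""
    return round(x_um / um_per_step)


def compensate_limit_position(target_um,
                              nominal=NOMINAL_UM_PER_STEP,
                              measured=MEASURED_UM_PER_STEP):
    """Apply the calibration factor in reverse.

    Scaling the target by measured/nominal means the calibrated conversion
    (microns / measured) ends up at the step count the nominal conversion
    would have produced for the original target.
    """
    return round(target_um * measured / nominal)


if __name__ == "__main__":
    target_um = 199_990
    # Without compensation the calibrated conversion overshoots in steps.
    print(to_steps(target_um, MEASURED_UM_PER_STEP))   # 200075
    # With compensation we ask for a slightly smaller micron position...
    adjusted_um = compensate_limit_position(target_um)
    print(adjusted_um)                                 # 199905
    # ...and land on the step count the nominal factor would have given.
    print(to_steps(adjusted_um, MEASURED_UM_PER_STEP)) # 199990
```

Under these assumed numbers, the 10 µm of slack from the config change is much smaller than the roughly 85 µm shift the calibration introduces, which is why the config change alone was not enough.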
I added some code to unpack the aggregate exception. Now it says "Enter barcode failed because: move command failed BADDATA." That is the actual error message we get from the motion controller. We'll have to add some code to generate a better error message.
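The unpacking idea looks roughly like this. It is a Python analogue only: `AggregateError` is a hypothetical stand-in for the task library's aggregate exception, and `innermost_messages` and `describe_failure` are made-up names, not the real code.

```python
class AggregateError(Exception):
    """Hypothetical stand-in for an aggregate exception from a task library."""

    def __init__(self, inner):
        super().__init__("One or more errors occurred.")
        self.inner = list(inner)


def innermost_messages(exc):
    """Recursively collect the messages of the non-aggregate exceptions."""
    if isinstance(exc, AggregateError):
        messages = []
        for child in exc.inner:
            messages.extend(innermost_messages(child))
        return messages
    return [str(exc)]


def describe_failure(operation, exc):
    """Build the operator-facing message from the underlying errors."""
    return f"{operation} failed because: {'; '.join(innermost_messages(exc))}"


if __name__ == "__main__":
    # Nested tasks can wrap the real error more than once.
    wrapped = AggregateError(
        [AggregateError([RuntimeError("move command failed BADDATA")])]
    )
    print(f"Enter barcode failed because: {wrapped}")
    print(describe_failure("Enter barcode", wrapped))
```

The first line printed is the unhelpful wrapper message; the second shows the controller's actual error.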
The project is actually going well. This is a classic integration problem. Everyone delivered solid code. Each feature worked properly when tested in isolation. Only when they were all combined did everything fall apart.
This entry was originally posted at
http://voidampersand.dreamwidth.org/24073.html.