Learned some interesting things about scanning books and OCR processing today: thewayne

thewayne

Mar 04, 2019 23:43

I'm doing an internship in our local university library through April, and my main task is scanning their annual 'Reports To The President', a précis of college activity sent up to main campus and bound in a book, usually hard-bound. The oldest book was 1965-66; the newest that I've seen thus far is '98-99. I believe there are newer ones already in PDF format online on the local network. Apparently by scanning them and then coding an RDA record for each file, we can get them hosted by the state academic library organization, or somebody, for free.

So that's cool.

I'm using a fairly spiffy Fujitsu specialized document scanner that can scan two pages of a bound book in one pass, but I don't think their software is as good as they claim it to be. It can handle a pretty significant amount of curvature in the books - for example, I was scanning three pages starting at page 385 of 427, so LOTS of curve when you're that far in. I was holding up the left side to get the right page reasonably flat, then holding down both sides with one finger.

And yes, the fingers were captured by the scan.

After you've done your scan, you get into the next phase, where you drag a wire frame to line up one line down the spine between the two pages, then you align four corners to the outside corners of the pages. The program does a good job of detecting the edges and snapping to them, but sometimes you have to do some dragging to improve alignment. Once you've aligned all the scanned pages correctly, you click an Apply button and it re-cuts the scans into individual pages and flattens them, programmatically removing the curve. It does a very good job, though not perfect.

THEN you have to go back through every page and remove the fingertips! The program has a special tool just for this that works a lot like Photoshop's patch tool, except that it auto-selects the fingertip. Click Apply, and the fingertip vanishes.

Once you've removed the fingertips, you can save it to PDF. Theoretically the program performs OCR (optical character recognition), but I can't see that it has any effect. I end up loading the PDF into Acrobat Pro and running OCR there.

And this is where I learned something tonight. While you can't do a spell-check on a scanned document because you're dealing with a scanned image, not words, there's something that's similar: a fragment check. Fragments are words that Adobe Acrobat recognizes as 'I think this is a word or something, but I'm not sure, therefore I didn't map it into the OCR side of the document. Fix it.' Acrobat can't provide a dictionary of suggestions like Word, so when it sees something that it thinks was a word but it couldn't map, you have to type the correction. Or page number. Or budget number. Or tell it to ignore it.
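To make the idea concrete, here is a rough sketch of what a fragment check boils down to. This is not Acrobat's code or API, and I don't know how Acrobat decides internally what counts as a fragment. The `OcrWord` type, the confidence scores, the 0.80 cutoff, and the sample words are all made up for illustration, assuming the OCR engine attaches some kind of certainty score to each word it reads.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class OcrWord:
    """One word as read by a hypothetical OCR engine."""
    text: str          # the engine's best guess at the word
    confidence: float  # assumed score from 0.0 (no idea) to 1.0 (certain)


def find_suspects(words: List[OcrWord], threshold: float = 0.80) -> List[int]:
    """Return the positions of words the engine was unsure about."""
    return [i for i, word in enumerate(words) if word.confidence < threshold]


def apply_corrections(
    words: List[OcrWord], corrections: Dict[int, Optional[str]]
) -> List[str]:
    """Build the final text layer from the guesses plus human corrections.

    A correction of None (or no entry at all) means "ignore it": the
    engine's original guess is kept unchanged.
    """
    result = []
    for i, word in enumerate(words):
        fix = corrections.get(i)
        result.append(word.text if fix is None else fix)
    return result


# Invented sample data: the engine misread "San" and a year.
words = [
    OcrWord("Sau", 0.41),
    OcrWord("Juan", 0.97),
    OcrWord("College", 0.93),
    OcrWord("l967", 0.55),
]

suspects = find_suspects(words)  # positions a human has to look at
fixed = apply_corrections(words, {0: "San", 3: "1967"})
```

The point of the sketch is that the software can only tell you where it was unsure. Every flagged position still needs a person to type the right text or wave it through.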

It took me a good half an hour to fix a three-page document. I don't know how many times I typed the San of San Juan. Just the San; apparently Juan was recognizable.

And that was a three-page document from '67-68. The latest document from '98-99? That was 40-some pages. I'm going to run a fragment check on it tomorrow afternoon and we shall see how long it takes to fix.

One very odd thing about scanning two pages at once in bound books - the page sequence is reversed! This is easily fixed in Acrobat Pro when you're dealing with a handful or two of pages: you just slide page thumbnails around. But dealing with 30 or 40 pages? Next week I'll try scanning a book starting with the last pages and working my way forward to see how that works.
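For a long document, the reordering could in principle be scripted instead of dragged. The sketch below only works out the new page order as a list of zero-based page positions; it does not touch a PDF, and the resulting list would have to be handed to whatever PDF tool you use to rebuild the file. Exactly how the scanner scrambles the order is an assumption here, so there are two guesses: the whole sequence comes out backwards, or the left and right pages of each two-page scan are swapped. The function names are made up.

```python
from typing import List


def reversed_order(page_count: int) -> List[int]:
    """Guess 1: the whole document is backwards, so read it last page first."""
    return list(range(page_count - 1, -1, -1))


def swap_within_spreads(page_count: int) -> List[int]:
    """Guess 2: each two-page scan has its left and right pages swapped."""
    order = []
    for first in range(0, page_count, 2):
        second = first + 1
        if second < page_count:
            order.extend([second, first])
        else:
            order.append(first)  # an odd page out stays where it is
    return order


# Invented example: four pages that came out of the scanner backwards.
scanned = ["p4", "p3", "p2", "p1"]
restored = [scanned[i] for i in reversed_order(len(scanned))]
```

If the pages turn out to be swapped in pairs rather than fully reversed, `swap_within_spreads` gives the order to use instead.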

So important tip when creating PDFs from scanned docs for public consumption: running OCR is only half the job. If you need the document to be searchable, you MUST spend the time to run a fragment check on it and fix all of the problems! Otherwise you're going to frustrate anyone needing to do anything serious with the document.

One thing that makes me really wish I had a working Mac laptop: I'd like to take an unfixed doc and run it through text to speech and see how it works. Then run the fixed doc through TTS. Might be interesting.

This entry was originally posted at https://thewayne.dreamwidth.org/1121724.html. Please comment there using OpenID.

education, libraries, computers