Parsing English text without spaces: vashu11

Jun 18, 2015 11:18

I like to copy and paste quotes from books that I read. Acrobat Reader has frustrated me many times by putting something like this on the clipboard:

Evenif ouroldideasaboutthemindarewrong,wecanlearnalotbytryingtounderstand w h y w e b e l i e v e t h e m . I n s t e a d o f a s k i n g , " W h a t a r e S e l v e s ? w" e c a n a s k , i n s t e a d, " W h a t d r e o u r ideasaboutSelyes?"

If the book was scanned and converted to .pdf, Acrobat can only give you text after running its OCR, which is rather bad. So you get text with either no spaces at all or too many spaces.

So I wrote a script to deal with this problem.

Download it, then download the dictionary file (run 'curl http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt > wordsEn.txt' or download it from here manually).
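For illustration, a word list like this is usually loaded into a set so that lookups are fast. This is only a minimal sketch, assuming the file holds one word per line. `load_words` is a hypothetical helper name, not necessarily what the script itself does.

```python
def load_words(path):
    """Read a one-word-per-line file into a set of lowercase words."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines and lowercase everything so lookups ignore case.
        return {line.strip().lower() for line in f if line.strip()}
```

With the file from the curl command above, `words = load_words("wordsEn.txt")` would give a set you can test with `"hello" in words`.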

Then run 'python acrobat_clipboard_corrector.py copypaste.txt', and the text above turns into:

Even if our old ideas about them in dare wrong , we can learn al otby trying to understand why web el ie vet hem . Instead of asking , "What areS elves ? w"ecan ask , instead , "What dre our ideas about Se lyes ? "

It removes all spaces from the text, then tries to split it at word boundaries using the dictionary. It is not perfect, but it is better than nothing.
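The idea can be sketched roughly like this. This is not the actual script, just one possible approach under my own assumptions: `segment` and `respace` are hypothetical names, `words` is a set of lowercase dictionary words, and the split is chosen by dynamic programming to leave as few unknown characters as possible, and then to use as few pieces as possible.

```python
import re

def segment(letters, words, max_len=20):
    """Split a run of letters into pieces, preferring dictionary words.

    cost[i] is the best (unknown characters, number of pieces) for letters[:i];
    back[i] remembers where the last piece of that best split starts.
    """
    n = len(letters)
    cost = [(0, 0)] + [None] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = letters[j:i]
            # A piece that is not in the dictionary is penalized by its length.
            unknown = 0 if piece.lower() in words else len(piece)
            cand = (cost[j][0] + unknown, cost[j][1] + 1)
            if cost[i] is None or cand < cost[i]:
                cost[i] = cand
                back[i] = j
    # Walk the back pointers from the end to recover the pieces.
    parts = []
    i = n
    while i > 0:
        parts.append(letters[back[i]:i])
        i = back[i]
    return parts[::-1]

def respace(text, words):
    """Drop all whitespace, then re-insert spaces at guessed word boundaries."""
    squeezed = "".join(text.split())
    out = []
    # Runs of ASCII letters get segmented; any other character is its own token.
    for match in re.finditer(r"([A-Za-z]+)|[^A-Za-z]", squeezed):
        if match.group(1):
            out.extend(segment(match.group(1), words))
        else:
            out.append(match.group(0))
    return " ".join(out)

if __name__ == "__main__":
    tiny = {"we", "can", "learn", "a", "lot", "by", "trying"}
    print(respace("wecan learnalot,bytrying", tiny))
```

Preferring fewer pieces means that "themind" comes out as "the mind" rather than "them in d" when all of those words are in the dictionary. Real text is still full of ambiguous splits, so the result depends heavily on the word list.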

python, programming