I like to copy and paste quotes from books that I read. Many times Acrobat Reader has frustrated me by saving something like this to the clipboard:
Evenif ouroldideasaboutthemindarewrong,wecanlearnalotbytryingtounderstand w h y w e b e l i e v e t h e m . I n s t e a d o f a s k i n g , " W h a t a r e S e l v e s ? w" e c a n a s k , i n s t e a d, " W h a t d r e o u r ideasaboutSelyes?"
If the book was scanned and converted to .pdf, Acrobat can give you text only after running its OCR, which is rather bad. So you get text with either no spaces at all or too many of them.
So I wrote a script to deal with this problem.
You download it, then download a dictionary file (run 'curl http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt > wordsEn.txt' or download it from here manually).
Then you run 'python acrobat_clipboard_corrector.py copypaste.txt' and the text above turns into
Even if our old ideas about them in dare wrong , we can learn al otby trying to understand why web el ie vet hem . Instead of asking , "What areS elves ? w"ecan ask , instead , "What dre our ideas about Se lyes ? "
The script removes all spaces from the text, then tries to split it at word boundaries using the dictionary. It is not perfect, as the output above shows, but it is better than nothing.
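
To make the idea concrete, here is a minimal sketch of that kind of splitting. It is not the code from acrobat_clipboard_corrector.py; the function names load_words and segment, the max_len parameter, and the "fewest pieces wins" rule are all my own assumptions for illustration. It also assumes the dictionary file holds one word per line.

```python
def load_words(path):
    """Read a word list (assumed: one word per line) into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def segment(text, words, max_len=20):
    """Drop all whitespace, then re-split the text into dictionary words.

    Hypothetical helper: picks the split with the lowest total cost, where
    a known word costs 1 and an unknown single character costs 2.
    """
    chars = "".join(text.split())  # remove every space and newline
    n = len(chars)
    # best[i] = (cost, start index of the last token) for the prefix chars[:i]
    best = [(0, 0)] + [(float("inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = chars[start:end]
            known = piece.lower() in words
            # Unknown pieces are only allowed as single characters
            # (punctuation, OCR junk), so every position stays reachable.
            if not known and end - start > 1:
                continue
            cost = best[start][0] + (1 if known else 2)
            if cost < best[end][0]:
                best[end] = (cost, start)
    # Walk the stored start indices backwards to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = best[end][1]
        tokens.append(chars[start:end])
        end = start
    return " ".join(reversed(tokens))


if __name__ == "__main__":
    tiny = {"even", "if", "our", "old", "ideas"}
    print(segment("Evenif ourold i d e a s", tiny))  # Even if our old ideas
```

With a tiny word set the split is unambiguous. With a full word list many competing splits become valid, so a sketch like this can still cut in the wrong place, and a misread word from the OCR cannot be repaired by splitting alone.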