A quick break: pozorvlak

Jul 13, 2010 16:39

It occurs to me that not enough people know about TeX's line-breaking algorithm.

Take a paragraph of text; this one, for instance. Form a graph whose vertices are possible line-breaks, and whose edges are the words (and part-words, in the case of hyphenation) between two possible breaks. For instance, if your text is "Fred loves Wilma", then you'd have four vertices: the beginning, the space between "Fred" and "loves", the space between "loves" and "Wilma", and the end. You'd have six edges, one for each pair of vertices: "Fred", "Fred loves", "Fred loves Wilma", "loves", "loves Wilma" and "Wilma".
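To make the construction concrete, here's a minimal Python sketch of that graph. The function name `break_graph` and the tuple layout are my own inventions for illustration, and it ignores hyphenation entirely, so the only possible breaks are at spaces. It enumerates every pair of breaks, so the single-word line "loves" shows up as an edge too.

```python
from itertools import combinations


def break_graph(text):
    """Return (vertex_count, edges); each edge is (start, end, words on that line)."""
    words = text.split()
    # Vertex i is the possible break just before word i;
    # vertex len(words) is the end of the text.
    n_vertices = len(words) + 1
    edges = [
        (i, j, " ".join(words[i:j]))
        for i, j in combinations(range(n_vertices), 2)
    ]
    return n_vertices, edges


n, edges = break_graph("Fred loves Wilma")
for start, end, line in edges:
    print(start, "->", end, repr(line))
```

Any path from vertex 0 to the last vertex is one way of breaking the paragraph into lines.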

Now, here's the clever bit: you decorate each edge with the "badness" associated with fitting those words onto a line, represented as a number between 0 and 10,000. There are various heuristics used to calculate this "badness" score, but basically it comes down to how much you have to squash or stretch the words and the spaces between them, plus extra badness penalties for hyphenation etc.
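As a rough illustration of the idea (not TeX's actual code), here's a toy badness function. TeX's rule is approximately 100 times the cube of the ratio between the stretch or shrink you need and the stretch or shrink the spaces can supply, capped at 10,000. Everything else here is an assumption of mine: widths are measured in characters, each space is one unit wide, and the `stretch` and `shrink` defaults are made-up values. Hyphenation penalties are left out.

```python
INFINITELY_BAD = 10_000  # the cap: "don't even think about it"


def badness(words, line_width, stretch=0.5, shrink=1 / 3):
    """Toy badness for setting `words` on a single line of `line_width` characters."""
    gaps = len(words) - 1
    natural = sum(len(w) for w in words) + gaps  # one unit per space
    excess = line_width - natural  # > 0: spaces must stretch; < 0: must shrink
    if excess == 0:
        return 0  # a perfect fit
    available = gaps * (stretch if excess > 0 else shrink)
    if available == 0 or (excess < 0 and -excess > available):
        return INFINITELY_BAD  # nothing to adjust, or the line is overfull
    ratio = abs(excess) / available
    return min(INFINITELY_BAD, round(100 * ratio ** 3))
```

With these assumed values, `badness(["Fred", "loves"], 10)` is 0 because the words fit exactly, while one extra character of width has to be absorbed by a single space and the score jumps sharply. The cubic growth is what makes slightly loose lines cheap and very loose lines expensive.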

Finding the optimal set of line breaks is now a simple matter of applying a standard shortest-path algorithm to find the minimum-weight route from the beginning of the paragraph to the end. Some further cleverness brings the average-case running time for the algorithm down to linear, and the worst-case to O(n²).
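Here's a sketch of the shortest-path step, written as plain dynamic programming over the breaks in left-to-right order. It is the straightforward O(n²) version, without the extra cleverness that gets the average case down to linear. The edge weight is a stand-in of my own (squared leftover space, with a free last line), not TeX's badness, and the name `optimal_breaks` is hypothetical.

```python
import math
from itertools import accumulate


def optimal_breaks(words, line_width):
    """Cheapest path from the first break to the last; returns the lines."""
    n = len(words)
    # prefix[k] = total length of the first k words, so each edge weight
    # can be computed in constant time.
    prefix = [0, *accumulate(len(w) for w in words)]

    def cost(i, j):
        # Stand-in weight (NOT TeX's badness): squared leftover space.
        natural = prefix[j] - prefix[i] + (j - i - 1)
        if natural > line_width:
            return math.inf  # these words do not fit on one line
        if j == n:
            return 0  # the last line may be short at no cost
        return (line_width - natural) ** 2

    best = [math.inf] * (n + 1)  # best[j]: cheapest way to reach break j
    prev = [0] * (n + 1)  # prev[j]: the break before j on that path
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            total = best[i] + cost(i, j)
            if total < best[j]:
                best[j], prev[j] = total, i
    if math.isinf(best[n]):
        raise ValueError("some word is wider than the line")

    # Walk the prev links backwards from the end to recover the lines.
    lines, j = [], n
    while j > 0:
        lines.append(" ".join(words[prev[j]:j]))
        j = prev[j]
    return lines[::-1]
```

For example, `optimal_breaks("aaa bb cc ddddd".split(), 6)` gives "aaa" / "bb cc" / "ddddd". A greedy first-fit breaker would pack "aaa bb" onto the first line and leave "cc" stranded on a line of its own, which is exactly the sort of locally-sensible, globally-ugly choice the graph formulation avoids.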

Theoretical computer science for the win.

Unfortunately, TeX doesn't apply this algorithm to the problem of breaking paragraphs into pages, for (AIUI) historical reasons: the machines on which TeX was developed didn't have enough memory to hold more than one or two pages at a time, and so had to write out each page as it was created. The page-breaking algorithm is thus local rather than global, and hence sometimes gives strange results. Wikipedia informs me that optimal page-breaking is NP-complete in the presence of diagrams, but in practice we efficiently find good-enough solutions to NP-complete problems every day, so I don't know why they're still using the local algorithm. Hopefully some TeXpert will pop up in the comments and enlighten me :-)

computers, programming, tex, beware the geek