How does TeX's hyphenation algorithm work?

Question

TeX uses a built-in algorithm to decide where words can be hyphenated. This algorithm sometimes fails, as discussed in the questions "Breaking words at the end of line" and "How to manually set where a word is split?". (There's also an online list of known algorithm failures.) How does the algorithm work, in broad strokes? I know it's language-dependent, so for concreteness let's say the American and/or British English algorithms.

Have a look at https://www.tug.org/docs/liang/liang-thesis.pdf — TeXnician, Oct 28 '17 at 16:30
The short answer is that it just takes data from a dictionary, compactly encoded into “patterns” with Liang's method. See Appendix H (Hyphenation) of the TeXbook, Part 42 of the TeX program, or Frank Liang's thesis Word hy-phen-a-tion by com-pu-ter. I think though that if what you're looking for is to be able to guess/predict where TeX will break words, you need to look simply at the hyphenation patterns that are loaded, rather than Liang's algorithm. — ShreevatsaR, Oct 28 '17 at 16:41
Regarding what you call a "list of known algorithm failures": I would disagree with the term "algorithm failure". Hyphenation exception would be a much more accurate label. Anyway, getting more comprehensive lists of hyphenation exceptions is one (of many) good reasons for loading the babel package with a suitable language choice. — Mico, Oct 28 '17 at 20:10
@Mico Why do you think "hyphenation exception" is more accurate than "algorithm failure"? — tparker, Oct 28 '17 at 22:16
@tparker - There are two types of errors when dealing with hyphenation issues: (i) failure to find valid hyphenation points and (ii) selection of invalid hyphenation points. Arguably, the second error type is worse, as it's immediately visible. Liang's algorithm and the hyphenation patterns that are loaded if no language option is set are designed to minimize the frequency of the second error type. Study the list of words you referenced: most cases involve the first error type, as per the algorithm's design. (FWIW, I contributed quite a few of the words on this list. My favorite is "bedwarf".) — Mico, Oct 29 '17 at 08:18
Some previous explanations: https://tex.stackexchange.com/questions/262588/how-are-hyphenation-patterns-written and https://tex.stackexchange.com/a/74369/48 — ShreevatsaR, Oct 30 '17 at 22:27

score 15 · Accepted Answer · answered Oct 28 '17 at 21:24

The algorithm itself is not language-dependent; only the data it uses depends on the language.

There are two basic components. The first is a list of hyphenation exceptions: some are specified in the language definition, and others can be added at any time in a document. If you write \hyphenation{one-tw-o-thr-ee}, then that word (and its upper/lowercase variants) will be hyphenated as shown. Note that no other linguistic variants, such as plurals, are affected by this: if you want "onetwothrees" to be hyphenated in a similar way, that word would also need to be listed.
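As a rough illustration of how such an exception list behaves, here is a minimal Python sketch. It is not how TeX stores exceptions internally; the names parse_exceptions and lookup are made up for this example. It only shows the two properties mentioned above: lookup ignores case, and a plural is a different word that is simply not found.

```python
def parse_exceptions(spec):
    """Turn a space-separated list like 'one-tw-o-thr-ee' into a table.

    Keys are the lowercased words without hyphens; values are the
    permitted break positions (position i = break before letter i).
    """
    table = {}
    for entry in spec.split():
        parts = entry.lower().split("-")
        positions = []
        total = 0
        for part in parts[:-1]:
            total += len(part)
            positions.append(total)
        table["".join(parts)] = positions
    return table


def lookup(word, table):
    """Return the listed break positions, or None if the word is not listed."""
    return table.get(word.lower())


table = parse_exceptions("one-tw-o-thr-ee")
print(lookup("OneTwoThree", table))   # [3, 5, 6, 9]
print(lookup("onetwothrees", table))  # None: the plural is not covered
```

A word that has no entry in the table falls through to the pattern mechanism.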

Hyphenation exceptions are useful for special words and give total control in the document, but clearly just listing every word in the language isn't realistic, so the main mechanism (the second component) is patterns.

For each language the format inputs a file that executes \patterns. The original US English patterns live at a location such as

/usr/local/texlive/2017/texmf-dist/tex/generic/hyphen/hyphen.tex

and looking like

\patterns{
.ach4
.ad4der
.af1t
.al3t
.am5at
.an5c
  four thousand more of these lines

If you ignore the digits, each of these runs of letters is matched against the words in the paragraph (a . stands for the start or end of a word). For each word, any pattern that matches a substring assigns its digits (0-9) to the corresponding positions between the letters of the word; a position with no digit in the pattern counts as 0. If two or more patterns assign a value to the same inter-letter position, the highest digit wins.

So after all patterns have been matched against a word, there is a value 0-9 assigned between each pair of adjacent letters. If this value is odd, hyphenation is allowed at that point; if it is even, no hyphenation is allowed there.
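To make the matching concrete, here is a minimal Python sketch of the rule just described. It is an illustration only, not TeX's actual implementation (which stores the patterns in a packed trie), and the function names are invented for this example. PATTERNS is a small hand-picked subset that, as far as I know, matches the patterns in hyphen.tex that apply to the word "hyphenation"; the real file has thousands more. The sketch ignores the minimum-distance parameters.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'hen5at' into letters and digit values.

    values[i] is the digit written before letters[i]; the final entry is
    the digit after the last letter. A missing digit counts as 0.
    """
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def hyphenation_values(word, patterns):
    """Return the maximum digit for every slot of '.word.'."""
    dotted = "." + word.lower() + "."
    scores = [0] * (len(dotted) + 1)  # scores[i] = slot before dotted[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:  # apply the pattern at every place it matches
            for offset, value in enumerate(values):
                slot = start + offset
                scores[slot] = max(scores[slot], value)
            start = dotted.find(letters, start + 1)
    return scores


def hyphenate(word, patterns):
    """Insert '-' wherever the final value between two letters is odd."""
    scores = hyphenation_values(word, patterns)
    pieces = []
    last = 0
    for i in range(1, len(word)):
        # The slot between word[i-1] and word[i] is the slot before dotted[i+1].
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
print(hyphenate("hyphenation", PATTERNS))  # hy-phen-ation
```

With these patterns the slots come out as h0y3p0h0e2n5a4t2i0o2n. Only the 3 and the 5 are odd, so the word breaks as hy-phen-ation; the even values (2 and 4) are what suppress breaks such as "hyphe-nation" that the patterns 1na and 1tio would otherwise allow.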

There are additional integer parameters (\lefthyphenmin and \righthyphenmin) that specify how close to the start or end of a word a hyphen may be placed.
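The effect of those parameters can be sketched as a simple filter on candidate break positions. This is a hypothetical helper, not TeX code: allowed_positions is a made-up name, and the defaults of 2 and 3 are meant to mirror the usual \lefthyphenmin and \righthyphenmin settings for English.

```python
def allowed_positions(word, candidates, left_min=2, right_min=3):
    """Drop break positions that are too close to either end of the word.

    A position i means a break between word[i-1] and word[i], so i letters
    come before the hyphen and len(word) - i letters come after it.
    """
    return [i for i in candidates if left_min <= i <= len(word) - right_min]


# Candidate positions 1, 9 and 10 are too close to an edge of this 11-letter word.
print(allowed_positions("hyphenation", [1, 2, 6, 9, 10]))  # [2, 6]
```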

TeX also uses some clever optimisations that mean it does not have to pattern match every word: it only needs to find the hyphenation points in words that could contain a feasible break point in a paragraph. That is an internal optimisation, though, and doesn't affect the basic hyphenation algorithm.

For some languages that have regular spelling and hyphenation rules, the patterns can be hand-written to reflect those rules. English defeats description by rules, so for cases like this the patterns are usually made by taking an existing dictionary of hyphenated words (e.g. as supplied by a publisher) and using the patgen program to compress the dictionary into a set of patterns that reproduces (say) 80% of the hyphens in the original dictionary.

Very nice answer. Regarding "a digit 0-9 between the letters of a word (no digit being the same as 0)": this means a digit 1–9, I guess? I also wonder whether today it may actually make sense to just list every word in the dictionary! Ploughing through even a million words (and even the largest dictionaries have fewer words than that) should add a couple of milliseconds per paragraph, and looking at some documents in questions on this site, it appears that many people are willing to accept that much delay. :-) — ShreevatsaR, Oct 31 '17 at 03:51
@ShreevatsaR digits 0-9 are legal. I meant a pattern ab1cd is the same as 0a0b1c0d0 — David Carlisle, Oct 31 '17 at 07:49
@ShreevatsaR it would be interesting to try a big dictionary, in classic tex just a very long \hyphenation{..} exception list or in luatex of course you have a hyphenation callback and can write whatever algorithm you like in Lua — David Carlisle, Oct 31 '17 at 07:50
