
I've just come across an interesting snippet in New Hart's Rules (p.59):

The New Oxford Spelling Dictionary therefore uses two levels of word division -- 'preferred' divisions (marked |), which are acceptable under almost any circumstances, and 'permitted' divisions (marked ¦), which are not as good, given a choice. Thus unhelpful is shown as un|help¦ful.

This question extends LaTeX Hyphenation.

This seems like an immensely good idea for improving hyphenation; is there a TeX format or package that provides an interface to this type of function? In particular, is there any way I can provide hyphenation priorities à la Liang for an arbitrary word of my choice?

I couldn't find anything helpful in TeX by Topic, though Appendix H of The TeXbook seems to mention hyphenation priorities in relation to Liang's algorithm (1994, p.449).

  • As I understand it, hyphenation priorities can only be given in pattern files, not in the document (via the \hyphenation{...} command). – topskip Aug 02 '12 at 11:01
  • No is the simple answer. It comes up a lot in languages that make more use of compound words as well, where it would be nice to treat splitting at the joins differently from splitting one of the components. As Patrick says, within the pattern files you can approximate this to an extent by giving higher priorities to patterns that are the preferred points. In LuaTeX you could in principle (I think) use its callback mechanism to try hyphenating first with patterns that use only the preferred hyphenation points, and then switch languages to load a different set of hyphenation patterns containing the full set. – David Carlisle Aug 02 '12 at 11:32
  • Frank Mittelbach talked about this at TUG2012: he had it down as an unsolved problem. – Joseph Wright Aug 02 '12 at 11:39
  • IIRC, Taco once said he is considering weighted hyphenation in LuaTeX, but AFAIK none of this is available yet. – خالد حسني Aug 02 '12 at 12:19
  • @David your idea is nice for testing with Lua, but the patterns do not have ways to prioritise breakpoints; Patrick and you are mistaken here. – Frank Mittelbach Aug 04 '12 at 21:45
  • @FrankMittelbach you are right of course, Patrick led me astray:-) – David Carlisle Aug 05 '12 at 00:54
  • @FrankMittelbach thanks for the clarification! My bad... – topskip Aug 05 '12 at 07:31

1 Answer


TeX (and all of its variants, including LuaTeX) implements Liang's hyphenation algorithm unchanged (pTeX might be an exception, as Japanese requires a completely different approach, but Liang's algorithm is probably available there too).

This algorithm does not implement any hyphenation priorities! It is true that different numbers appear in the patterns, but they are there only to let one pattern override decisions made by other patterns: at the end of the day, a hyphenation point is found wherever the maximum number is odd.

If I remember correctly how the patterns are derived, generation is a multi-pass process: first one generates patterns with only the values 0 and 1 from a set of hyphenated words. Then one applies these patterns and looks at all the false hyphenations that result from them. For those false positives, patterns with value 2 (forbidden) are generated. Then one looks at what is now missing with the new set and generates patterns with value 3 (overriding the 2 in places), and so on. In theory those numbers could go up as high as you like, but usually the results are already pretty good after a few iterations. Anyway, the bottom line is that the algorithm produces a simple yes/no for each position and no weights whatsoever.
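To make the "maximum number is odd" rule concrete, here is a minimal Python sketch of the lookup step. It is an illustration only, not TeX's implementation: the function names and the four toy patterns are invented for this example (they are not taken from any real pattern file), and it ignores details such as \lefthyphenmin and \righthyphenmin.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'l2pf' into its letters ('lpf') and the
    level found in each slot between letters (0 where no digit is given)."""
    letters = []
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)      # digit belongs to the current slot
        else:
            letters.append(ch)
            levels.append(0)          # open the slot after this letter
    return "".join(letters), levels


def hyphenation_points(word, patterns):
    """Return the positions k at which word may be split as word[:k] + '-' + word[k:]."""
    padded = "." + word.lower() + "."       # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)        # scores[i] is the slot before padded[i]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                # Only the maximum level per slot survives.
                scores[start + offset] = max(scores[start + offset], level)
            start = padded.find(letters, start + 1)
    # Odd maximum means "break allowed"; even (including 0) means "no break".
    return [i - 1 for i in range(2, len(padded) - 1) if scores[i] % 2 == 1]


# Toy patterns (hypothetical): 'l1p' wrongly allows hel-pful, 'l2pf' forbids it again.
toy_patterns = ["n1h", "p1f", "l1p", "l2pf"]
print(hyphenation_points("unhelpful", toy_patterns))
```

With these toy patterns the level-2 pattern cancels the false hel-pful break, leaving un-help-ful. Replacing "n1h" by "n3h" gives exactly the same list of positions: the level is thrown away once the odd/even decision has been made, which is why the numbers cannot serve as priorities.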

To support priorities, one would need to do some research on how they could best be captured, stored and used, and as Joseph remarked, this is one of the issues with TeX that I already highlighted in 1990 in E-TeX: Guidelines for future TeX Extensions. Unfortunately, nothing has happened since then.

David's suggestion of using two sets of hyphenation patterns, applying only the "preferred" ones first and only then the "full" set, is an interesting idea. It might work with LuaTeX, but there is the complication of text containing words in several languages. There is also the question of whether this is the best possible approach, or whether it would be better to use different demerits in the linebreaking (i.e., lower ones for hyphenation in preferred places). My assumption is that on the whole this would result in better solutions---but again, this really needs research, which so far nobody has undertaken.
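As a rough illustration of the "different demerits" idea, the following Python sketch reads a word in the Oxford notation from the question (| preferred, ¦ permitted) and picks the break with the lowest cost. Everything here is an assumption made for the example: the function names, the penalty values 10 and 50, and the cost formula (unused space plus penalty) are invented and are not TeX's demerits computation.

```python
def parse_divisions(marked, preferred_penalty=10, permitted_penalty=50):
    """Turn e.g. 'un|help¦ful' into the plain word and a {position: penalty} map.
    The penalty values are arbitrary placeholders."""
    letters = []
    penalties = {}
    for ch in marked:
        if ch == "|":
            penalties[len(letters)] = preferred_penalty
        elif ch == "¦":
            penalties[len(letters)] = permitted_penalty
        else:
            letters.append(ch)
    return "".join(letters), penalties


def best_break(marked, room, space_cost=5):
    """Choose where to split a word when `room` characters are left on the line.

    Cost = unused characters * space_cost + penalty of the break point.
    Returns the split position, or None if no marked break fits.
    """
    word, penalties = parse_divisions(marked)
    best = None
    for position, penalty in penalties.items():
        used = position + 1                 # first fragment plus the hyphen
        if used > room:
            continue                        # fragment does not fit on the line
        cost = (room - used) * space_cost + penalty
        if best is None or cost < best[0]:
            best = (cost, position)
    return None if best is None else best[1]


print(best_break("un|help¦ful", room=7))                  # cheap white space
print(best_break("un|help¦ful", room=7, space_cost=20))   # expensive white space
```

With cheap white space the sketch chooses the preferred point (un-helpful); when leftover space is made expensive, the permitted point (unhelp-ful) wins instead. That trade-off is what a weighted scheme would allow and what a plain yes/no answer cannot express.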

There is also the interesting question of what makes a hyphenation point more desirable. In my opinion, very undesirable ones are those where you are likely to pick up the wrong meaning from a word, e.g., in German

Nonnen-kloster  = nuns abbey
Nonnenklo-ster  = nuns toilet + <no word>

Spar-gelder     = (bank) savings
Spargel-der     = asparagus + the

But for the same reason I would disagree with the Oxford Dictionary's choice of "preferred" hyphenation point, as

un-helpful      

means that the word "helpful" shows up at the beginning of the line/page, and that may result in a reader picking up the opposite meaning from what was intended if they start, for some reason, to skim/read at this point.

  • Frank, thank you very much for an illuminating answer. Looks like a great deal of work will have to be done before we get there. – Brent.Longborough Aug 04 '12 at 21:28
  • @Brent.Longborough I don't really think that it would be that difficult; certainly another PhD is not necessary to get this sorted out. But it does need some research to be usefully extended. – Frank Mittelbach Aug 04 '12 at 21:33
  • My favourite (German) example for undesirable hyphenation is Ur-instinkt (primary instinct) vs. Urin-stinkt (urine stinks). – lockstep Apr 07 '13 at 21:58