Automatically making spaces (preferably also hyphens) after 3-or-less-chars words non-breaking

Question

In Polish and Czech typography, short words should never be left at the end of a line; hyphenating a longer word is preferred instead. Sadly, typing a non-breaking (NB) space after every short word adds a lot of work, since such words are quite common. A simple general rule would be to automatically convert every space after a word of three or fewer letters to an NB space. (Not to mention that such a solution simply looks more elegant in any Latin-script language.)

In fact, for biological or chemical texts the rule should also extend to hyphens: a hyphen immediately after a word of three or fewer letters should be non-breaking, so that, e.g., an element symbol before the name of a chemical compound, or a Greek letter in front of a protein name, is not separated from what follows.

All existing solutions I have found either rely on a dictionary of words after which to insert an NB space or cover only single-letter words. This does not work for many technical texts, because the rule should also take numbers into account (for example, in a properly typeset text a unit symbol is never separated from the number it belongs to). So far I have found no way to extend the existing code to cover any group of up to three characters.

(no MWE because IMO the matter is too general to make any use of one)

EDIT: Upon some consideration, I think this question can be divided into two parts. The first is setting up a list of non-breaking strings, which should include the explicit hyphen (-), en dash (–), em dash (—), colon (:) and colon surrounded by spaces ( : ), as all these characters can appear as parts of a word in a chemical context. The second is detecting words made of fewer than 4 letter-like characters (which must include digits and letters outside the basic Latin script, e.g. Greek) and converting the spaces after them to non-breaking spaces, which I think could be done with a RegEx as a last resort.
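
The RegEx fallback mentioned in the second part could look roughly like the following Python sketch, run over the source before compiling. It is only an illustration under stated assumptions: the function name `tie_short_words` and its parameters are hypothetical, `~` is assumed as the NB space, and `\nobreakdash-` (which requires `amsmath`) is assumed as the non-breaking hyphen. The only protection for TeX code is that control words such as `\it` are skipped; comments, math, command arguments and `.bib` files are not handled.

```python
import re


def tie_short_words(text, max_len=3, nb_space="~", nb_hyphen=r"\nobreakdash-"):
    """Make the space or hyphen that follows each short word non-breaking.

    A "word" is a run of Unicode letters or digits. The replacement strings
    are assumptions; pass whatever the document setup actually supports.
    """
    w = r"[^\W_]"  # any Unicode letter or digit (\w without the underscore)
    pattern = re.compile(
        # not preceded by a letter/digit or a backslash, then 1..max_len word
        # characters, then one space (before non-space) or a hyphen (before a word)
        rf"(?<!{w})(?<!\\)({w}{{1,{max_len}}})( (?=\S)|-(?={w}))"
    )

    def repl(match):
        sep = nb_space if match.group(2) == " " else nb_hyphen
        return match.group(1) + sep

    return pattern.sub(repl, text)


if __name__ == "__main__":
    print(tie_short_words("w domu i na polu"))
    print(tie_short_words("5 mg of NaCl"))
    print(tie_short_words("α-helix"))
```

Because `\w` is Unicode-aware for Python `str` patterns, Greek letters and digits count as word characters without any dictionary. A real preprocessor would still need a proper notion of which parts of the file are text and which are code.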

https://tex.stackexchange.com/questions/554760/apply-lefthyphenmin-to-parts-of-a-word-spelled-with-hyphens — corvus_192, Oct 06 '20 at 16:25
That is LuaLaTeX-only and covers just hyphens. Besides, it also handles only 1-character words, which is the main issue with all the solutions I have tried. — Paweł Małecki, Oct 06 '20 at 16:31
You can't reasonably do that in TeX, but it should be a trivial edit in the text editor you use to write the source, so I'd recommend that. — David Carlisle, Oct 06 '20 at 17:55
Yeah, I was afraid of someone suggesting a RegEx parsing of the input... The problem is that a RegEx like that must avoid tinkering with TeX commands: it should process just the content text and leave all code parts alone. If anyone else tries RegEx'ing the input: don't forget your .bib file! — Paweł Małecki, Oct 06 '20 at 18:02
Welcome to TeX.SE. Does your earlier comment imply that you're not interested in a LuaLaTeX-based solution? Please clarify. — Mico, Oct 06 '20 at 18:26
I'd prefer a non-LuaLaTeX solution. The document I'm editing is based on an old template created with pdfLaTeX in mind, and while it could probably work with LuaLaTeX just fine, it may result in some misalignment I'd prefer to avoid. However, I understand that with all the novelties LuaLaTeX brings, it may be much easier to achieve with it. The main problem with the solution suggested above is that it does not deal with 2-3 character words and spaces, only with single (letter) characters in front of a hyphen. — Paweł Małecki, Oct 06 '20 at 18:51
"looks more elegant in any Latin script language": I don't think it would work for English. There are simply too many short words, and it would prevent too many line breaks; looking at your question, there are breaks after "the", "to" and "be", for example. — David Carlisle, Oct 06 '20 at 19:41
That's why a solution more flexible than a RegEx would be better: creating a "high breaking cost" space instead of a plain non-breaking space would be optimal. Even in Polish, a sequence of such short words can occur, especially if there are numbers involved. — Paweł Małecki, Oct 06 '20 at 21:05
Related: How to avoid line breaks that result in short words at line edges? (from July 2017). I haven't compared the two LuaTeX-based solutions but maybe there's something of value there too. — ShreevatsaR, Jul 05 '21 at 22:45

score 1 · Answer 1 · answered Oct 07 '20 at 16:04

So I ended up moving the project to LuaLaTeX and writing pre-linebreak filters. The file is a bit lengthy because it requires creating a Unicode/UTF-8 transition database (default Lua has no library for getting the category of a Unicode glyph, which is necessary to provide cross-language support). https://gitlab.com/PawelMalecki/pawelualatex/-/blob/master/linebreak_lib.lua
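
To illustrate why the category lookup matters, here is a minimal Python sketch of the classification step. It is not the code from the linked Lua file: Python is used only because its standard `unicodedata` module already provides the category data that has to be built by hand in Lua. The helper names `is_letter_like` and `short_words` are hypothetical, and treating the L* (letter) and N* (number) general categories as "letter-like" is an assumption.

```python
import unicodedata
from itertools import groupby


def is_letter_like(ch):
    """True for Unicode general categories L* (letters) and N* (numbers)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def short_words(text, max_len=3):
    """Return the runs of letter-like characters that are at most max_len long."""
    runs = (
        "".join(group)
        for is_word, group in groupby(text, key=is_letter_like)
        if is_word
    )
    return [word for word in runs if len(word) <= max_len]


if __name__ == "__main__":
    # Greek letters, Latin letters and digits are all classified the same way.
    print(short_words("β-sheet w 5 ml wody"))
```

A pre-linebreak filter needs the same yes/no decision per character before it can count word lengths and adjust the break after a short word.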
