I am looking for a way to replace all s with ſ in my LuaTeX document (preferably plain LuaTeX or OpTeX) unless:

  1. it is preceded by another s,
  2. it is at the end of a word.

So it is similar to rule-based replacement of s with ſ (long s), but with fewer and simpler rules and using plain LuaTeX.

I tried the following code:

\fontfam[LM Fonts]

\directlua{%
  function replace_s_with_long_s(input)
    return input:gsub("s", "ſ"):gsub("ſſ", "ß")
  end
  callback.add_to_callback("process_input_buffer", replace_s_with_long_s)}

\lipsum[1]

\bye

and ran it with the optex long-s.tex command, but there are two issues with it:

  1. the s in \lipsum got replaced (I want the substitution to affect only the text that is going to be typeset, not commands),
  2. I don't know how to match the last letter of a word in Lua (for rule 2).

As a reference, the text

pour qui sont ces serpents qui sifflent sur vos têtes

should become

pour qui ſont ces ſerpents qui ſifflent ſur vos têtes

Ideally, the solution should not depend on the specific font used in the document.

  • @MarcelKrüger Currently I am using \fontfam[LM Fonts] (this is from OpTeX), but I am willing to use Garamond in the future –  Sep 17 '23 at 21:41
  • @MarcelKrüger The only error I get when running "optex long-s.tex" is ! Undefined control sequence. l.9 \lipſum[1], so it does not fail for unrelated reasons? –  Sep 17 '23 at 21:52

1 Answer

It leaves some edge cases, but you can start with

\fontfam[LM Fonts]

\directlua{
  --[[ A list of characters which indicate a word break ]]
  local non_word_chars = {
    [utf8.codepoint'.'] = true,
    [utf8.codepoint','] = true,
    [utf8.codepoint';'] = true,
    --[[ Add more here as appropriate ]]
  }
  --[[ A list of node types which indicate a word break ]]
  local non_word_ids = {
    [node.id'glue'] = true,
    [node.id'rule'] = true,
    --[[ Add more here as appropriate ]]
  }
  local s = utf8.codepoint's'
  local long_s = utf8.codepoint'ſ'
  function replace_s_with_long_s(head)
    --[[ True while the previous character node was an s ]]
    local after_s = false
    for n in node.traverse(head) do
      local char, id = node.is_char(n)
      if char == s then
        local after = n.next
        --[[ Only inspect the following node if there is one ]]
        local after_char, after_id
        if after then after_char, after_id = node.is_char(after) end
        local is_end_of_word = after == nil or non_word_chars[after_char] or non_word_ids[after_id]
        if not (after_s or is_end_of_word) then
          n.char = long_s
        end
        after_s = true
      elseif char or non_word_ids[id] then
        after_s = false
      end
    end
    return true
  end
  callback.add_to_callback("pre_shaping_filter", replace_s_with_long_s)}

\lipsum[1]

\bye

The basic idea: avoid process_input_buffer, since it transforms the input rather than the typeset text, which leads to issues like the one you had with \lipſum. Instead, you can iterate over node lists using e.g. the pre_shaping_filter callback, which runs directly before font processing is done. Then you just have to find character nodes whose .char field is set to the Unicode codepoint of s. A simple boolean tracks whether you are directly after another s, but identifying the end of a word is trickier: this code uses a simple heuristic based on the type or character of the following node, but the lists should probably be extended, and depending on the language the end of a word might be more complicated anyway.
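To experiment with the two rules without running TeX, the same decision logic can be modelled on a plain UTF-8 string. This is only a rough sketch: apply_long_s and word_break are hypothetical names, not LuaTeX or OpTeX API, and the set of word-breaking characters is an assumption that would need extending. It is meant for a standalone Lua 5.3+ interpreter or a separate .lua file, not for pasting into \directlua, where the `--` line comments would not survive.

```lua
-- Pure-Lua model of the two rules, working on a UTF-8 string instead of a
-- node list. All names here are hypothetical helpers, not LuaTeX API.
local S = utf8.codepoint("s")
local LONG_S = utf8.codepoint("ſ")

-- Assumption: these characters end a word; extend the list as needed.
local word_break = {}
for _, c in utf8.codes(" .,;:!?\n") do word_break[c] = true end

local function apply_long_s(text)
  -- Collect the codepoints so that we can look one character ahead.
  local cps = {}
  for _, c in utf8.codes(text) do cps[#cps + 1] = c end

  local out = {}
  local after_s = false  -- true while the previous character was an s
  for i, c in ipairs(cps) do
    if c == S then
      local nxt = cps[i + 1]
      local is_end_of_word = nxt == nil or word_break[nxt] == true
      if not (after_s or is_end_of_word) then c = LONG_S end
      after_s = true
    else
      after_s = false
    end
    out[i] = utf8.char(c)
  end
  return table.concat(out)
end

print(apply_long_s("pour qui sont ces serpents qui sifflent sur vos têtes"))
```

With this word_break set, the reference sentence from the question comes out as "pour qui ſont ces ſerpents qui ſifflent ſur vos têtes", and a doubled s as in "messe" becomes "meſse". The node-based version is still what you want inside the document, since only there can you tell typeset text from commands.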

Three of the remaining issues:

  1. What is the end of a word? https://www.unicode.org/reports/tr29/#Word_Boundaries is the standard algorithm to determine that in Unicode, but implementing that is somewhat involved.
  2. How should combining marks be handled? E.g. should an ś also be replaced? Currently it depends on how it was input (precomposed or decomposed), but both forms should probably be handled consistently.
  3. s inside of explicit \discretionary is ignored.
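For the second issue, one possible way to make decomposed input behave more predictably is to look past combining marks when checking what follows an s. With the code above, an s followed by a combining mark sees the mark as the next character, so it is never treated as being at the end of a word. The sketch below shows the lookahead on a plain list of codepoints. is_combining, next_base and codepoints are hypothetical helpers, and limiting combining marks to U+0300..U+036F is a simplifying assumption. Like any code using `--` comments and `\u{...}` escapes, it is meant for standalone Lua 5.3+ or a separate .lua file rather than \directlua.

```lua
-- Hypothetical helpers, not LuaTeX API.
-- Assumption: only the block U+0300..U+036F (Combining Diacritical Marks)
-- counts as a combining mark; real text may need further ranges.
local function is_combining(cp)
  return cp ~= nil and cp >= 0x0300 and cp <= 0x036F
end

-- Return the first codepoint after position i that is not a combining mark,
-- or nil if the list ends first.
local function next_base(cps, i)
  local j = i + 1
  while is_combining(cps[j]) do j = j + 1 end
  return cps[j]
end

-- Split a UTF-8 string into a list of codepoints.
local function codepoints(text)
  local cps = {}
  for _, c in utf8.codes(text) do cps[#cps + 1] = c end
  return cps
end

-- "pas" with a decomposed acute accent on the s, followed by a space.
local decomposed = codepoints("pas\u{0301} la")

-- The naive lookahead sees the combining mark ...
print(decomposed[4] == 0x0301)
-- ... while next_base skips it and finds the space that ends the word.
print(next_base(decomposed, 3) == utf8.codepoint(" "))
```

In the node list the same idea would amount to following n.next past glyph nodes whose character is a combining mark before testing for the end of a word. Whether a precomposed ś should then also be replaced remains a separate decision.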
  • Thank you for your detailed explanation. The third issue is the only one that could be problematic for my use case; I will take it into account. Thanks again –  Sep 17 '23 at 22:30