This leaves some edge cases unhandled, but you can start with:
\fontfam[LM Fonts]
\directlua{
  --[[ A list of characters which indicate a word break ]]
  local non_word_chars = {
    [utf8.codepoint'.'] = true,
    [utf8.codepoint','] = true,
    [utf8.codepoint';'] = true,
    --[[ Add more here as appropriate ]]
  }
  --[[ A list of node types which indicate a word break ]]
  local non_word_ids = {
    [node.id'glue'] = true,
    [node.id'rule'] = true,
    --[[ Add more here as appropriate ]]
  }
  local s = utf8.codepoint's'
  local long_s = utf8.codepoint'ſ'
  function replace_s_with_long_s(head)
    --[[ True while the previous character node was an s ]]
    local after_s = false
    for n in node.traverse(head) do
      local char, id = node.is_char(n)
      if char == s then
        local after = n.next
        --[[ Only inspect the following node if there is one ]]
        local after_char, after_id
        if after then
          after_char, after_id = node.is_char(after)
        end
        local is_end_of_word = after == nil or non_word_chars[after_char] or non_word_ids[after_id]
        if not (after_s or is_end_of_word) then
          n.char = long_s
        end
        after_s = true
      elseif char or non_word_ids[id] then
        after_s = false
      end
    end
    return true
  end
  callback.add_to_callback("pre_shaping_filter", replace_s_with_long_s)}
\lipsum[1]
\bye
The basic idea: avoid process_input_buffer, since it transforms the input rather than the typeset text, which leads to issues like the one you had with \lipſum. Instead, iterate over node lists, e.g. in the pre_shaping_filter callback, which runs directly before font processing. Then you just have to find character nodes whose .char field is set to the Unicode codepoint of s. Whether you are directly after another s can be tracked with a simple boolean, but identifying the end of a word is trickier: this code uses a simple heuristic based on the type or character of the following node. The lists should probably be extended, and depending on the language the end of a word might be more complicated anyway.
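To experiment with the rule itself outside of TeX, the same logic can be modelled on a plain UTF-8 string. This is only a sketch under assumptions: it is standalone Lua 5.3+ (run it with lua or texlua, not inside \directlua), the name long_s_model and the word_break table are made up for illustration, and a space or newline stands in for the glue nodes of the real node list.

```lua
-- Standalone model of the long-s heuristic (Lua 5.3+, uses the utf8 library).
-- It works on a short UTF-8 string, not on TeX nodes.
local S = utf8.codepoint("s")
local LONG_S = utf8.codepoint("ſ")

-- Hypothetical stand-in for non_word_chars and non_word_ids:
-- punctuation plus whitespace (whitespace plays the role of glue).
local word_break = {}
for _, c in ipairs({ ".", ",", ";", " ", "\n" }) do
  word_break[utf8.codepoint(c)] = true
end

local function long_s_model(text)
  -- Collect all codepoints; fine for short strings, too greedy for long ones.
  local cps = { utf8.codepoint(text, 1, -1) }
  local after_s = false
  for i, cp in ipairs(cps) do
    if cp == S then
      local nxt = cps[i + 1]
      local is_end_of_word = nxt == nil or word_break[nxt] == true
      if not (after_s or is_end_of_word) then
        cps[i] = LONG_S
      end
      after_s = true
    else
      after_s = false
    end
  end
  return utf8.char(table.unpack(cps))
end

print(long_s_model("a possible success"))  -- a poſsible ſucceſs
```

As in the node-based version, an s directly after another s and an s at the end of a word are left alone, so "success" becomes "ſucceſs".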
Three of the remaining issues:
- What is the end of a word? https://www.unicode.org/reports/tr29/#Word_Boundaries is the standard algorithm to determine that in Unicode, but implementing that is somewhat involved.
- How should combining marks be handled? E.g. should an ś also be replaced? Currently it depends on how it was input (precomposed or decomposed), but both forms should probably be handled consistently.
- An s inside an explicit \discretionary is ignored.
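For the combining-mark question, one possible way to make precomposed and decomposed input behave the same is to map both to a single form before deciding anything. The following standalone Lua 5.3+ sketch illustrates this on plain strings. The names precomposed_s, is_combining_mark and decompose_s are hypothetical, the table covers only a few letters, and the mark check only looks at the basic Combining Diacritical Marks block (U+0300 to U+036F). Whether the resulting s should then become ſ is still a policy decision this does not answer.

```lua
-- Sketch: treat precomposed and decomposed "s with mark" identically.
-- Standalone Lua 5.3+; not meant to be pasted into \directlua as is.
local S = utf8.codepoint("s")

-- Precomposed letters and the combining mark each one decomposes to.
-- Deliberately incomplete; extend as needed.
local precomposed_s = {
  [0x015B] = 0x0301, -- ś = s + combining acute accent
  [0x015D] = 0x0302, -- ŝ = s + combining circumflex accent
  [0x015F] = 0x0327, -- ş = s + combining cedilla
  [0x0161] = 0x030C, -- š = s + combining caron
}

-- Assumption: only the basic combining diacritics block is checked.
local function is_combining_mark(cp)
  return cp ~= nil and cp >= 0x0300 and cp <= 0x036F
end

-- Rewrite precomposed letters as s followed by the combining mark,
-- so that a later pass sees the same sequence for both input forms.
local function decompose_s(text)
  local out = {}
  for _, cp in utf8.codes(text) do
    local mark = precomposed_s[cp]
    if mark then
      out[#out + 1] = utf8.char(S, mark)
    else
      out[#out + 1] = utf8.char(cp)
    end
  end
  return table.concat(out)
end
```

In a node list the analogous step would be to compare the glyph's .char against such a table and to skip following mark glyphs when looking for the end of the word; that part is not shown here.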
! Undefined control sequence. l.9 \lipſum[1], so it does not fail for unrelated reasons? – Sep 17 '23 at 21:52