I hope I can communicate this clearly. I've been trying to pass an argument to a Lua function that contains LaTeX commands. Normally this works fine, but if Lua tries to match/sub %s within this argument it seems to treat the command as if it had no curly braces and operates only on the following single character. So, with the code in file 'new.lua' like this...

local function test(str)
    newstr = str:gsub("(-+)","X")
    tex.print(newstr)
end

return {test=test}

and the following in LaTeX:

\documentclass{article}
\usepackage{luacode}
\directlua{lua = require("new.lua")}
\newcommand{\test}[1]{\directlua{lua.test(\luastringN{#1})}}

\begin{document}

\test{aaa-\textbf{bbb}-ccc}

\end{document}

I obtain the desired result:

aaaXbbbXccc

but if I try the same with whitespace instead of -, like so:

(Lua)

local function test(str)
    newstr = str:gsub("(%s+)","X")
    tex.print(newstr)
end

return {test=test}

(LaTeX)

\documentclass{article}
\usepackage{luacode}
\directlua{lua = require("new.lua")}
\newcommand{\test}[1]{\directlua{lua.test(\luastringN{#1})}}

\begin{document}

\test{aaa \textbf{bbb} ccc}

\end{document}

I get the incorrect output

aaaXbbbXccc

and the following error:

! Undefined control sequence.
l.1 aaaX\textbfX
              {bbb}Xccc
l.8 \test{aaa \textbf{bbb} ccc}

In trying to figure this out, I noticed that replacing the X here with a non-alphabetic character makes the command operate only on that character before "adding" whitespace after the whole unit. So, with the following Lua...

local function test(str)
    newstr = str:gsub("(%s+)","1")
    tex.print(newstr)
end

return {test=test}

and the LaTeX the same as the previous example, I get:

aaa11bbb1ccc

Ideally, I would like the match/sub to treat LaTeX commands as if they involved no "hidden" whitespace, if that makes sense; that is, I only want to match the whitespace that is present in the actual written LaTeX.

I know this issue stems from an improper understanding of the way TeX handles tokens, but I'm not sure how to rectify that, or how to understand it properly.

1 Answer

The space in \textbf {bbb} is inserted when LaTeX passes the argument #1, so already before \directlua is involved. Therefore you cannot do anything to prevent this from the Lua side. To address this from the LaTeX side you would need some tricks with category codes and the like, which is not very pretty.
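
To make the failure concrete, here is a small plain-Lua sketch of what the naive substitution does to such a string. The input literal is an assumption: it is the string as Lua presumably receives it, reconstructed from the error message in the question, and the variable names are only illustrative.

```lua
-- Assumed input: the argument as it arrives in Lua, including the space
-- that TeX writes after the control word \textbf (reconstructed from the
-- error message in the question, not captured from a real run).
local received = "aaa \\textbf {bbb} ccc"

-- Naive replacement of every whitespace run, as in the question.
-- gsub returns two values; only the new string is kept here.
local with_x = received:gsub("%s+", "X")
local with_1 = received:gsub("%s+", "1")

print(with_x)  -- aaaX\textbfX{bbb}Xccc
print(with_1)  -- aaa1\textbf1{bbb}1ccc
```

In the first case TeX reads \textbfX as a single, undefined control word. In the second case the digit ends the control word, so \textbf takes 1 as its argument and {bbb} is typeset as an ordinary group, which matches the output aaa11bbb1ccc reported in the question.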

As a workaround you can remove the inserted space from Lua, replacing the pattern

backslash alphabetic characters [possible space] [curly or square open bracket]

with the same sequence but without the space. Commands with optional arguments are tokenized as \command [optional argument]{normal argument}, so the space is always directly after the command.

MWE:

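local function test(str)
    --texio.write(str)
    -- drop optional whitespace between a control word and a following { or [
    local newstr = str:gsub("(\\%a+)%s-([{[])","%1%2")
    --texio.write(newstr)
    -- replace the remaining runs of whitespace
    newstr = newstr:gsub("(%s+)","1")
    tex.print(newstr)
end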

return {test=test}

\documentclass{article}
\usepackage{luacode}
\directlua{lua = require("new.lua")}
\newcommand{\test}[1]{\directlua{lua.test(\luastringN{#1})}}

\begin{document}
\test{aaa \textbf{bbb} ccc}

\end{document}

Result:

[image of the typeset output]
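
The two substitutions can also be checked outside TeX. The sketch below is plain Lua; replace_spaces is a hypothetical helper that returns the string instead of calling tex.print, and the input literals are assumed examples rather than strings captured from a real run.

```lua
-- Hypothetical helper (not part of new.lua): the same two steps as test(),
-- but returning the result so it can be inspected without LuaTeX.
local function replace_spaces(str)
    -- remove optional whitespace between a control word and { or [
    local cleaned = str:gsub("(\\%a+)%s-([{[])", "%1%2")
    -- replace the remaining runs of whitespace
    local result = cleaned:gsub("%s+", "1")
    return result
end

-- With the inserted space, as delivered through \luastringN:
print(replace_spaces("aaa \\textbf {bbb} ccc"))       -- aaa1\textbf{bbb}1ccc
-- Without it, as in the raw source line:
print(replace_spaces("aaa \\textbf{bbb} ccc"))        -- aaa1\textbf{bbb}1ccc
-- A command with an optional argument:
print(replace_spaces("a \\section [short]{long} b"))  -- a1\section[short]{long}1b
```

Because %s- also matches zero characters, the helper gives the same result whether or not the space is present. A control word that is not followed by a bracket keeps its trailing space, so this sketch would still replace that space.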

Marijn
  • Would you be so kind as to add a reference for this: "The space in \textbf {bbb} is inserted when LaTeX passes the argument #1". Is it in The TeXbook? I am working on converting some TeX source to a different format, and this behaviour puzzled me. – Tomáš Kruliš May 17 '20 at 13:57
  • @Marijn That did the trick! Thank you very much. – Niko Dimitrioğlu May 17 '20 at 14:10
  • @TomášKruliš I haven't found a definitive reference, but for example in The TeXbook p48 it is remarked that "For example, when the arguments to a macro are first scanned, they are placed into a token list" and on p212 "When a macro is expanded, TeX first determines its arguments (if any), as explained earlier in this chapter. Each argument is a token list;". An example of inserting a space is given on p382: "Notice that a space will be inserted after the control word \it, but no space might actually have occurred there in the argument to \verbatim; such information has been irretrievably lost." – Marijn May 17 '20 at 15:04
  • @TomášKruliš And a direct quote by Joseph Wright (which can be considered authoritative): "A bit more on the addition of spaces by \detokenize, and indeed more generally. Whenever TeX writes something that can be one or more tokens as a 'string', it always inserts a space after each 'control word' (escape character followed by one or more 'letters') to avoid confusion." (https://tex.stackexchange.com/a/20062/) – Marijn May 17 '20 at 15:05
  • @TomášKruliš So to be a bit more precise, as far as I understand it: when the argument is passed it becomes a token list, which means that \textbf becomes a single token, and macro tokens get a space appended at the end, which you do not notice when the macro is used in the document in the normal way, but you do notice when the token is treated in a string-like way (for example \typeout{#1}, or with \detokenize or when processed with \luastringN). – Marijn May 17 '20 at 15:12
  • +1. To give the code a more Lua-y "feel", I'd probably replace "(\\[A-Za-z]+) ([{[])" with "(\\%a+)%s-([{[])"; observe the use of the "magic" token %a in lieu of [A-Za-z]. I'd also replace the single space character with %s-. That way, the function can continue to be used if it were assigned to the process_input_buffer callback; at that very early stage of processing LaTeX will not yet have had a chance to replace \textbf{bbb} with \textbf {bbb}. – Mico May 17 '20 at 18:00
  • @Mico thanks for the suggestion, I edited it in. – Marijn May 17 '20 at 19:57