I'm working on a mini-parser that takes free user input and interprets certain inputs as commands. For example, the parser interprets + as \oplus or [ as "start a pre-configured array with the bracket as a delimiter". The parser would ultimately enable convenient insertion of a certain kind of data structure used in linguistics (called AVM), for which there is currently no package on CTAN.
The parser is currently based on looping through an input token list (with \tl_map_inline:nn). But looping through spaces and control sequences from the user input gives me a headache. For example, the user input could contain:
Hello World
\textit{Hello World}
Since \tl_map_inline:nn loops over the items of the token list, the outputs will become "HelloWorld" and "HelloWorld".
Of course, the protected input
Hello{~}World
{\textit{Hello World}}
will give the desired result, but users are unlikely to type their input in that way. Also, the two inputs above really are quite different: Hello{~}World is a token list with 11 items, but {\textit{Hello World}} has just 1. In the mapping, I'd like to parse the contents of a command like \textit as individual letters, not a single token (because its argument could contain characters that the parser should be sensitive to, like the + mentioned above).
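To make the item counts concrete, here is a small check using \tl_count:n, which counts items in the same way that \tl_map_inline:nn visits them. The command name \countitems is made up for this illustration and is not part of the parser.

```latex
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% \countitems is a hypothetical helper: it prints the number of items
% (not tokens) in its argument. Unbraced spaces are not counted, and a
% brace group counts as one item.
\NewDocumentCommand{\countitems}{+m}{ \tl_count:n {#1} }
\ExplSyntaxOff
\begin{document}
\noindent
\countitems{Hello World}\\               % 10: the space is skipped
\countitems{\textit{Hello World}}\\      % 2: \textit and one brace group
\countitems{Hello{~}World}\\             % 11: {~} is an item of its own
\countitems{{\textit{Hello World}}}      % 1: a single brace group
\end{document}
```

The second line shows why the contents of \textit are invisible to a plain mapping: the whole argument arrives as one item.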
That got me thinking that maybe there is a better way to implement the parser than using a token list and a mapping on it. If the token list is the best way, then what methods are there to:
- prepare the user input so that spaces will be kept as items? (maybe replace them with ~, but how?)
- properly forward control sequences to the output and be able to access the tokens their arguments are made of?
In this answer, @egreg suggested receiving the user input as a sequence, splitting that at each space into token lists, and then parsing the token lists (with a space added to the output after every tl). Can this approach be applied to carry along commands in the example of \textit?
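A minimal sketch of how I understand that splitting approach, assuming \seq_set_split:Nnn with a space as the delimiter. The names \parsewords, \l__avmsketch_words_seq and \l__avmsketch_output_tl are invented for this sketch. It only splits at spaces on the top level, so spaces and characters inside a brace group, such as the argument of \textit, are still not handled.

```latex
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Hypothetical names, used only for this sketch.
\seq_new:N \l__avmsketch_words_seq
\tl_new:N  \l__avmsketch_output_tl
\NewDocumentCommand{\parsewords}{+m}
  {
    \tl_clear:N \l__avmsketch_output_tl
    % Split the input at every top-level space.
    \seq_set_split:Nnn \l__avmsketch_words_seq { ~ } {#1}
    \seq_map_inline:Nn \l__avmsketch_words_seq
      {
        % Parse one chunk item by item (here: copy it unchanged).
        \tl_map_inline:nn {##1}
          { \tl_put_right:Nn \l__avmsketch_output_tl {####1} }
        % Put back the space that the split removed.
        \tl_put_right:Nn \l__avmsketch_output_tl { ~ }
      }
    \tl_use:N \l__avmsketch_output_tl
  }
\ExplSyntaxOff
\begin{document}
\noindent
\parsewords{Hello World}
\end{document}
```

Note that this version also appends a space after the last chunk, which a real implementation would probably want to avoid.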
(Is it true that the difference between item and token is at the core of the issue?)
Here's the bare skeleton of my approach (the actual one also contains a mode switch so that the user can disable replacement of, e.g. [ so that one can still enter commands with optional arguments in the scope of the parser):
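\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% User-level command: hands the raw input to the parser.
\NewDocumentCommand{\parse}{+m}
  {
    \avm_parse:n { #1 }
  }
% Collects the parsed output before it is typeset.
\tl_new:N \l_avm_output_tl
% Protected, because the function performs assignments.
\cs_new_protected:Nn \avm_parse:n
  {
    \tl_clear:N \l_avm_output_tl
    % Loop over the items of the input; unbraced spaces are skipped
    % and a brace group is passed as a single item.
    \tl_map_inline:nn {#1}
      {
        \tl_put_right:Nn \l_avm_output_tl {##1}
      }
    \tl_use:N \l_avm_output_tl
  }
\ExplSyntaxOff
\begin{document}
\noindent
\parse{Hello World}\\
\parse{\textit{Hello World}}\\
\parse{Hello{~}World}\\
\parse{{\textit{Hello{~}World}}}
\end{document}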



expl3? If you're parsing token after token, maybe a letter-parser like the one in \usepgfmodule{parser} can be used. – Skillmon Dec 22 '19 at 22:40
[ and others active would make the command impossible to use in some places like \footnote and in trees from the forest package – Felix Emanuel Dec 23 '19 at 08:00
[, (, etc. from the user input are balanced. I will check out the pgf parser. – Felix Emanuel Dec 23 '19 at 08:03
\scantokens is available these days) but as you have given no examples it's hard to say. – David Carlisle Dec 23 '19 at 09:01
tokcycle package is sufficient -- see answer – user202729 Nov 11 '21 at 03:49