In LaTeX3, what is the best data type for storing, parsing and outputting free user input?

Question

I'm working on a mini-parser that takes free user input and interprets certain inputs as commands. For example, the parser interprets + as \oplus or [ as "start a pre-configured array with the bracket as a delimiter". The parser would ultimately make it convenient to enter a certain kind of data structure used in linguistics (called AVM), for which there is currently no package on CTAN.

The parser is currently based on looping through an input token list (with \tl_map_inline:nn). But looping through spaces and control sequences from the user input gives me a headache. For example, the user input could contain:

Hello World
\textit{Hello World}

Since \tl_map_inline:nn loops over the items of the token list, both outputs will become "HelloWorld".

Of course, the protected input

Hello{~}World
{\textit{Hello World}}

will give the desired result, but users are unlikely to type their input in that way. Also, the two inputs above really are quite different: Hello{~}World is a token list with 11 items, but {\textit{Hello World}} has just 1. In the mapping, I'd like to parse the contents of a command like \textit as individual letters, not a single token (because its argument could contain characters that the parser should be sensitive to, like the + mentioned above).
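
The difference between items and tokens can be checked directly. The following is a minimal sketch, assuming a reasonably recent expl3 that provides \tl_count:n and \tl_count_tokens:n; \countboth is a hypothetical helper name introduced only for this illustration:

```latex
\documentclass{article}
\usepackage{xparse} % only needed on LaTeX kernels older than 2020-10-01
\ExplSyntaxOn
% \countboth is a hypothetical helper: prints the item count and the token count
\NewDocumentCommand \countboth { +m }
  {
    items:~ \tl_count:n {#1},~
    tokens:~ \tl_count_tokens:n {#1}
  }
\ExplSyntaxOff
\begin{document}
\countboth{Hello World}\par            % items: 10, tokens: 11 (the space is a token, not an item)
\countboth{\textit{Hello World}}\par   % items: 2,  tokens: 14 (braces count as tokens)
\countboth{{\textit{Hello World}}}     % items: 1,  tokens: 16
\end{document}
```

The item count is what \tl_map_inline:nn iterates over, which is why the space disappears and a braced group arrives as a single unit.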

That got me thinking that maybe there is a better way to implement the parser than using a token list and a mapping on it. If the token list is the best way, then what methods are there to:

prepare the user input so that spaces will be kept as items? (maybe replace them with ~, but how?)
properly forward control sequences to the output and be able to access the tokens their arguments are made of?

In this answer, @egreg suggested receiving the user input as a sequence, splitting that at each space into token lists, and then parsing the token lists (with a space added to the output after every tl). Can this approach be applied to carry along commands in the example of \textit?

(Is it true that the difference between item and token is at the core of the issue?)

Here's the bare skeleton of my approach (the actual one also contains a mode switch so that the user can disable replacement of, e.g. [ so that one can still enter commands with optional arguments in the scope of the parser):

\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn

\NewDocumentCommand{\parse}{+m}{
    \avm_parse:n { #1 }
}
\tl_new:N \l_avm_output_tl
% protected, because the function performs assignments
\cs_new_protected:Nn \avm_parse:n {
    \tl_clear:N \l_avm_output_tl
    % maps over items, so spaces and outer braces are dropped
    \tl_map_inline:nn {#1} {
        \tl_put_right:Nn \l_avm_output_tl {##1}
    }
    \tl_use:N \l_avm_output_tl
}
\ExplSyntaxOff
\begin{document}
    \noindent
    \parse{Hello World}\\
    \parse{\textit{Hello World}}\\
    \parse{Hello{~}World}\\ 
    \parse{{\textit{Hello{~}World}}}
\end{document}

do you need to parse, rather than just making + and [ active and giving them suitable definitions? — David Carlisle, Dec 22 '19 at 20:58
Does the code have to use expl3? If you're parsing token after token, maybe a letter-parser like the one in \usepgfmodule{parser} can be used. — Skillmon, Dec 22 '19 at 22:40
@DavidCarlisle if I understand it correctly making [ and others active would make the command impossible to use in some places like \footnote and in trees from the forest package — Felix Emanuel, Dec 23 '19 at 08:00
@Skillmon My code beyond the MWE also uses some features from L3, e.g. a sequence as a stack to check whether the [, (, etc. from the user input are balanced. I will check out the pgf parser. — Felix Emanuel, Dec 23 '19 at 08:03
@FelixEmanuel not necessarily (\scantokens is available these days) but as you have given no examples it's hard to say. — David Carlisle, Dec 23 '19 at 09:01
In most cases the tokcycles package is sufficient -- see answer — user202729, Nov 11 '21 at 03:49

score 4 · Accepted Answer · answered Dec 23 '19 at 17:39

It's possible to parse the entire token list while preserving spaces and groups, but it's a lot more work than "normal" \tl_map_inline:nn parsing (see here for a more complete explanation). The issue is that \tl_map_inline:nn, as its documentation in interface3 says, does not iterate on tokens, but on items, and it removes outer braces. All of XX, X X, X{X}, and {XX}{XX} are lists of two items (in the last one, each item has two tokens). This space and brace removal happens at a low level of TeX because when it grabs an argument, all of \next X, \next{X} and \next <spaces>X grab X as argument: this is what defines an item.
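
To make the item grabbing visible, here is a small sketch; \showitems is a hypothetical helper (not part of the parser) that prints, in brackets, each item that \tl_map_inline:nn receives:

```latex
\documentclass{article}
\usepackage{xparse} % only needed on LaTeX kernels older than 2020-10-01
\ExplSyntaxOn
% \showitems is a hypothetical helper: wraps every item in brackets
\NewDocumentCommand \showitems { +m }
  { \tl_map_inline:nn {#1} { [ \tl_to_str:n {##1} ] } }
\ExplSyntaxOff
\begin{document}
\showitems{XX}\par        % [X][X]
\showitems{X X}\par       % [X][X]     the space is skipped
\showitems{X{X}}\par      % [X][X]     the braces are stripped
\showitems{{XX}{XX}}      % [XX][XX]   two items of two tokens each
\end{document}
```

All four inputs yield two items, matching the description above.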

To iterate on proper tokens, you need to specifically check for the case of spaces and braces. expl3 has three conditional functions, \tl_if_head_is_space:n(TF) to check whether the first token in the argument token list is a space, \tl_if_head_is_group:n(TF) for the case of a grouped list of tokens, and \tl_if_head_is_N_type:n(TF) for all other cases. This allows you to process the argument token list conditionally for spaces and groups, and thus preserve them in the process. This method is also used by expl3's \text_uppercase:n (former \tl_upper_case:n), \tl_count_tokens:n, \tl_reverse:n, and a few others.
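
As an isolated illustration of these three conditionals, the sketch below classifies the head of a token list. The names \demo_head_kind:n and \headkind are hypothetical and exist only for this example:

```latex
\documentclass{article}
\usepackage{xparse} % only needed on LaTeX kernels older than 2020-10-01
\ExplSyntaxOn
% \demo_head_kind:n is a hypothetical name, not part of expl3
\cs_new:Npn \demo_head_kind:n #1
  {
    \tl_if_head_is_N_type:nTF {#1}
      { N-type }
      {
        \tl_if_head_is_group:nTF {#1}
          { group }
          {
            \tl_if_head_is_space:nTF {#1}
              { space }
              { empty } % none of the three tests is true for an empty list
          }
      }
  }
\NewDocumentCommand \headkind { +m } { \demo_head_kind:n {#1} }
\ExplSyntaxOff
\begin{document}
\headkind{+ x}\par      % N-type
\headkind{{ab} c}\par   % group
\headkind{ x}\par       % space
\headkind{}             % empty
\end{document}
```

A token-by-token loop dispatches on exactly this three-way test at every step, removing one token (or one group, or one space) from the front each time.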

Here I defined the \avm_parse:n function so that it iterates over each token of the argument; if the token is of N-type, it checks whether that token is a + or a [ (using \str_case:nnF) and calls the appropriate function to process each case. You can change the definitions of \__avm_special_plus: and \__avm_special_lbrack:w to do what you expect (I made + produce \ensuremath{\oplus}, and [<tokens>] produce \textbf{<tokens>}, with the <tokens> also parsed by \avm_parse:n).

Here's the code:

\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
\NewExpandableDocumentCommand {\parse} {+m}
  { \avm_parse:n {#1} }
\cs_new:Npn \avm_parse:n #1
  { \exp_args:No \exp_not:o { \__avm_parse:n {#1} } }
\cs_new:Npn \__avm_parse:n #1
  {
    \exp:w
      \group_align_safe_begin:
        \__avm_parse_loop:w #1
          \q_recursion_tail \q_recursion_stop
        \__avm_result:n { }
  }
\cs_new:Npn \__avm_end:w \__avm_result:n #1
  {
      \group_align_safe_end:
    \exp_end:
    #1
  }
\cs_new:Npn \__avm_parse_loop:w #1 \q_recursion_stop
  {
    \tl_if_head_is_N_type:nTF {#1}
      { \__avm_N_type:N }
      {
        \tl_if_head_is_group:nTF {#1}
          { \__avm_group:nw }
          { \__avm_space:w }
      }
    #1 \q_recursion_stop
  }
\cs_new:Npn \__avm_N_type:N #1
  {
    \quark_if_recursion_tail_stop_do:Nn #1
      { \__avm_end:w }
    \__avm_parse_specials:N #1
  }
\cs_new:Npn \__avm_parse_specials:N #1
  {
    \str_case:nnF {#1}
      {
        { + }{ \__avm_special_plus: }
        { [ }{ \__avm_special_lbrack:w }
      }
      { \__avm_non_special:N #1 }
  }
\cs_new:Npn \__avm_group:nw #1
  { \exp_args:NNo \exp_args:No \__avm_group:n { \__avm_parse:n {#1} } }
\cs_new:Npn \__avm_group:n #1 { \__avm_add_result:nw { {#1} } }
\exp_last_unbraced:NNo
\cs_new:Npn \__avm_space:w \c_space_tl { \__avm_add_result:nw { ~ } }
\cs_new:Npn \__avm_add_result:nw #1 #2 \q_recursion_stop \__avm_result:n #3
  { \__avm_parse_loop:w #2 \q_recursion_stop \__avm_result:n {#3 #1} }
\cs_new:Npn \__avm_non_special:N #1 { \__avm_add_result:nw {#1} }
%
\cs_new:Npn \__avm_special_plus:
  { \__avm_add_result:nw { \ensuremath { \oplus } } }
\cs_new:Npn \__avm_special_lbrack:w #1 ]
  {
    \exp_args:No \__avm_add_result:nw
      { \exp:w \exp_args:NNf \exp_end: \textbf { \avm_parse:n {#1} } }
  }
\ExplSyntaxOff
\begin{document}
\noindent
\parse{Hello World}\\
\parse{\textit{Hello World}}\\
\parse{Hello{~}World}\\
\parse{{\textit{Hello{~}World}}}\\
\parse{Hello [World]}\\
\parse{\textit{Hell[o W]orld}}\\
\parse{Hell[o{~+}W]orld}\\
\end{document}

and the output of the examples: [image: typeset output of the \parse examples]

Thank you, the recursion showcase is great! I'm having a little trouble understanding some of the \cs_new definitions. The particular confusion I have are (1) \__avm_result:n seems to be defined implicitly - but how? — Felix Emanuel, Jan 10 '20 at 11:18
A (2) confusion is the use of parameters in \cs_new. From the interface3 documentation, I thought that the <parameter> in \cs_new is always sth like #1 #2. In the definition of \__avm_end, is \__avm_result:n #1 to be read as the parameters, i.e. the parameter is actually the result of the function \__avm_result:n #1? — Felix Emanuel, Jan 10 '20 at 11:20
@FelixEmanuel (1) It's not defined: it's only used as a guard token to explicitly show where the result is stored during the loop, and you can remove it without changing the working of the code. In fact, you'd get a small (almost unmeasurable) performance boost if you removed the four occurrences of \__avm_result:n, at the cost of a perhaps less clear code. — Phelype Oleinik, Jan 10 '20 at 11:52
@FelixEmanuel (2) The <parameter> argument of \cs_new:Npn (and similar functions) is the same as the parameter text for TeX's \def: #1 through #9 denote parameters, and any other token acts as an argument delimiter. As an example, consider \def\test#1\foo{Result is (#1).}, and then use it as \test hello\foo and then as \test hello\bar (the second should raise an error). The tokens you put in the parameter text are not arguments of the macro, but they must be there when you use the macro (for example as in \__avm_parse_loop:w, which grabs everything up to \q_recursion_stop). — Phelype Oleinik, Jan 10 '20 at 11:58
it was great working on your code in the past few weeks, there is just one thing I can't understand. Assume there is a boolean \bool_new:N \l_bool and then, in \__avm_parse_loop:w as the very first item of that command's definition, we output to the log \bool_log:N \l_bool so as to check for the value at every iteration. That works fine, and the value is indeed output to the log at each iteration. However, I'm receiving a Missing number, treated as zero. error. Additionally, if compiled with PdfLaTeX, Gammas appear in the output. Do you know what might be causing this weird error? — Felix Emanuel, Mar 02 '20 at 17:43
@FelixEmanuel That's because the entire \parse thing is fully-expandable (expandable is an overused term, but here I mean "if you use \parse in \edef it will do the right thing and won't blow up"), powered by \exp:w (TeX's \romannumeral). In this expansion-only context there is a limited set of commands you can use. \bool_log:N is not expandable (not marked with ★ in interface3), so you can't use it there (effectively, its the same as \exp:w \bool_log:N \l_tmpa_bool)... — Phelype Oleinik, Mar 02 '20 at 18:38
@FelixEmanuel (cont'd) For debugging expandable commands I like an \eshow:n macro, defined as \cs_new:Npn \eshow:n #1 { \exp_after:wN \exp_after:wN \exp_after:wN \use_none_delimit_by_q_stop:w \use:n { \SHOW (#1) } \q_stop } (same principle as expl3's \msg_expandable_error:nn(nnnn)---make sure \SHOW is not defined). Though for boolean variables you will need something like \cs_new:Npn \eshowthe:N #1 { \exp_args:No \eshow:n { \tex_the:D #1 } }, then use it like \eshowthe:N \l_bool. Side-note: \l_bool is misnamed. A better name would be \l_felix_tmp_bool. — Phelype Oleinik, Mar 02 '20 at 18:39

score 1 · Answer 2 · answered Mar 26 '22 at 05:36

The recursion approach is understandable, but I find it somewhat harder to follow than the \tl_analysis function family, and it does not make it easy to preserve the charcode of { and }.

Remark: In simple cases, if a regex replacement suffices, just use that. It is also faster.

Alternative approaches include

using regex to temporarily replace all spaces with some placeholder, processing the result, then replacing the placeholder back;
using the tokcycle or etl package.
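
The regex idea could look roughly like the following sketch. All \demo... names and \regexparse are hypothetical, and the sketch deals only with spaces: brace groups still lose their braces inside \tl_map_inline:Nn, so the test input deliberately contains none.

```latex
\documentclass{article}
\usepackage{xparse} % only needed on LaTeX kernels older than 2020-10-01
\ExplSyntaxOn
\tl_new:N \l_demo_input_tl
\tl_new:N \l_demo_output_tl
% Hypothetical placeholder; defined so that a stray copy still typesets as a space
\cs_new:Npn \demo_space: { ~ }
\cs_new_protected:Npn \demo_parse:n #1
  {
    \tl_set:Nn \l_demo_input_tl {#1}
    % turn every space into the placeholder, so that it becomes an item
    \regex_replace_all:nnN { \s } { \c{demo_space:} } \l_demo_input_tl
    \tl_clear:N \l_demo_output_tl
    \tl_map_inline:Nn \l_demo_input_tl
      {
        \str_if_eq:nnTF {##1} { + }
          { \tl_put_right:Nn \l_demo_output_tl { \ensuremath { \oplus } } }
          { \tl_put_right:Nn \l_demo_output_tl {##1} }
      }
    % replace the placeholder back by an ordinary space
    \regex_replace_all:nnN { \c{demo_space:} } { \  } \l_demo_output_tl
    \tl_use:N \l_demo_output_tl
  }
\NewDocumentCommand \regexparse { +m } { \demo_parse:n {#1} }
\ExplSyntaxOff
\begin{document}
\regexparse{Hello + World}
\end{document}
```

Because the regex functions act on the whole token list, spaces inside brace groups are replaced and restored as well; only the mapping step in the middle is limited to top-level items.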

An example of using \tl_analysis_map_inline:nn (needless to say, in real code don't redefine important LaTeX macros such as \f, \a etc.):

\ExplSyntaxOn
\def \f #1 {
    \tl_build_begin:N \a
    \tl_analysis_map_inline:nn {#1} {
        \int_compare:nNnTF {##2} = {`[} {
            \tl_build_put_right:Nn \a {
                \iftrue \noexpand\textbf{ \else } \fi  % this will x-expand to '\textbf{'
            }
        } {
            \int_compare:nNnTF {##2} = {`]} {
                \tl_build_put_right:Nn \a {
                    \iffalse { \else } \fi  % this will x-expand to '}'
                }
            } {
                \int_compare:nNnTF {##2} = {`+} {
                    \tl_build_put_right:Nn \a { $\noexpand \oplus$ }
                } {
                    \tl_build_put_right:Nn \a {##1}
                }
            }
        }
    }
    \tl_build_end:N \a
    \tl_set:Nx \a {\a}
}
\ExplSyntaxOff
\f{a [Hello + world] b}
\a

This example also demonstrates how to build an intermediate token list that is temporarily unbalanced (see the explanation in my other answer here).

[Images: the typeset output, and the token list content after being processed.]

The disadvantage of this method however, is that it's necessary to manually count the number of open braces and close braces if you want to either

grab an item, or
only operate on the top level.

(The tokcycle package can do that automatically, but it ignores the charcode of { and }.)
