Understanding assertion matching in l3regex

Question

This question is a bit specific and beyond the official interface of the l3regex library, but I'm trying to understand how exactly the library works here.

Given the following example document:

\documentclass{article}
\usepackage{expl3}

\begin{document}
\ExplSyntaxOn

\cs_set:Npn \prepare_regex:N #1 {
    \cs_set_nopar:Npn \l_tmpa_regex {
        \__regex_branch:n {
            \__regex_assertion:Nn \c_true_bool { \char`\^ \__regex_anchor:N \l__regex_min_pos_int }
            \__regex_class:NnnnN  \c_true_bool {\__regex_item_caseful_equal:n {97}}{1}{-1} #1
            \__regex_assertion:Nn \c_true_bool { X \__regex_break_true:w }
            \__regex_class:NnnnN  \c_true_bool { \use:c{__regex_prop_.:} }{0}{-1} \c_false_bool
            \__regex_assertion:Nn \c_true_bool { \char`\$ \__regex_anchor:N \l__regex_max_pos_int }
        }
    }
}

\cs_set:Npn \test_match:n #1 {
    #1:~
    \regex_extract_once:NnNTF \l_tmpa_regex {#1} \l_tmpa_seq {}{}
    \par
}
\ttfamily

greedy~match: \par
\prepare_regex:N \c_false_bool
\test_match:n { ab }
\test_match:n { aab }
\test_match:n { aaab }
\test_match:n { aaaa }

\medskip
lazy~match: \par
\prepare_regex:N \c_true_bool
\test_match:n { ab }
\test_match:n { aab }
\test_match:n { aaab }
\test_match:n { aaaa }

\ExplSyntaxOff
\end{document}

\prepare_regex:N just assigns a regex to \l_tmpa_regex that is equivalent to the compiled form of ^a+.*$ or ^a+?.*$, depending on the passed parameter (greedy or lazy matching). Additionally, an extra assertion that always succeeds was added before the .* part. To see the order in that assertions are executed internally, I added some characters that will be output when the matching happens. The output is

There are several things here, I don't understand:

Why are all three assertions matched over and over again, when the ^ assertion, in my opinion, should only be executed at the beginning, and $ only at the end of the matching process. Perhaps $ would be executed several times to determine the end of the match, but ^ definitely not.
Why isn't the X assertion executed only once per matching when the "boundary" from the series of as is crossed to the bs? It would make some kind of sense, if backtracking (or whatever equivalent l3regex uses for that) would happen, but that's not the case, neither in the greedy nor in the lazy version.
Why is there a single ^$ at the end of all matchings? This makes no sense at all to me, because the regex doesn't even allow ^ followed by $ without an X assertion in between.
The lazy matching adds an X for each additional a in the string, but strangely not for the last one. Why?

Luckily l3regex provides some debug output that lets us have a closer look at what's going on. If we load expl3 with [enable-debug] and use the following snippet with my above example

\prepare_regex:N \c_false_bool
\int_set:Nn \g__regex_trace_regex_int { 100 }
% Recall this tries to match "^a+.*$" greedily
\test_match:n { ab }
\int_set:Nn \g__regex_trace_regex_int { 0 }

we get a debug output similar to the following one (also note the \typeouts added to the regex):

Trace: entering \__regex_build:N
Trace: regex new state L=0-> R=0-> M=1-> 2
Trace: regex new state L=0-> R=1-> M=2-> 3
Trace: entering \__regex_group_aux:nnnnN
Trace: regex new state L=1-> R=2-> M=3-> 4
Trace: entering \__regex_branch:n
Trace: regex new state L=2-> R=3-> M=4-> 5
Trace: regex new state L=2-> R=4-> M=5-> 6
Trace: regex new state L=4-> R=5-> M=6-> 7
Trace: regex new state L=5-> R=6-> M=7-> 8
Trace: regex new state L=6-> R=7-> M=8-> 9
Trace: regex new state L=7-> R=8-> M=9-> 10
Trace: regex new state L=8-> R=9-> M=10-> 11
Trace: leaving \__regex_branch:n
Trace: regex new state L=2-> R=3-> M=11-> 12
Trace: leaving \__regex_group_aux:nnnnN
Trace: \toks1={\__regex_action_start_wildcard: }
Trace: \toks2={\__regex_action_submatch:n {0<}\__regex_action_free:n {2}}
Trace: \toks3={\__regex_action_submatch:n {0>}\__regex_action_free:n {8}}
Trace: \toks4={\typeout {^}\char `\^\__regex_anchor:N \l__regex_min_pos_int
               \__regex_break_point:TF {\__regex_action_free:n {1}}{}}
Trace: \toks5={\__regex_item_caseful_equal:n {97}
               \__regex_break_point:TF {\__regex_action_cost:n {1}}{}}
Trace: \toks6={\__regex_action_free:n {-1}\__regex_action_free:n {1}}
Trace: \toks7={\typeout {X}X\__regex_break_true:w
               \__regex_break_point:TF {\__regex_action_free:n {1}}{}}
Trace: \toks8={\use:c {__regex_prop_.:}
               \__regex_break_point:TF {\__regex_action_cost:n {0}}{}
               \__regex_action_free:n {1}}
Trace: \toks9={\typeout {$}\char `\$\__regex_anchor:N \l__regex_max_pos_int
               \__regex_break_point:TF {\__regex_action_free:n {1}}{}}
Trace: \toks10={\__regex_action_free:n {-7}}
Trace: \toks11={\__regex_action_success: }

Trace: state 1
Trace: state 2
Trace: state 4
^
Trace: state 5
Trace: state 6
Trace: state 5
Trace: state 7
X
Trace: state 8
Trace: state 9
$                <-- up to this point the output makes perfect sense to me
Trace: state 1
Trace: state 2
Trace: state 4
^
Trace: state 8
Trace: state 9
$
Trace: state 10
Trace: state 3
Trace: state 11

I do not understand how to parse the start or the output with those L, R and M state mappings. The trace of the NFA execution makes sense to me at the beginning. First it runs up to state 6 where it (because of the greedy matching) tries to go back to state 5 to match another a, which fails and thus brings us to state 9 where the end of the string has been reached.

Now comes the thing I don't understand. State 9 ends with

\__regex_break_point:TF {\__regex_action_free:n {1}}{}

meaning "if the assertion matches, go one state further", i.e. state 10, then 3, then 11 and stop. But stangely it jumps from state 9 back to state 1, checks the ^ assertion again, skips the a matching (!), checks $ again, and only then goes into the final phase.

I'm sure this is just due to my lack of understanding how the regex matching works in detail. But to me this seems at least inefficient, if not wrong, because the regex is defined to find a greedy match and should be able to do that in a single pass. I'm still puzzled here.

Understanding assertion matching in l3regex

0 Answers0