Regular expression weirdness

Question

In answering "Automatically put certain inputs (e.g. punctuation marks) outside of the environment/command", I wrote something very similar to this:

\documentclass{article}
\usepackage{xparse}
\newcommand\Any[1]{``\textbf{#1}''\space}% a dummy command
\ExplSyntaxOn
\seq_new:N \l_word_seq % define a new sequence
\NewDocumentCommand\IterateOverPunctuation{ m }{
  % apply \Any to the "words" in #1 between the punctuation characters
    \regex_split:nnN { ([\(\)\.,;\:\s]+) } { #1 } \l_word_seq% split the sequence
    \cs_set:Nn \l_map_two:n {
       \regex_match:nnTF{ [\(\)\.,;\:\s]+ }{##1}
            {##1}% matches a punctuation character
            {\Any{##1}}% apply \Any to ##1
    }
    \seq_map_function:NN \l_word_seq \l_map_two:n
}
\ExplSyntaxOff
\begin{document}

\IterateOverPunctuation{A, (B: C. D)}

\IterateOverPunctuation{A), E,  G  H(;;,) (B: C. D)}

\IterateOverPunctuation{abc,a:b::def:f}

\end{document}

This code produces:

Can anyone explain to me why the empty double quotes appear at the end of the first two lines?

What is happening is that an empty string is being passed through to

\regex_match:nnTF{ [\(\)\.,;\:\s]+ }{##1}{##1}{\Any{##1}}

As the empty string does not match the regular expression, it is then printed as \Any{}. My real question is: why is \regex_split:nnN putting an empty string into the sequence \l_word_seq?

If we change the match to

\regex_match:nnTF{ ^[\(\)\.,;\:\s]*$ }{##1}{##1}{\Any{##1}}

then we get the output that I expected:

because the new regular expression matches the "punctuation" and the empty string, but none of the "words". So it solves the problem, but I still don't understand why the empty string can appear in the sequence returned by \regex_split:nnN.
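One way to confirm that the empty item really is stored in the sequence, rather than being produced later by the mapping, is to display the sequence directly with \seq_show:N. The following is only a minimal sketch: \l_demo_seq is a hypothetical scratch name that does not appear in the code above, and the expl3 line is only there for older kernels.

```latex
\documentclass{article}
\usepackage{expl3}% harmless on recent kernels, needed on older ones
\ExplSyntaxOn
\seq_new:N \l_demo_seq % hypothetical scratch sequence
% same separator class as above; ~ stands for a space inside \ExplSyntaxOn
\regex_split:nnN { ([\(\)\.,;\:\s]+) } { A,~(B:~C.~D) } \l_demo_seq
\seq_show:N \l_demo_seq % writes the items to the terminal and the log
\ExplSyntaxOff
\begin{document}
\end{document}
```

If the input ends with punctuation, as it does here, the item list shown in the log should end with an empty item, which is what later reaches \Any.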

score 7 · Accepted Answer · answered Oct 02 '17 at 20:49

There is an empty item at the end of the sequence if the token list ends with punctuation. You can remove it by checking whether the last item is empty.

\documentclass{article}
\usepackage{xparse}

\newcommand\Any[1]{``\textbf{#1}''\space}% a dummy command

\ExplSyntaxOn

\seq_new:N \l_word_seq % define a new sequence

\cs_generate_variant:Nn \tl_if_empty:nT { x }

\NewDocumentCommand\IterateOverPunctuation{ m }
 {
  % apply \Any to the "words" in #1 between the punctuation characters
    \regex_split:nnN { ([().,;:\s]{1,}) } { #1 } \l_word_seq% split the sequence
  % drop the last item if it is empty (input ended with punctuation)
    \tl_if_empty:xT { \seq_item:Nn \l_word_seq { -1 } }
      { \seq_pop_right:NN \l_word_seq \l_tmpa_tl }
    \seq_map_function:NN \l_word_seq \__map_two:n
 }

\cs_new_protected:Nn \__map_two:n
 {
   \regex_match:nnTF{ [().,;:\s] }{#1}
     {#1}% matches a punctuation character
     {\Any{#1}}% apply \Any to #1
 }

\ExplSyntaxOff

\begin{document}

\IterateOverPunctuation{AA, (B: C. D)}

\IterateOverPunctuation{A), E,  G  H(;;,) (B: C. D)}

\IterateOverPunctuation{abc,a:b::def:f}

\end{document}

I made a few changes; in particular, you don't need to redefine the mapping function (\l_map_two:n in your code, \__map_two:n here) at each call, and it should be protected.
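As a possible alternative to testing only the last item, the empty items could be removed in one step with \seq_remove_all:Nn. This is a sketch under the assumption that an empty "word" is never meaningful in the input; \l_demo_seq and \demo_split_show:n are hypothetical names used only for illustration.

```latex
\documentclass{article}
\usepackage{expl3}% only needed on older kernels
\ExplSyntaxOn
\seq_new:N \l_demo_seq % hypothetical scratch sequence
\cs_new_protected:Npn \demo_split_show:n #1 % hypothetical helper
 {
   \regex_split:nnN { ([().,;:\s]+) } { #1 } \l_demo_seq
   % remove every empty item, wherever it occurs in the sequence
   \seq_remove_all:Nn \l_demo_seq { }
   \seq_show:N \l_demo_seq % inspect the result in the log
 }
\demo_split_show:n { A,~(B:~C.~D) }
\ExplSyntaxOff
\begin{document}
\end{document}
```

Unlike the check on item -1, this removes empty items anywhere in the sequence, so it is a blunter tool; the last-item test above is the safer choice if empty items elsewhere should be preserved.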

This is really a more precise description of what is happening rather than why it is happening. As far as I can see, the manual does not mention this special feature of \regex_split:nnN adding an empty item when there is terminal punctuation, and this is not a common feature of regular expressions in other languages. Is it intentional? Is it a bug? (Thanks for telling me about \tl_if_empty:xT, although my use of \regex_match:nnTF to deal with the issue seems more efficient in this instance.) — Oct 03 '17 at 04:56
Ah, I take it back, it does happen in other languages! OK, I'll accept this then. That said, both LaTeX and Python, at least, are inconsistent, because initial punctuation should lead to an empty item at the start of the sequence but neither of them does this. — Oct 03 '17 at 05:22
@Andrew I don't know about Python, but consistency is the last thing I expect from Perl. ;-) — egreg, Oct 03 '17 at 06:31
