I've been trying to work out the mechanics of LaTeX3's regular expression system as implemented in l3regex, but I am having some difficulty understanding why it behaves the way it does.
If I use as an example the following code:
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_new:Npn \demo #1 {
\tl_new:N \l_demo
\tl_set:Nn \l_demo {#1}
\regex_replace_all:nnN {\_*?\_} {\emph{\1}} \l_demo
\tl_use:N \l_demo
}
\ExplSyntaxOff
\begin{document}
\demo{This is a _test_ document.}
\end{document}
The following text will be printed on the page:
This is a œmph– ̋testœmph– ̋ document.
But I would have expected to see the following:
This is a test document. (with the word "test" emphasized)
Other regular expressions along the lines of the above pattern give similar results.
Would anyone be able to explain what is happening in this example, and how problems such as this might be fixed?
\_*?\_ looks (lazily) for any sequence of underscores followed by an underscore. Thus the first underscore matches this pattern and it's changed into the string \emph{}, where \e represents the escape character (code 0x1B, which is ignored in print) and the braces are ordinary characters. – egreg Dec 28 '12 at 14:31
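Following the explanation in the comment, one possible fix is sketched below. It makes two changes: the pattern gets a capture group, so that \1 has something to refer to, and the replacement builds a real control sequence and real braces instead of ordinary characters. The variable name \l_demo_tl is my own choice, not from the original code, and the sketch assumes a reasonably recent LaTeX distribution in which the regex functions ship with expl3 (older ones may also need \usepackage{l3regex}).

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Declare the variable once, outside the macro, so that calling
% \demo several times does not try to declare it again.
% (\l_demo_tl is a hypothetical name chosen for this sketch.)
\tl_new:N \l_demo_tl
\cs_new_protected:Npn \demo #1
  {
    \tl_set:Nn \l_demo_tl {#1}
    % \_(.*?)\_ : an underscore, then as few tokens as possible
    %             (captured as group 1), then another underscore.
    % In the replacement, \c{emph} builds the control sequence \emph,
    % and \cB\{ ... \cE\} give begin-group and end-group braces.
    \regex_replace_all:nnN
      { \_(.*?)\_ }
      { \c{emph}\cB\{\1\cE\} }
      \l_demo_tl
    \tl_use:N \l_demo_tl
  }
\ExplSyntaxOff
\begin{document}
\demo{This is a _test_ document.}
\end{document}
```

In the replacement text, \e on its own stands for the escape character, which is why the original \emph{\1} did not produce a command. Writing \c{emph} asks for a control sequence named emph, and the \cB and \cE prefixes give the braces their usual grouping category codes.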