How does \verb detect spaces that shouldn't exist

Question

Consider the following MWE:

\documentclass{article}
\usepackage{listings}
\lstset{basicstyle=\ttfamily}
\begin{document}
\lstinline |asdf|asdf asdfasdf
\verb |asdf|asdf asdfasdf
\end{document}

My understanding of what to expect here has always been the following (let \cmd stand for either \verb or \lstinline in the following):

When TeX first tokenized \cmd |, it gobbles the space following it, leaving only the token \cmd in its "mouth" (and | behind it in the input stream).
It then expands \cmd, which leads to a series of category code changes, basically making every otherwise special character other, followed by some macro that looks at the next token (in this case, |).
This macro then grabs everything up to the next occurrence of that token (being tokenized then), applies some formatting and changes the category codes back.

Notably, the space following \cmd is gobbled during that control sequence's tokenization, i.e. before any category codes are changed.

With this understanding, I would expect both of the lines above to typeset

asdfasdf asdfasdf

But I get the following output:

\lstinline behaves as expected, but \verb somehow knows about the space following it.

How?? To my knowledge, there shouldn't ever have been a space token behind the \verb token.

And, offering further confirmation of your observation, if you put two spaces after \verb, it is done and complete, before ever reaching the |. — Steven B. Segletes, Oct 09 '20 at 14:17
Can be worked around with \expandafter\verb |asdf|asdf asdfasdf. Presumably, in this case, the \expandafter will gobble the spaces in search of the next token, so that \verb subsequently no longer finds the space. — Steven B. Segletes, Oct 09 '20 at 14:18
@StevenB.Segletes only if the character after \verb is safe. \expandafter\verb {asdf{ would cause trouble. (Of course, not being able to just not type the space is not really a common problem, so this is more of an academic question.) — schtandard, Oct 09 '20 at 14:24
The point about "safe" is well taken. The \expandafter will cause the next token to be tokenized, I guess, which locks in its catcode. Since \verb is a game of catcodes, setting the catcode of { before \verb sees it would, logically, cause problems. — Steven B. Segletes, Oct 09 '20 at 14:31
I was a bit surprised to find that \csname verb\endcsname |asdf|asdf asdfasdf also will not gobble the space after \endcsname, unless I add an \expandafter before the \endcsname. — Steven B. Segletes, Oct 09 '20 at 14:36
Other "unsafe" characters for the \expandafter trick include }, %, and active characters, such as ~. — Steven B. Segletes, Oct 09 '20 at 14:42
Very good question. Note that “This macro then grabs everything up to the next occurrence of that token” is not really correct: there is no grabbing of the verbatim contents with a delimited argument. If you use, e.g., \tracingall with \verb |asdf| %, you'll see that \@sverb grabs one argument which is an explicit space token, probably coming from the space character following \verb. Why this space character hasn't been discarded when \verb was tokenized, I don't know. \@sverb makes the grabbed token \let-equivalent to \verb@egroup, which yields an \egroup matching... — frougon, Oct 09 '20 at 14:54
... the \bgroup in \verb. Tokens in-between are simply processed as catcode-12 tokens, except space tokens which are active in this context (this is due to the use of \@vobeyspaces by \@verb here, or by \@sverb when coming from \verb*). — frougon, Oct 09 '20 at 15:04
@StevenB.Segletes “the \expandafter will gobble the spaces in search of the next token”: no, the \expandafter actually hits the space token, not expanding it, but freezing its catcode 10, then the \@ifstar in \verb will ignore (catcode 10) space tokens as usual — Phelype Oleinik, Oct 09 '20 at 16:03
@PhelypeOleinik Thank you for both this clarification and your well explained answer (+1). — Steven B. Segletes, Oct 09 '20 at 16:04

Phelype Oleinik · Answer 1 · 2020-10-09T22:22:27.837

12

At the very beginning you said:

When TeX first tokenized \cmd |

but that's wrong. TeX is a well-behaved gentleman and doesn't get ahead of itself scanning a ␣ and a | before knowing what \cmd is supposed to do. As far as TeX is concerned, the space and the | and whatever other character could all mean the same thing, and could change in meaning, so pre-scanning would only cause confusion.

When TeX sees \cmd, the only “special” thing it does to blank spaces is to set state:=skip_blanks, so that when, say, typesetting, \TeX code will write “TeXcode”, ignoring the spaces after the control sequence as usual. You can check for yourself with:

\def\test{\catcode`\ =12 \testx}
\def\testx{\futurelet\token\testy}
\def\testy{\show\token\afterassignment\testx\let\token = }
\test     x

and you'll see that it shows “the character ␣” five times before showing “the letter x”.
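For contrast, here is a minimal sketch of the same look-ahead without the catcode change, wrapped in a LaTeX document so it can be compiled on its own. The macro names \test, \testy and \token are the same throwaway names as above. Because the spaces still have catcode 10 when \futurelet looks ahead, the skip_blanks state discards them, and \show should report the letter x right away:

```latex
\documentclass{article}
\begin{document}
% Same look-ahead as above, but spaces keep their usual catcode 10.
\def\test{\futurelet\token\testy}
\def\testy{\show\token}
% The five spaces after \test are skipped, so \token is the letter x.
\test     x
\end{document}
```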

Now back to the problem at hand: update your LaTeX :-)

The old behaviour of \verb was to look at the next token, whichever it happened to be, and use that as a delimiter (given the exception of {). This has now been fixed for the 2020-10-01 LaTeX release (from LaTeX News Issue 32):

Avoid problematic spaces after \verb

If a user typed \verb␣!~!␣foo instead of \verb!~!␣foo by mistake, then surprisingly the result was “!~!foo” without any warning or error. What happened was that the ␣ became the argument delimiter due to the rather complex processing done by \verb to render verbatim. This has been fixed and spaces directly following the command \verb or \verb* are now ignored as elsewhere. (github issue 327)
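To find out which behaviour a given installation has, one option is to compare the format date against 2020-10-01. The following is only a sketch: it relies on the internal kernel helper \@ifl@t@r and on \fmtversion, and the messages written to the log are my own wording, not anything LaTeX prints by itself.

```latex
\documentclass{article}
\makeatletter
% \@ifl@t@r{<date 1>}{<date 2>}{<true>}{<false>} takes the <true> branch
% when <date 1> is not earlier than <date 2>.
\@ifl@t@r\fmtversion{2020/10/01}
  {\typeout{Format \fmtversion: spaces after \string\verb\space are ignored.}}
  {\typeout{Format \fmtversion: a space after \string\verb\space becomes the delimiter.}}
\makeatother
\begin{document}
\verb |asdf|asdf asdfasdf
\end{document}
```

With a new enough format the typeset line should match what \lstinline gives in the question; with an older one the space is taken as the delimiter.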

edited Oct 09 '20 at 22:22

answered Oct 09 '20 at 15:56

Phelype Oleinik


Nice (+1). But the common lore (e.g. this answer, but I remember many similar statements) has always been something along the lines of "the non-catcode11 character that was used to terminate the csname scan is returned to the input stream (as a character, still untokenised) unless it is a catcode 10 space character in which case it is discarded". I guess the skipping blanks state is more complicated than it looks. – campa Oct 09 '20 at 16:03

@campa It's TeX, of course it's more complicated than it looks :-) Yes, what you say is correct, but at a much lower level. In §354 (Scan a control sequence...), after seeing a catcode-0 character TeX will call Scan ahead in the buffer... (§356) looking at the following characters in the input stream, but not assigning category codes to any of them. The input buffer is a string, and TeX walks a pointer along that string until finding a non-letter character; when done, it backs up that pointer, tokenises the control sequence, and proceeds, with no memory of seeing the character after. – Phelype Oleinik Oct 09 '20 at 16:21
Oh, I see. The question is then: when are space (tokens? characters?) actually discarded? I'll have to click the "Ask Question" button... :-). – campa Oct 09 '20 at 16:24
@campa Within TeX-the-program itself, you have access to the entire line of input (even commented stuff) by looking at the buffer variable. Within the program you can do print(buffer(0)) to print the first character and print(buffer(limit)) to print the last one, as you would in Pascal or C or whatever, without any effect at all on the tokenisation process. But when reading a document, TeX is careful enough to look at the buffer in order, turn each character into the proper token giving them meaning, and executing them as appropriate, while still having all of it at hand in buffer. – Phelype Oleinik Oct 09 '20 at 16:31
Ah, I always thought the spaces were discarded at the moment of tokenization of the control word. – schtandard Oct 09 '20 at 20:02
Also, very nice timing with the recent change. – schtandard Oct 09 '20 at 20:03
Sadly I'm too lousy a programmer to understand this stuff... I should keep answering more or less trivial questions :-) – campa Oct 09 '20 at 20:10

frougon · Answer 2 · 2020-10-09T19:15:35.717

I believe what happens is as follows:

\verb is first tokenized (the space character, which has catcode 10 just before \verb is tokenized, marks the end of this control word but is not discarded).
TeX will go into state S, since \verb is a control word (control sequence whose name is made of “letters” only), but it doesn't skip blanks yet.
\verb is expanded and code from its expansion is executed. This code first gives spaces the catcode 12 (via \let\do\@makeother \dospecials), this is important.
At the end of \verb's replacement text, there is \@ifstar\@sverb\@verb. This \@ifstar looks ahead in the input, thus the state S kicks in. Since spaces have catcode 12 at this point, the space character following \verb is not skipped. It gets tokenized with catcode 12.
Since we used the no-star form of \verb and \@verb is defined as \def\@verb{\@vobeyspaces \frenchspacing \@sverb}, spaces are now made active, and \@sverb is expanded (so, the end delimiter will be a catcode-13 space, while the start delimiter was a catcode-12 space).
\@sverb grabs the catcode-12 space token as its only argument and defines active spaces to be \let-equal to \verb@egroup (if \verb* had been used, \@sverb would have done \@setupverbvisiblespace \@vobeyspaces too; thus, spaces end up active in all cases). This is how the verbatim text will end in non-erroneous conditions: \verb@egroup will yield \egroup, which will terminate the group started by \verb (there is a \bgroup in \verb's replacement text). Since the special catcode setup has been done locally inside this group, this terminates the special catcode setup.

Thus, the sentence from the question “This macro then grabs everything up to the next occurrence of that token” is not really correct: there is no grabbing of the verbatim contents as an argument. Tokens between the start and the end delimiters are simply processed as catcode-12 tokens, except space tokens which are always active at the end of \@sverb, as we've seen.
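The steps above can be sketched with a stripped-down clone of the old \verb. This is not the kernel code: \myverb and \my@sverb are made-up names, the star form, \@noligs and the end-of-line error check are left out, and the delimiter is made \let-equal to \egroup directly instead of going through \verb@egroup.

```latex
\documentclass{article}
\makeatletter
% Hypothetical, simplified clone of the pre-2020-10-01 \verb.
\def\myverb{%
  \leavevmode
  \bgroup                          % keeps the catcode changes local
    \let\do\@makeother \dospecials % specials, including the space, get catcode 12
    \ttfamily
    \my@sverb}
\def\my@sverb#1{%                  % #1 is the next token, whatever it is
  \catcode`#1\active               % make the delimiter character active ...
  \lccode`\~`#1%
  \lowercase{\let~\egroup}}        % ... and let it close the group
\makeatother
\begin{document}
% Usual case: | is the delimiter, a_b%c is typeset verbatim.
\myverb|a_b%c|xyz

% A space after \myverb: the catcode-12 space becomes the delimiter,
% so the verbatim text runs up to the next space.
\myverb |asdf|asdf asdfasdf
\end{document}
```

Note that nothing is grabbed as a delimited argument here: the text is simply read under the modified catcodes until the active delimiter character closes the group. Since this sketch has no end-of-line check, the delimiter has to reappear on the same line.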

Note: as Phelype Oleinik pointed out, the behavior of \verb was changed in the LaTeX format from 2020-10-01. My comments here are based on LaTeX2e <2020-02-02> patch level 5.
