14

What is a new line for TeX in the following contexts:

  1. When reading from a file.
  2. When writing to a file.
  3. After having read a % character.
  4. In a \scantokens.

I am asking in particular because the following code only typesets A:

\documentclass{minimal}
\begin{document}
\catcode`\%=12
\def\foo{\scantokens{A%

B}}
\show\foo
\catcode`%=14
\foo
\end{document}

So my main question is: how does % know where to stop gobbling characters?

EDIT: Adding the two lines

\catcode`\^^M=12 
\newlinechar`\^^M

before the definition of \foo is instructive: then the definition actually contains new-lines, and the comment stops gobbling where we expect.

EDIT2: pdflatex sets \newlinechar`\^^J and \endlinechar`\^^M (see Harald's concise answer below for what these are).

user202729
  • Re your edit: Yep. The e-TeX manual says, "In particular every occurrence of the current newline character is interpreted as start of a new line, and input characters will be converted into tokens as usual." So that takes care of \scantokens. Changing the catcode deals with the tokenization of the replacement text of \foo. – TH. Jan 17 '11 at 13:28
  • @TH.: thank you for your longer explanation: I accepted Harald's answer rather than yours because I think that it will be more directly useful to other people wanting a quick answer. The two complement each other quite well. – Bruno Le Floch Jan 17 '11 at 16:40
  • See also the question "Use of \everyeof and \endlinechar with \scantokens" on TeX - LaTeX Stack Exchange (\scantokens is the most common place where you'll encounter the difference between ^^J and ^^M). – user202729 Dec 27 '21 at 03:01

3 Answers

10

So you asked several questions, but let me first answer your main question.

What happens here is that when the definition for \foo is being parsed, it's identical to

\def\foo{\scantokens{A%\par B}}

Now when \scantokens executes, it's as if it read the line

A%\par B

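from a file with the current catcodes in effect. The blank line was already turned into a \par token when \foo was defined, so \scantokens sees a single line. Since you have reset % to be a comment character by then, everything after it on that line (the \par B) is ignored.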
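As a minimal plain TeX sketch of how to keep the B (assuming an e-TeX-enabled engine such as pdftex, and assuming that ^^J is chosen as the \newlinechar; both assignments at the top are my own setup, not part of the question's code), one can put an explicit \newlinechar character where the line break should be:

```tex
% Setup (assumption): ^^J is an ordinary character and is the \newlinechar.
\catcode`\^^J=12
\newlinechar=`\^^J
% No trailing comments on the next two lines: % is an ordinary character there.
\catcode`\%=12
\def\foo{\scantokens{A%^^JB}}
\catcode`\%=14 % from here on, % starts a comment again
\foo
\bye
```

Here \scantokens should see two lines, "A%" and "B", so the comment only swallows the rest of the first line and both letters are typeset.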

For your other questions: each TeX implementation decides what counts as a line break when reading and writing files. If I recall correctly from reading through the pdfTeX source recently, \n, \r, \r\n, and \n\r are all treated as a line break on input. For output (i.e., writing files), I suspect it uses \n on *NIX and \r\n on Windows, but I haven't verified this.

After TeX reads a line of text from an input file—and before it begins to tokenize it—it removes all trailing space characters including \r and \n and appends the \endlinechar character which is normally ^^M (i.e., \r). This happens regardless of there being a % character in the line. When TeX encounters a % character in its input (note that there is no comment token), it ignores the rest of the line, including the trailing \r.
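The following plain TeX sketch, adapted from the example given in the comments under this answer, is one way to observe that the \endlinechar is appended before tokenization, even on a line that contains a % character. The macro name \foo and the choice of Z are only for illustration:

```tex
\def\foo{\catcode`\%=12 }
% The space before the % below ends the character constant,
% so the assignment takes effect before the next line is read.
\endlinechar=`Z %
{\foo %This line contains no comment.}
\endlinechar=13 %
\bye
```

When the third line is read, TeX cannot yet know that % will be an ordinary character by the time it is tokenized, yet a Z should still appear at the end of the typeset text. The group keeps the catcode change local, so the % on the following line is a comment again and discards the Z appended there.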

As far as I know, ^^J (i.e., \n) is not special in most contexts except it is often used as the \newlinechar for use in \write.

I forgot \scantokens. It is really treated like lines of input from a file, including the \endlinechar at the end of each. For a simple example of this, try

\endlinechar`X %
\scantokens{A}%
\bye%

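(The space after the X is necessary because TeX looks for an optional space after the character constant; without it, TeX would expand \scantokens on the next line before the assignment takes effect.) Every line ends with a percent sign, and yet the output shows AX because \scantokens has inserted that character.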

TH.
  • Actually, TeX does not insert the \endlinechar if there was a comment character on the line. Neither does \scantokens, if there is one in its argument. – Harald Hanche-Olsen Jan 14 '11 at 21:25
  • @Harald: TeX can't know if there is a comment in the line until it tokenizes the line. Consider \def\foo{\catcode`X14 } and then on the next line \foo This has a comment. X this isn't shown. You can see this in tex.web. Look for else buffer[limit]:=end_line_char; – TH. Jan 14 '11 at 22:06
  • @TH: Yes, but how does that support your claim that TeX inserts an \endlinechar character at the end of a line even if there is a comment on the line? A simple experiment reveals that this is not so. – Harald Hanche-Olsen Jan 14 '11 at 22:22
  • @Harald: It supports my claim because the code shows that it happens! Of course it doesn't appear in the output because TeX throws away the rest of the buffer, including the \endlinechar, when it encounters the comment. Maybe changing my example will make this more clear: \def\foo{\catcode%12 }\endlinechar"5a and on the next line: \foo %This line contains no comment. If TeX did not insert the \endlinechar in lines that had comment characters at the time the line was read, there would be no Z at the end of the output line. – TH. Jan 14 '11 at 22:34
  • Bah. Markdown ate my first backtick. Make that \def\foo{\catcode"25 12 }\endlinechar"5a – TH. Jan 14 '11 at 22:36
  • @TH: Ah, now I see what you are saying. The point being that the whole line is read, and the \endlinechar is appended, before any tokenization even begins. – Harald Hanche-Olsen Jan 14 '11 at 22:43
  • @Harald: I see the confusion. Hopefully that edit clarified it a bit. Thanks! – TH. Jan 14 '11 at 23:24
  • @TH.: +1, great explanation. For markdown in comments: If your backtick (in inline code) is followed by a letter, then it works, but here it's followed by a backslash, so you need to escape it with a backslash: \def\foo{\catcode`\%12 }\endlinechar"5a – Hendrik Vogt Jan 15 '11 at 10:10
  • @TH: Excellent. It is quite amusing to assign a letter to \endlinechar and watch things break as \foo at the end of the line becomes \fooX, which is undefined. But in the resulting error message you can't see that. – Harald Hanche-Olsen Jan 15 '11 at 10:15
  • @Harald: More aggravating, I'd say. It took a bit to figure out why it was complaining about \bye. The error message could be better. – TH. Jan 15 '11 at 20:58
7
  1. I think that what constitutes the end of line upon reading from a file is hardcoded according to whatever operating system you are running on. And that end of line is represented by the character whose number is \endlinechar.
  2. When writing, the character whose number is \newlinechar will trigger the end of a line. Again, the exact result in the output file is hard coded, depending on your operating system.
  3. See #1.
  4. Usually, the argument to \scantokens is treated as a single line. Thus a percent sign in the argument to \scantokens will end input from this argument. However, any occurrences of the character whose number is \newlinechar will be used to split the argument into several lines.

To bring all these ideas together, consider the plain TeX file

\newlinechar=2
{\catcode`\%=12
 \gdef\foo{\scantokens{abc%xyz^^Bdef}}}%
\endlinechar=`X
\foo%
\bye%

which will typeset the text “abcdefX”.

(Edited to take into account what I learned about #4 from the comments.)

  • It's not that \scantokens treats its input as a single line, it's the \def that removed the new lines (which makes sense because it tokenized them). – TH. Jan 14 '11 at 21:00
  • @TH: I suppose you could put it that way. The way I see it, the point is that there is no token that corresponds to a newline. A person might naïvely think that \endlinechar would be that token, but you can put one in the middle of an input line (usually as ^^M), so that's not it. – Harald Hanche-Olsen Jan 14 '11 at 21:09
  • @Harald: Actually, when TeX encounters any character of cat code 5 (end of line), it throws away the rest of its buffer and does the normal stuff it does when it encounters that character (which depends on it being in state N, M, or S; see my blog post about this). – TH. Jan 14 '11 at 22:08
  • Ah, I had forgotten all about catcode 5. It must be because I have never seen a use for it. – Harald Hanche-Olsen Jan 14 '11 at 22:37
  • @Harald: I realized that setting \catcode`\^^M=12 \newlinechar`\^^M before the definition works: then \scantokens sees its argument as several lines. – Bruno Le Floch Jan 17 '11 at 12:55
  • To explain the \newlinechar part more clearly, see the answer to "Use of \everyeof and \endlinechar with \scantokens" on TeX - LaTeX Stack Exchange and how \newlinechar works with \write. (For debugging, it's convenient to replace the \scantokens with a \write to a temporary file, then read that file to check that its content is correct.) – user202729 Dec 27 '21 at 03:16
  • The first paragraph is actually wrong. TeX will discard the possible “end-of-record” byte(s) (OS dependent) and replace them with the character with code \endlinechar, after discarding whatever remains on the line and possible trailing spaces (with code 32). If \endlinechar does not point to an actual code (negative or larger than the character set, so 255 for TeX), nothing is added, but the “end-of-record” byte(s) is/are discarded anyway. (continued) – egreg Mar 24 '22 at 18:00
  • (continues) Your description might be misleading, although the general effect is the same. For instance, TeX Live implementations don't really rely on the OS and guess the “end-of-record” from the file itself. – egreg Mar 24 '22 at 18:00
  • @egreg So I guess an amendment is in order. But I don't want to go into too much detail. The OS dependency is a thing of the past, I guess, from operating systems with a record oriented file system where the notion of end-of-line byte(s) doesn't even exist. But I am confused by “discarding whatever remains on the line”. After an end-of-line character, how can anything remain on the line? Does that not start a new line? – Harald Hanche-Olsen Mar 26 '22 at 13:38
0

The important points here are:

  • \scantokens behaves "like" \write.
  • \write writes a physical newline character when it sees ^^J (\newlinechar).

Therefore, \scantokens{A ^^J B} reads two separate lines, A and B, as if they were \input from a file (which is usually what you need).

This does not happen with \scantokens{A ^^M B}, \scantokens{A \par B}, etc., regardless of what your operating system uses to delimit lines.

(This assumes that ^^J and ^^M have catcode "other" here.)
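Because \scantokens splits lines the same way \write does, a quick way to preview what it will read is to send the same argument to the terminal. This is only a plain TeX sketch; the two assignments at the top are assumptions that make ^^J an ordinary character and the \newlinechar, as LaTeX formats normally do:

```tex
% Setup (assumption): ^^J is an ordinary character and is the \newlinechar.
\catcode`\^^J=12
\newlinechar=`\^^J
% Shown as two lines in the terminal and log: the ^^J becomes a line break.
\immediate\write16{A ^^J B}
% Shown as a single line: \par is written out literally as "\par ".
\immediate\write16{A \par B}
\bye
```

Whatever appears as one line in the log is what \scantokens would treat as one line of input, so a % with comment catcode would discard the rest of that line only.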

Side notes:

  • By default (outside \ExplSyntaxOn region), if you do e.g.

    \catcode `\^^M = \the \catcode `[ \relax
    \def \a {
    }
    

    or

    \def \a {\
    }
    

    then you'll "see" that the newline always corresponds to ^^M (char code 13), not ^^J, which may lead to confusion if you don't know the exact behavior and only guess.

    (that's because \endlinechar is 13.)

  • On UNIX systems, for example, if \newlinechar is not ^^J, writing ^^J will still generate a physical newline in the output file; however, \scantokens will not regard it as a line break.

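  • Catcode 5 ("end of line") is an entirely different thing: a character with that catcode ends the current line. It becomes a space token in the middle of a line and a \par token when nothing but spaces has been seen on the line so far, which is why two of them in a row produce a \par.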
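To see that catcode 5 is a property of the character and not of the physical line break, here is a small plain TeX sketch in which | is given catcode 5 (the choice of | is arbitrary and only for illustration):

```tex
% Give | the "end of line" catcode (arbitrary choice for this demo).
\catcode`\|=5
a|everything after the bar on this line is dropped
b
\bye
```

The | in the middle of the line should act like a line end: the rest of the line is thrown away and a space is produced, so the typeset result should be "a b".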

user202729