
This question refers to LuaTeX-engines prior to luatex 1.10.0 in texlive 2019.
With luatex 1.10.0 in texlive 2019 and later the \string-primitive works as expected.


Where can I find exact documentation of the \string-primitive in LuaTeX?

I ask this question because I encounter some unexpected (at least by me) behavior when compiling the example below with LuaTeX.

With ordinary TeX, \string<control sequence token> will deliver a sequence of explicit character tokens of catcode 12 (other) denoting the name of the control sequence token in question. Exception: Characters whose number/character code is 32 (space) will be of catcode 10 (space), i.e., they will be explicit space character tokens. (Control sequence tokens whose names contain spaces can either be the control symbol token \  (control space) or be constructed via \csname..\endcsname.)
In case the value of the integer parameter \escapechar is in the range 0..255, an explicit character token of catcode 12 (other) whose charcode equals the value of \escapechar will precede that sequence of explicit character tokens. Exception: In the case of \escapechar's value being 32, the catcode of the preceding explicit character token will be 10 (space), i.e., in this case the preceding explicit character token will be an explicit space character token.
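
The rule just described can be seen in a minimal plain TeX sketch. It assumes an 8-bit engine (Knuth's tex or pdftex); the name \foo is arbitrary and chosen only for illustration. \message is used so that the result of \string shows up on the terminal and in the .log file.

```tex
% Sketch for 8-bit engines (tex, pdftex); \foo is an arbitrary, undefined name.
% \string does not need \foo to be defined: it only looks at the token's name.
\escapechar=92 \message{[\string\foo]}% terminal/log shows: [\foo]
\escapechar=66 \message{[\string\foo]}% terminal/log shows: [Bfoo]
\escapechar=-1 \message{[\string\foo]}% terminal/log shows: [foo]
\escapechar=256 \message{[\string\foo]}% terminal/log shows: [foo] (256 is outside 0..255)
\escapechar=92 % restore the usual backslash
\bye
```

The last two cases are the ones that matter here: for a negative value and for a value above 255 no escape character token is produced at all.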

It seems that with XeTeX the upper bound of the \escapechar-range within which \string will also deliver that preceding explicit character token is not 255 but 1114111.

But with LuaTeX I get one kind of unexpected behavior for \escapechar-values from 1114112 to 1114239 (=1114111+128) and other kinds of unexpected behavior for \escapechar-values from 1114240 to 2097151:

(Of course the following example needs to be compiled with LuaTeX.)

\def\grab#1#2#3{1:#1;2:#2;3:#3;}
\def\grabteststring#1{\begingroup\escapechar=#1 \expandafter\endgroup\expandafter\grab\string\XXX AA}%
  1. \grabteststring{-7} % Expected behavior occurred: No escapechar attached.

  2. \grabteststring{0} % Expected behavior occurred: Visible escapechar attached.

  3. \grabteststring{66} % Expected behavior occurred: Visible escapechar attached.

  4. \grabteststring{1114111} % Expected behavior occurred: Invisible escapechar attached.

  5. \grabteststring{1114112} % Unexpected behavior occurred: Visible escapechar attached.
                              % (I would have expected: No escapechar attached)

  6. \grabteststring{1114239} % Unexpected behavior occurred: Visible escapechar attached.
                              % (I would have expected: No escapechar attached)

  7. \grabteststring{1114240} % Unexpected behavior occurred:
                              % - .log file: !String contains an invalid utf-8 sequence
                              % - Two of the X seem silently gobbled.
                              % (I would have expected: No escapechar attached)

  8. \grabteststring{2097151} % Unexpected behavior occurred:
                              % Seems the \string-primitive delivers the phrase:
                              % warning (print): bad raw byte to print (c=983939), skipped
                              % (I would have expected: No escapechar attached)

  9. \grabteststring{2097152} % Expected behavior occurred: No escapechar attached.

\bye

Here comes a screenshot of the resulting .pdf-file:

[Image: screenshot of the resulting test.pdf]

Here comes the .log-file:

This is LuaTeX, Version 1.0.0 (MiKTeX 2.9.6210 64-bit)  (format=luatex 2017.3.1)  22 JUN 2018 01:05
 restricted system commands enabled.
**test.tex
(./test.tex
! String contains an invalid utf-8 sequence.
€grabteststring ...expandafter grab string XXX 
                                                  AA
l.28 7. \grabteststring{1114240}
                               % Unexpected behavior occurred:
A funny symbol that I can't read has just been (re)read.
Just continue, I'll change it to 0xFFFD.

[1{XXXXXXXXXXXXX/MiKTeX/2.9/pdftex/config/pdftex.map}
Missing character: There is no ô¿¿ (U+10FFFF) in font cmr10!
Missing character: There is no � (U+FFFD) in font cmr10!
])<XXXXXXXXXXXXX/MiKTeX 2.9/fonts/type1/public/amsfonts/cm/cmr10.pfb>
Output written on test.pdf (1 page, 18202 bytes).

PDF statistics: 10 PDF objects out of 1000 (max. 8388607)
 0 named destinations out of 1000 (max. 131072)
 1 words of extra memory for PDF output out of 10000 (max. 100000000)


In other words:

Both with XeTeX and with ordinary TeX, the \string-primitive will, when "stringifying" a control sequence token, produce a preceding escape character token if and only if the value of the integer parameter \escapechar is within the range of the input encoding, i.e., within 0(dec)..255(dec) = 0(hex)..FF(hex) for 8-bit encodings and within 0(dec)..1114111(dec) = 0(hex)..10FFFF(hex) for UTF-8 encoding. For any other value it silently produces no preceding escape character token. Thus with XeTeX and ordinary TeX the behavior of the \string-primitive is well-defined for every value that can be assigned to the \escapechar integer parameter.
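
One way to check this rule on a given engine is the following minimal plain TeX sketch. The macro names \probe, \withesc and \bare and the control sequence \X are made up for this illustration. The probe compares \string\X with a lone catcode-12 X, so it reports whether an escape character token was produced without having to print that token.

```tex
% Hypothetical probe; run with tex, pdftex, xetex or luatex.
\def\probe#1{%
  \begingroup
    \escapechar=#1
    \edef\withesc{\string\X}% name of \X, preceded by the escape char if one is produced
    \edef\bare{\string X}%    a single X of catcode 12, nothing in front of it
    \message{escapechar=#1: \ifx\withesc\bare no escape character\else escape character present\fi}%
  \endgroup
}
\probe{-1}  % 8-bit engines: no escape character
\probe{92}  % 8-bit engines: escape character present
\probe{255} % 8-bit engines: escape character present
\probe{256} % 8-bit engines: no escape character; Unicode engines should treat 256 as in range
\bye
```

Since \string X yields the character with catcode 12, the \ifx test is true exactly when \string\X delivered nothing but the name. Values around the upper bound of the Unicode range, such as 1114111 and 1114112, can be passed to \probe in the same way.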

Intuitively I'd expect the same behavior with LuaTeX, as I could not find any user manual stating that with LuaTeX the behavior of the \string-primitive deviates significantly from the behavior of the \string-primitives of other TeX engines. But with LuaTeX the \string-primitive behaves differently for \escapechar-values larger than the upper bound of the range given by the UTF-8 input encoding.

So the questions are:

  1. What exactly does the LuaTeX-\string-primitive do for which \escapechar-values?

  2. Is this a bug in LuaTeX?

Ulrich Diez
  • I believe that the implementation of \string in the core is sprint_cs. – Henri Menke Jun 22 '18 at 01:18
  • As you are setting \escapechar outside of the UTF-8 range, I suspect the behaviour here is simply undefined. (LuaTeX manual section 1.3.3 'Extended ranges' is clear that values up to "10FFFF are valid for a range of primitive inputs.) – Joseph Wright Jun 22 '18 at 05:55
  • also relevant is luatex's use of characters just above the unicode range to write raw bytes to the output streams – David Carlisle Jun 22 '18 at 07:09
  • @JosephWright Of course values larger than 1114111 (dec) = 10FFFF (hex) are outside the UTF-8 range. So are negative values. With ordinary TeX with 8-bit encodings one can set \escapechar both to negative values and to values > 255 and thus outside the range of 8-bit encodings. With ordinary TeX (and with XeTeX), in case of the value being outside the range of the encoding, no escape character is produced. I'd expect the same behavior with LuaTeX. What is the exact behavior of the LuaTeX-\string-primitive with what \escapechar-values? – Ulrich Diez Jun 22 '18 at 07:43
  • @UlrichDiez Why would you expect to be able to use values above the top of the range? It's not mentioned in the LuaTeX manual at all, so the behaviour is simply undefined. (It does allow you to set \escapechar=-1, which is the only legitimate use of an out-of-range value that I know of.) – Joseph Wright Jun 22 '18 at 08:09
  • @JosephWright I expect that because I did not find any statement that with LuaTeX-engines the behavior of the \string-primitive would significantly deviate from the behavior of the \string-primitive of other engines. With other engines behavior of \string-primitive is well defined also for \escapechar's value not being within the range of the input-encoding in question: Silently no preceding escapechar-character token gets produced in such cases. With LuaTeX I get weirdness, e.g., with tests 7 and 8 of my example. – Ulrich Diez Jun 22 '18 at 08:15
  • @HenriMenke What programming language is this? C++? (Oh no... I already had to refresh my knowledge of PASCAL when reading TeX: The Program...) ;-) – Ulrich Diez Jun 22 '18 at 08:27
  • I think it's bad to not have a well determined behavior when \escapechar is outside the Unicode range, but LuaTeX never claims to be compatible with other TeX engines in all respects. – egreg Jun 22 '18 at 08:40
  • @egreg Actually I do not insist on compatibility. But I did not find any statement in any user manual about LuaTeX's \string-primitive deviating from the \string-primitives of other engines in the respect that with LuaTeX values can be assigned to \escapechar for which the behavior of the \string-primitive is not well defined. As I did not find such a statement, while deliberate changes of behavior are usually well explained, I suspect/fear that this, eh, differing behavior is not one of the aspects that came about deliberately. – Ulrich Diez Jun 22 '18 at 09:08
  • @UlrichDiez Indeed! There have been other cases where the developers simply had overlooked something. – egreg Jun 22 '18 at 09:09
  • @egreg So for me the crucial question is: Is it a bug/something worth drawing the developers' attention to, or am I overlooking something? ;-) – Ulrich Diez Jun 22 '18 at 09:11
  • @UlrichDiez The usual strategy is to ask on the LuaTeX mailing list and point out the issue. – egreg Jun 22 '18 at 09:14
  • @JosephWright I just found in TeXbook's answer to exercise 7.5 (TeXbook is about an engine with 8bit-encoding, range 0..255) that you can obtain a single explicit catcode-12-backslash-character token via \string\\ if \escapechar's value is negative or greater than 255. This implies that \string is not to produce an escape character in case the value of \escapechar is outside the range given by the encoding. – Ulrich Diez Jun 24 '18 at 16:01

1 Answer


with luatex 1.10.0 in texlive 2019 you get the expected behaviour

[Image: screenshot of the resulting PDF page]

with the single log warning

Missing character: There is no  (U+10FFFF) in font cmr10!
David Carlisle