Expandable way to tell apart a character token and an equivalent control sequence

Question

As the title says: is there any completely reliable, expandable method to differentiate between a control sequence that has been made equivalent to a character token and the character token itself? (Assuming the meaning of particular primitives has not been changed; otherwise all bets are clearly off.) For example, consider the following (staying with plain TeX to keep superfluous elements to a minimum):

\font\ecrm=ecrm1000 \ecrm
\long\def\test#1#2{\hbox{%
    \vbox{\hsize=15em\escapechar=`\\ \tt\detokenize{#1}}%
    \vbox{#2}\par%
}}
\long\def\testexp#1{\test{#1}{#1}}
\long\def\testif#1{\test{#1}{#1\else not \fi equal}}
\long\def\exec#1{{\escapechar=`\\ \tt\detokenize{#1}}#1\par}
\begingroup
\exec{\let\^=^}
\exec{\escapechar=-1}
\bigskip
\testexp{\meaning^}
\testexp{\meaning\^}
\testexp{\string^}
\testexp{\string\^}
\testexp{\detokenize{^}}
\testexp{\detokenize{\^}}
\bigskip
\testif{\if^\^}
\testif{\if^\noexpand\^}
\testif{\ifcat^\^}
\testif{\ifcat^\noexpand\^}
\testif{\ifx^\^}
\bigskip
\exec{\def\tmpa{^}\def\tmpb{\^}}
\testif{\ifx\tmpa\tmpb}
\endgroup
\bye

(For convenience I do use the ε-TeX primitive \detokenize here, but that is not the primary point, hence the tex-core tag.)

I have found no way to make any of the conditionals recognize the difference between the two tokens: the actual character token and the control sequence. (I knew that these were not supposed to differentiate between them, but went and tested them just in case.) If I additionally assume that \escapechar is not printable, as is the case here, then not even \meaning, \string, or \detokenize will treat them any differently.

That said, the TeX engine clearly does know the difference, as when these are made to be the replacement text of macros, \ifx recognizes them as not the same. Likewise when they are set as delimiters of macro arguments. However, if we want an expandable comparison of the two tokens as, say, the arguments of a macro, then this is not something we can do. Neither can we change the value of \escapechar: if it is not printable, we are stuck with it.

Is there a way to make the comparison in an expandable manner, one I have not thought of? If not in (e)TeX, then maybe in XeTeX or LuaTeX?

P.S. Another, somewhat related, weird corner case of tokens that I find hard to tell apart: the control sequence with the empty name and (assuming, say, the normal escape character) the one with the name csname\endcsname. This is a situation where I really fail to comprehend why TeX goes out of its way to change the rules for that one special case, messing up the injectivity of \string on control sequences. I know that LuaTeX’s \csstring does kind of solve that, though.

"the actual character token and the macro." Note if you define \def\xx{a} then \xx is a macro expanding to a but if (as here) you define \let\xx=a then xx is not a macro. So the question in your title is misleading — David Carlisle, Dec 01 '22 at 11:45
@DavidCarlisle Fair point, I used the wrong phrase: changed “macro” to “control sequence”. — Taederias, Dec 01 '22 at 11:51
I believe it's impossible. Of course LuaTeX can, refer to Can the Lua part of LuaTeX know about tokens? - TeX - LaTeX Stack Exchange. — user202729, Dec 01 '22 at 11:58
(Unrelated remark: I usually just blame this kind of weirdness of the TeX engine on Knuth never needing the features and then wanting to keep the engine stable, so some programming tasks are prohibitively difficult. E.g., to get the hyphenation pattern of a word, the soul package typesets it in a monospaced font, and it fails if the font does not have some characters in the word.) — user202729, Dec 01 '22 at 12:01
That said, there's the "fork" trick, see e.g. https://tex.stackexchange.com/a/530557/250119 ... okay, David already posted that below. — user202729, Dec 01 '22 at 12:03
@user202729 Yeah, I guess \directlua kind of cuts through the Gordian Knot in one fell swoop in the case of LuaTeX. The flip side to the weirdness of TeX is that solving a seemingly (but not actually) simple task with some mind-boggling combination of conditionals and other stuff after racking your brain on it can be oddly satisfying. — Taederias, Dec 01 '22 at 12:53

score 7 · Accepted Answer · edited Dec 27 '22 at 05:37

David's solution expects that you compare something with the fixed token ^. You cannot redefine \testifhatx for a different token at expansion level only. My solution works with two arbitrary tokens, not only with ^ and another one.

\long\def\testeq#1#2{%
    \ifx#1#2%
       \testsingleletter#1\iftrue
           \testsingleletter#2\iftrue TRUE%
           \else FALSE%
           \fi
       \else \testsingleletter#2\iftrue FALSE%
           \else TRUE%
           \fi
       \fi
    \else FALSE
    \fi
}
\long\def\testsingleletter#1{\expandafter\testsingleletterA\string#1\end}
\long\def\testsingleletterA#1#2\end{\ifx&#2&\else\expandafter\unless\fi}
\let\^=^
\testeq ^ ^   % prints TRUE
\testeq \a ^  % prints FALSE
\testeq ^ \^  % prints FALSE
\bye

Edit: I created a second version that can process tokens such as \fi in the arguments, as per your comment.

\long\def\splitif#1{#1\expandafter\ignoresecond\else\expandafter\usesecond\fi}
\long\def\usesecond#1#2{#2}  \long\def\ignoresecond#1#2{#1}
\long\def\isequal#1#2{\splitif {\ifx#1#2}%
   {\splitif {\testsinglechar#1\iftrue}%
      {\splitif{\testsinglechar#2\iftrue}{1}{0}}%
      {\splitif{\testsinglechar#2\iftrue}{0}{1}}}%
   {0}%
}
\long\def\testsinglechar#1{\expandafter\testsinglecharA\string#1\end}
\long\def\testsinglecharA#1#2\end{\ifx&#2&\else\expandafter\unless\fi}
\long\def\test #1#2{\ifnum\isequal#1#2>0 TRUE\else FALSE\fi}
\let\^=^
\test ^ ^   % prints TRUE
\test \a ^  % prints FALSE
\test ^ \^  % prints FALSE
\bye

answered Dec 01 '22 at 12:53 by wipet · edited Dec 27 '22 at 05:37 by jarnosc

Yeah, something like this would work in most cases. That said, it does suffer from what I highlighted in my question, the fact that in the case of non-standard \escapechar values (e.g. -1, and in this case, I believe, 32) \string will not help distinguish the two at all. Also, feeding a \fi or other such things will break the conditionals horribly, so the usual practice of \expandafter-ing something like LaTeX’s \@firstoftwo etc. out of the conditionals before putting anything with macro parameters would probably be necessary. – Taederias Dec 01 '22 at 13:28
Setting a non-standard \escapechar is a very bad idea. Definitely. And, of course, if you want to process tokens like \fi then you must do something like \firstoftwo. But it is no problem to do this. – wipet Dec 01 '22 at 16:55
Oh, I am quite aware that is not something to do normally. What this came from was a mainly theoretical exercise of “Can you write a macro that takes two tokens as arguments and determines their equality expandably, when the one using the macro may not directly break TeX or your macros (by redefining primitive control sequences or your own helper macros), but can otherwise be as evil as possible?” – Taederias Dec 01 '22 at 22:34
Thank you for going out of your way to present a version that deals with the \fi issue. I suggested an edit to additionally make defs \long so that \par tokens are allowed too. – Taederias Dec 02 '22 at 00:01
I am also kind of debating in myself which answer to accept. As it stands — given that I already mentioned dealing with the \escapechar=-1 case in my question — I am leaning toward accepting Ulrich’s as it quite thoroughly covers what to do and be aware of in regards to my intent here, and the actual code I can write myself. But your answer provides concrete code that works in most sensible situations. Not really sure what the best choice is though. Honestly, all three answers (yours, Ulrich’s and David’s) add something important; pity I can accept only one. – Taederias Dec 02 '22 at 00:04
@Taederias I suggest unaccepting my answer and accepting wipet's instead: wipet's answer is of practical use while mine is just some sort of moot theoretical babbling. I have fun dealing with all sorts of strange borderline cases. If he wanted to, wipet could easily outdo me at it. But in practice, this makes things massively more difficult to handle, and wipet works in a practice-oriented way. You can learn from him how to handle things sensibly in practice, so that the cases you need in practice are handled by code that is still manageable a few years after you wrote it. – Ulrich Diez Dec 02 '22 at 01:22
@UlrichDiez Very well then. I was kind of unsure on what to do here in terms of balancing what I expected to find out and what others would generally find useful; seeing as I already had working code for sensible scenarios and your answer brought my attention to some stuff I did not think of, but generally people would not need to deal with the weird corner cases that occur in settings deliberately engineered to make them possible. – Taederias Dec 02 '22 at 01:30
@Taederias Exactly. Accepting an answer leads to other readers seeing that answer first. I had the impression that you are one of those who like to look at everything thoroughly in order to cope even with rather strange peculiarities of TeX. ;-) For people interested in intriguing details of how TeX works, some parts of my answer might be of interest. For them, my answer may be a nice addition. But the majority who post/ask/search here want practical solutions they can orientate/learn from. So that is what should be found first when searching here. – Ulrich Diez Dec 02 '22 at 01:49

score 5 · Answer 2 · answered Dec 01 '22 at 12:01

You can do an expandable test for a specific character token such as ^, so long as the test is defined in advance:

\font\ecrm=ecrm1000 \ecrm
\let\^=^
\long\def\testifhat#1{\testifhatx#1^\testifhatx}
\long\def\testifhatx#1^#2\testifhatx{%
  \ifcat$\detokenize{#1}$\else not \fi hat}
\testifhat^
\testifhat\^
\bye

answered Dec 01 '22 at 12:01 by David Carlisle

That is a neat trick! I assume from the other comments that the answer to the existence of a general purpose macro in (e)TeX that takes two tokens and determines their equality is kind of “no”… then again, I suppose one could have a macro like this for each one-character control sequence instead, then check for whichever is relevant based on the value of \string for the arguments. And if it is not that then it surely must be a character token and one can use \ifcat. That does not quite sound “sane”, but I guess this might actually work. ;) – Taederias Dec 01 '22 at 12:39
@Taederias yes for pdftex you could do this for every character, in xetex that would be .. less feasible – David Carlisle Dec 01 '22 at 12:45

Ulrich Diez · Answer 3 · 2022-12-02T02:57:17.887

Explicit character tokens have a character code, which denotes the number of the code point of the corresponding character in the TeX engine's internal character encoding scheme (ASCII with traditional TeX engines, Unicode with LuaTeX/XeTeX-based TeX engines), and one of the categories 1 (begin grouping), 2 (end grouping), 3 (math shift), 4 (alignment tab), 6 (parameter), 7 (superscript), 8 (subscript), 10 (space), 11 (letter), 12 (other), 13 (active).

Explicit character tokens of a category differing from 13(active) are not control sequences.

Explicit character tokens of category 13 (active) need special consideration:

Together with the control-sequence-tokens (which come in two flavors - "control-word-tokens" and "control-symbol-tokens") they belong to the "control sequences".

[Properties of control-symbol-tokens:

- Names of control-symbol-tokens consist of a single character whose (!!!)current¹ category code is not 11(letter).
- A character in the source code whose category code at the time of tokenizing is 10(space), which trails things that got tokenized as control-symbol-tokens, is not discarded but gets tokenized as an explicit space token (explicit character token of category 10(space), character code 32).
- Exception: If the category code of the character that forms the name of the control-symbol-token is 10(space), then trailing characters of category code 10(space) are discarded.
- When unexpanded-writing a control-symbol-token to external text file, no space is appended.

Properties of control-word-tokens:

- Names of control-word-tokens either consist of several characters (not necessarily all of them having category code 11(letter)) or consist of a single character whose current category code is 11(letter).
- However, with control-word-tokens which come into being by having TeX read and tokenize .tex-input, the name of the control-sequence-token is obtained after encountering in the .tex-input-file the backslash/character of category 0(escape), which denotes that a control-sequence-token is to be created, by gathering from the line of the .tex-input-file characters whose current category code is 11(letter) until encountering a character whose category code differs from 11.
- Nevertheless control-sequence-tokens, thus also control-word-tokens, can come into being in other ways, too. E.g., by having TeX expand a \csname..\endcsname-expression. This way multi-character-control-word-tokens can come into being where characters with category codes differing from 11(letter) can be components of the name, too.
- Characters in the source code whose category code at the time of tokenizing is 10(space), which trail things that got tokenized as control-word-token, are discarded and don't yield any token.
- When unexpanded-writing a control-word-token to external text file, a space character is appended.

¹For writing unexpanded to file, TeX can be toggled between treating a control-sequence-token whose name consists of a single character as a control-word-token (where a space is appended) and treating it as a control-symbol-token (where no space is appended): assign the character that makes up the name of the control-sequence-token either category code 11 (letter) or something differing from 11. ]
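The different treatment of trailing spaces described above can be observed directly. The following is a minimal plain-TeX sketch; \1 and \one are hypothetical names chosen only for this illustration (the digit 1 has category code 12, so \1 is a control-symbol-token, while \one is a control-word-token).

```tex
% Plain TeX sketch; \1 and \one are hypothetical illustration names.
\def\1{[symbol]}
\def\one{[word]}
\tt
\1 x\par    % the space after the control symbol \1 becomes a space token
\one x\par  % the space after the control word \one is discarded
\bye
```

The first line should therefore show a space before the x, the second should not.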

If the meaning of a control sequence via \let/\futurelet is made equal to the meaning of an explicit character token whose category is not 13(active), then that control sequence is called an "implicit character token".

If the control sequence whose meaning via \let/\futurelet is made equal to the meaning of an explicit character token whose category is not 13(active) itself is an explicit character token whose category is 13(active), then that control sequence is both an explicit character token and an implicit character token at the same time.
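As a small illustration of a token that is explicit and implicit at the same time, here is a plain-TeX sketch that relies on ~ already being an active character (category 13) in plain TeX. The \let is kept local to a group so that the usual meaning of ~ is restored afterwards.

```tex
% Plain TeX sketch; ~ is active (category 13) in plain TeX by default.
\begingroup
\let~=a % ~ now has the meaning of the explicit letter token a
\tt
\meaning~\par  % shows the meaning of a letter, just as \meaning a would
\ifx~a same meaning\else different meaning\fi\par
\endgroup
\bye
```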

If the meaning of a control sequence (that control sequence can be an active character or a control-word-token or a control-symbol-token) is made equal via \let or \futurelet to the meaning of an explicit character token whose category is not 13(active), then the result of \ifcat- and \if- and \ifx-comparison with that control sequence or of applying \meaning to that control sequence is the same as you get when comparison or applying \meaning is done with the explicit character token in place of the control sequence. With \ifcat and \if this is the case even if \noexpand is prepended to the control sequence in question: the control sequence in question is neither expandable nor an undefined control sequence, thus applying \noexpand has no effect. This is also the case when doing something like \expandafter\ifx\noexpand⟨character-token that might be explicit or implicit⟩....

(I mention the prepending of \noexpand because in "Chapter 20: Definitions (also called Macros)" of the TeXbook you find an explanation about \ifcat where the need of using \noexpand for suppressing expansion when "looking" at active characters is explained:

\ifcat⟨token1⟩⟨token2⟩ (test if category codes agree)

This is just like \if, but it tests the category codes, not the character codes. Active characters have category 13, but you have to say ‘\noexpand⟨active character⟩’ in order to suppress expansion when you are looking at such characters with \if or \ifcat. For example, after
\catcode`[=13 \catcode`]=13 \def[{*}
the tests ‘\ifcat\noexpand[\noexpand]’ and ‘\ifcat[*’ will be true, but the test ‘\ifcat\noexpand[*’ will be false.

Don't be confused by this. This refers to situations where active character tokens are expandable. But if a control sequence, e.g., an active character token, via \let/\futurelet is turned into an implicit character token, then it is not expandable.)
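To contrast this with the TeXbook example, here is a plain-TeX sketch in which the active character [ is not a macro but an implicit character token. Under that assumption \noexpand should make no difference to \if and \ifcat.

```tex
% Plain TeX sketch: [ is made active and then \let equal to the letter a,
% so it is an implicit character token and not expandable.
\begingroup
\catcode`[=13 \let[=a
\tt
\if\noexpand[a same character code\else different character code\fi\par
\ifcat\noexpand[a same category\else different category\fi\par
\endgroup
\bye
```

Both tests take the true branch, whereas with \def[{*} as in the TeXbook example the \noexpand would change the outcome.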

Thus \ifcat/\if/\ifx-comparison and examining the result of applying \meaning—things which focus on aspects of the meanings of tokens—is not useful for distinguishing explicit character tokens from their implicit pendants.

To some degree you can distinguish things by comparing the results of applying \string.
But there are edge cases where this is not possible.

In this context you might be interested in the answers to the question Distinguish active characters from non-active pendants with expansion-methods only?, which I asked about four years ago.

And in the answers to the question Cases of different tokens having same meaning and same \string-representation that can occur in the stage of expansion, which I asked about a year ago.

In expl3-manuals (source3.pdf, interface.pdf) terminology is introduced for distinguishing properties/aspects of tokens:

It is important to distinguish two aspects of a token: its "shape" (for lack of a better word), which affects the matching of delimited arguments and the comparison of token lists containing this token, and its “meaning”, which affects whether the token expands or what operation it performs. One can have tokens of different shapes with the same meaning, but not the converse.

Using this terminology one can say that implicit character tokens differ from their explicit pendants in shape, but not in meaning.

As indicated in the quote, you can use delimited arguments for expandably distinguishing specific explicit non-category-13(active)-character tokens from their implicit pendants.

But there is no general expandable method known to me which is 100% reliable for deciding whether an arbitrary character token is an explicit or an implicit character token.

Theoretically a mechanism could be implemented where auxiliary macros are used for cranking out active-character-tokens and one-character-control-sequence-tokens by means of delimited arguments.

For one thing such a mechanism needs to be defined in a way where re-defining some of these active-character-tokens/one-character-control-sequence-tokens to be \outer doesn't matter as long as the user does not provide such \outer-tokens as components of arguments her-/himself.
For another thing such a mechanism should be defined not to break when being used in alignments/tables.
For yet another thing, consider that a traditional TeX engine's internal character encoding scheme is ASCII and that on common computer platforms a traditional TeX engine assumes input files to be encoded in some 8-bit encoding/byte encoding, so with traditional TeX engines the range of possible code-point-numbers of characters is 0..255. This could probably be handled. But the trend goes towards LuaTeX and XeTeX, which assume input files to be encoded in Unicode/UTF-8. So with these TeX engines the range of possible code-point-numbers of characters is 0..1114111. That's a lot.
For yet one more other thing, if you wish to expandably compare arbitrary token-lists, consider
- even more edge cases, e.g., distinguishing frozen-\relax from non-frozen-\relax, frozen \font-control-sequences from their non-frozen pendants, ...
- token-lists where brace-groups and the like are nested.
- token-lists where \if, \else, \fi and the like are not balanced.
- ...

E.g., examining the result of applying \string usually relies on applying \string to control sequences yielding more than one explicit character token. This is not the case if the control sequence in question is an active character token or if the value of \escapechar is outside the range of valid code point numbers of characters while the name of the control sequence consists of a single character.
If examining is done by means of macros that process undelimited arguments, then the first and/or the second of these explicit character tokens being an explicit space token (category 10, character code 32) might be a problem, too. (Like \meaning etc., \string delivers explicit character tokens of category 12(other). With one exception: Character tokens of character code 32 delivered by things like \string/\meaning/\detokenize/etc. always are of category 10(space) and thus are explicit space tokens, which may be discarded in the course of gathering the first token belonging to an undelimited macro argument...)
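The space-token pitfall can be reproduced with a short plain-TeX sketch. \firstof and \ab are hypothetical names used only here; \end serves merely as an argument delimiter and is never executed.

```tex
% Plain TeX sketch; \firstof and \ab are hypothetical illustration names.
\def\firstof#1#2\end{[#1]}
\begingroup
\escapechar=32
% \string\ab now yields a space token (category 10) followed by a and b.
% The undelimited argument #1 skips the leading space token and grabs a.
\tt\expandafter\firstof\string\ab\end\par
\endgroup
\bye
```

So a test that inspects "the first token of the \string-ification" silently sees the first character of the name instead of the escape character.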

Applying \string to the nameless control sequence [which can come into being by expanding \csname\endcsname or by ending a line of .tex-source-code with a backslash (a character of category 0(escape)) while \endlinechar has a negative value]

- in case the value of \escapechar is not in the range of valid code-point-numbers of characters, yields a sequence of explicit category-12(other)-character-tokens csnameendcsname,
- in case the value of \escapechar is 32, yields an explicit space token (explicit character token of category 10 and character code 32) trailed by a sequence of explicit category-12(other)-character-tokens csname, trailed by an explicit space token, trailed by a sequence of explicit category-12(other)-character-tokens endcsname,
- in case the value of \escapechar is anything else within the range of valid code-point-numbers of characters, yields an explicit character token of category 12(other) and character code equal to the value of \escapechar trailed by a sequence of explicit category-12(other)-character-tokens csname, trailed by an explicit character token of category 12(other) and character code equal to the value of \escapechar, trailed by a sequence of explicit category-12(other)-character-tokens endcsname.

You get the same by applying \string to a different token which comes into being by doing \csname csname\string\endcsname\endcsname.
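A plain-TeX sketch of this coincidence, assuming the default \escapechar; \A, \B, \C and \D are hypothetical scratch macros. The first test compares the \string-representations, the second compares macros whose replacement texts consist of the two tokens themselves.

```tex
% Plain TeX sketch; \A, \B, \C, \D are hypothetical scratch macros.
\edef\A{\expandafter\string\csname\endcsname}
\edef\B{\expandafter\string\csname csname\string\endcsname\endcsname}
\ifx\A\B same string representation\else different string representation\fi\par
% Store the two control sequence tokens unexpanded in macros and compare:
\expandafter\def\expandafter\C\expandafter{\csname\endcsname}
\expandafter\def\expandafter\D\expandafter{\csname csname\string\endcsname\endcsname}
\ifx\C\D same token\else different tokens\fi\par
\bye
```

The first test should report the same representation, the second should report different tokens, which is the non-expandable \ifx-on-macros distinction mentioned in the question.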

If both the nameless control sequence token and that token via \let or \futurelet are made equal, e.g., to the same explicit non-category-13-character-token, then distinguishing them by expansion-methods is tricky and can probably only be done by cranking things out via macros that process delimited arguments.

Be aware that LuaTeX-based engines are a different matter: As \directlua is expandable, you can, e.g., use token.scan_toks() for getting a brace-delimited set of tokens into a Lua-table of tokens and then examine properties of tokens stored in that table. token.get_next() might also be of interest.

In TUGboat, Volume 36 (2015), No. 1, you find an article "Still tokens: LuaTeX scanners" by Hans Hagen about this.
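As a rough sketch of the \directlua route (LuaTeX only), the following uses token.get_next() and the csname field of the returned token object. \whatis, \resultA and \resultB are hypothetical names, and the sketch assumes that csname is nil for tokens that are not control sequences; this should be checked against the LuaTeX manual for the version in use. Note that active characters would be reported as control sequences here, in line with the terminology above.

```tex
% LuaTeX only (plain format). \whatis is a hypothetical helper.
% Assumption: the csname field is nil for non-control-sequence tokens.
\def\whatis{\directlua{
  local t = token.get_next()
  if t.csname then tex.sprint("control sequence")
  else tex.sprint("character") end
}}
\let\^=^
\edef\resultA{\whatis^}   % expandable: usable inside \edef
\edef\resultB{\whatis\^}
\tt\meaning\resultA\par
\meaning\resultB\par
\bye
```

Because \directlua is expandable, the check works inside \edef, which is what the question asks for.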

Wow, that was a very thorough answer. It would appear that there are a whole lot of weird edge cases to consider. And yeah, dealing with every character individually in unicode-based engines is probably not something that should be done. Even if the engine would be able to deal with it, theoretically. Not sure it would though, one can very easily run into some limits there. — Taederias, Dec 01 '22 at 23:20
Also indeed my original intent was to compare arbitrary token lists in an expandable manner, but decided to go for single tokens first. As long as I can prohibit a particular token or token sequence (to be used as delimiter) and find a way to expandably detect leading character tokens with category 1, I think I see a way to go for the issue of token lists from there. I somehow didn’t find the questions you linked to when searching around, even though they touch upon the same thing. — Taederias, Dec 01 '22 at 23:27
@Taederias Expandably detecting leading explicit character tokens of category 1 of a macro argument is feasible. Back in December 2021 there was the question "Get \string-ification of first opening brace in argument?/Get \string-ification of first opening brace's matching closing brace in argument?". My answer, among other things, contains code for a macro \UD@CheckWhetherBrace which does that check expandably without the need for delimiters/forbidden tokens/sentinel-tokens or the like. — Ulrich Diez, Dec 02 '22 at 00:24
@Taederias I think, if you are picky, then the crucial point with expandably comparing arbitrary token lists is the task of distinguishing frozen font control sequences from the original font commands. This is crucial because any control sequence might be redefined to be a font command. While \relax and the set of active character tokens and the set of one-character-control-sequence-tokens form a set of tokens which is arranged clearly, this is not the case for the set of tokens that might make up (frozen) font-commands. ;-) — Ulrich Diez, Dec 02 '22 at 01:01
Okay, so I looked at the code in the linked post from your comment about checking for braces, and… wow, that is horrible. I love it! :D Anyway, do you know where I can find the documentation on the precise workings of frozen font control sequences and whatnot (and if there are any other “frozen” sequences like this and \relax)? At the very least, a cursory look through the TeXbook did not yield the desired results for me. — Taederias, Dec 05 '22 at 12:50
@Taederias About the test for opening braces: The linked post also contains an expandable test \UD@CheckWhetherNull for emptiness, not based on \detokenize. The gist both of that test and of \UD@CheckWhetherBrace is the same: "Hitting" the first token of the argument or the token trailing the empty argument by \string and using the fact that this might affect how brace-pairs are matched for cranking out diverse cases. I elaborated on this in my answer to Expandable test for an empty token list—methods, performance, and robustness. — Ulrich Diez, Dec 05 '22 at 13:39
@Taederias About frozen font-control-sequences: See, e.g., Part 30: Font Metric Data, § 548 of Donald Ervin Knuth's Computers and Typesetting, Volume B: TeX, The Program - The Complete Source Code and Program Listing for TeX. A somewhat stripped down version thereof as pdf-file can be found on CTAN. ... — Ulrich Diez, Dec 05 '22 at 13:44
@Taederias ... Part 30: Font Metric Data, § 548 says: "When the user defines \font\f, say, TeX assigns an internal number to the user’s font \f. Adding this number to font id base gives the eqtb location of a “frozen” control sequence that will always select the font." Part 18, § 256. (The hash table) might be of interest, too. — Ulrich Diez, Dec 05 '22 at 13:44
