Can you map arbitrary token sequences expandably and unambiguously to numbers or to strings of explicit character tokens of category 12?

Question

How can one map arbitrary brace-balanced sequences of non-outer tokens expandably and unambiguously to numbers, or to strings consisting exclusively of explicit character tokens of category 12, if possible using only macros that can be implemented in Knuthian TeX?

At first I thought of stringifying all tokens in a loop and then calculating some sort of unambiguous checksum, but stringifying involves losing information about categories and therefore such an approach cannot distinguish all possible token sequences.
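
To make the information loss concrete, here is a minimal plain TeX sketch. The macro names \letterz, \otherz, \strletter and \strother are made up for illustration; the sketch only shows that two different tokens can stringify to the same result, nothing more.

```tex
% Plain TeX sketch; all macro names below are hypothetical.
\def\letterz{Z}% explicit Z of category 11 (letter)
\begingroup
\catcode`\Z=12\relax
\gdef\otherz{Z}% explicit Z of category 12 (other)
\endgroup
% \string turns either token into a Z of category 12.
\edef\strletter{\expandafter\string\letterz}
\edef\strother{\expandafter\string\otherz}
% The original tokens differ ...
\ifx\letterz\otherz \message{tokens: same}\else \message{tokens: different}\fi
% ... but their stringifications do not.
\ifx\strletter\strother \message{strings: same}\else \message{strings: different}\fi
\bye
```

The first test reports "tokens: different" and the second "strings: same", so any checksum computed from the stringified form alone cannot tell the two inputs apart.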

I would be grateful for an outline of how to approach the matter. I can then think about the details of a concrete implementation myself.

However, I have my doubts:

If this could be done in a way which is reliable to one hundred percent, then this could be used as an expandable method to distinguish, for example,

an active character token \let equal to a non-active counterpart from that counterpart;
the frozen \relax from the \relax primitive;
the nameless control sequence (producible via \csname\endcsname, or via an escape character (backslash) at the end of a line of .tex input while \endlinechar has a negative value) from the control sequence whose name is csname⟨escapechar⟩endcsname (producible via \csname csname\string\endcsname\endcsname), when both control sequences have the same non-outer meaning;
an explicit (non-outer) character token from a one-letter control sequence \let equal to that explicit character token, when the character code matches the character that forms the name of the control sequence and \escapechar has a negative value;
frozen font control sequences, obtained by applying \the to a font command, from the original font command.
...
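
The frozen \relax case from the list above can be illustrated with a small plain TeX sketch. It assumes the common trick of obtaining a frozen \relax from a conditional whose number is terminated only by \fi; \frozen and \normal are hypothetical names chosen for this example.

```tex
% Plain TeX sketch; \frozen and \normal are hypothetical names.
\edef\frozen{\ifnum0=0\fi}% replacement text: the frozen \relax token
\def\normal{\relax}%        replacement text: the ordinary \relax token
% Comparing the two tokens directly: they have the same meaning.
\expandafter\ifx\frozen\relax
  \message{tokens: same meaning}%
\else
  \message{tokens: different meaning}%
\fi
% Comparing the macros: their replacement texts hold different tokens.
\ifx\frozen\normal
  \message{macros: equal}%
\else
  \message{macros: differ}%
\fi
% \meaning gives the same text for both tokens.
\message{\meaning\relax\space and \expandafter\meaning\frozen}
\bye
```

Here the token-level \ifx and \meaning see no difference, while the macro-level \ifx does. That test, however, relies on having stored the tokens in macros beforehand, which is not the same as classifying an unknown token purely by expansion.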

Can I conclude that an expandable approach restricted to the means provided by Knuthian TeX cannot be made both one hundred percent reliable and practical?

How should one approach the matter if expandability and sticking to Knuthian TeX are not requirements?

Leaving aside the part of the checksum (I guess that's doable), expandably I think you can't 100% because you can't map expandably over a token list without normalising catcode-1 and 2 tokens to { and }, so there's some loss there (insignificant, for most reasonable applications, but still) — Phelype Oleinik, Apr 17 '22 at 14:00
@PhelypeOleinik Except that you can https://tex.stackexchange.com/questions/628358/get-string-ification-of-first-opening-brace-in-argument-get-string-ification?noredirect=1&lq=1 — user202729, Apr 17 '22 at 14:31
@Ulr I thought you already know the solution? Do something like 256 fork to distinguish these things etc. — user202729, Apr 17 '22 at 14:32
Although I think for some tasks TeX is just terribly slow, and it could be slower if you restrict yourself to expansion-only. — user202729, Apr 17 '22 at 14:47
@PhelypeOleinik My answer to "Get \string-ification of first opening brace in argument?/Get \string-ification of first opening brace's matching closing brace in argument?" provides code for expandably iterating on a macro argument, detecting explicit character tokens of catcode 1/2 and normalizing them. Based on this, a mechanism should be feasible which both normalizes and records that the things normalized were of category 1/2. — Ulrich Diez, Apr 17 '22 at 14:58
@user202729 I think it is not feasible in the generality requested in my question. 256 fork = a fork for each of the 256 characters possible with 8-bit encodings? That would fit the request of using Knuthian TeX. But, to nitpick, this does not solve the problem with frozen font control sequences. Besides this, when deviating from Knuthian TeX and also considering Unicode engines with 1114111 code-points for characters, this might be a problem... — Ulrich Diez, Apr 17 '22 at 15:03
@user202729 Actually I asked the question so that I can reference it when answering other questions myself. E.g., recently somebody asked about creating destinations for hyperlinks by deriving the name of the destination from the tokens that in the pdf-output shall yield the visible material of the destination. You cannot unambiguously derive names for destinations from arbitrary sets of tokens. If interested in that question, see "How \string and \newcommand work?" (https://tex.stackexchange.com/q/640934/118714) — Ulrich Diez, Apr 17 '22 at 15:08
@UlrichDiez That one doesn't require expandability, does it? (in fact I think it would be very bad practice for any TeX API to require expandability) — user202729, Apr 17 '22 at 15:10
(Not that it doesn't happen in practice; package writers vary in their familiarity with TeX expansion issues) — user202729, Apr 17 '22 at 15:11
@user202729 It does not require expandability, but it does need some mechanism for unambiguously mapping the tokens of a textual phrase to something usable as the destination/name of a target. I thought doing this expandably, though time-consuming, might be nicer than a bunch of memory-consuming temporary assignments per hyperlink. ;-)) — Ulrich Diez, Apr 17 '22 at 15:15
To be fair, if you assign to the same target, the additional memory requirement is constant // alternatively you can be both time and memory efficient with LuaTeX // for non-expandable expl3 has \tl_analysis_* functions, which I don't know if it can handle frozen things. — user202729, Apr 17 '22 at 15:17
@user202729 To be even more fair, for my own use-cases I would not attempt to map token-sequences to destination-names at all but urge the user to provide the destination-name. Then you/user can easily access these destination-names, e.g., in case of linking a destination using a different phrase/a different set of tokens for the textual phrase of the link. After all, that question is about organizing the elements of a kind of database. This should not be over-automated. But the problem of mapping token sequences I find interesting - at least it distracts me from pain. ;-) — Ulrich Diez, Apr 17 '22 at 15:24
@UlrichDiez Yeah, getting the opening catcode-1 token is easy, but you can't get the matching closing catcode-2 token without losing its charcode (at least not with the tricks I know). For example, suppose that ] is catcode-2, then in \dostuff{hello{braced]world} you can easily see the charcode of {, but you can't(?) see that it's closed by a ], right? — Phelype Oleinik, Apr 17 '22 at 17:57
@PhelypeOleinik "Getting hold of" the matching explicit category-2-character-token by expandable methods only, seeing its character-code - that's what the macro \UD@ExtractFirstOpeningBracesMatchingClosingBraceStringified in my answer to Get \string-ification of first opening brace in argument?/Get \string-ification of first opening brace's matching closing brace in argument? is about. ;-) It is feasible. In some of the examples Y is of category 2, and you get the result of "hitting" that Y with \string. — Ulrich Diez, Apr 17 '22 at 22:48
@UlrichDiez Oh, sorry, I misunderstood that answer then. I'll take some time to read it to understand how it works (it's rather lengthy :). But I guess it falls out of the scope of "practical" (couldn't probably be used in the l3 kernel for iterating over a token list without a significant performance hit) — Phelype Oleinik, Apr 18 '22 at 00:50
Maybe I'll write some explanation for that expandafter mess later... — user202729, Apr 18 '22 at 02:32
@PhelypeOleinik But \tl_if_head_is_space etc. also take quadratic time? — user202729, Apr 18 '22 at 02:32
@user202729 Not sure, I haven't benchmarked much. But we use token-by-token iteration as little as possible (it's not a slow process per se, but if token lists are large, it tends to get out of hand). The point really is that checking if the next token is a space is important for operating on a token list, but getting the correct charcode for a catcode-2 token is mostly for academic purposes, so even if it were fast, it doesn't seem like a win — Phelype Oleinik, Apr 18 '22 at 02:43
@PhelypeOleinik "but getting the correct charcode for a catcode-2 token is mostly for academical purposes" - I think so, too, but the question was asked, so... ;-) There perhaps probably maybe might be some point in ensuring that things with unusual #<some explicit category-1-character-token>-notation still work out, but even I refuse squeezing an obscure pseudo-exception out of my brain where "laying hands" on character codes of explicit category-2-tokens might be of practical use. ;-) — Ulrich Diez, Apr 18 '22 at 16:07
@UlrichDiez Sure, I wasn't discussing the merit of your question (I too like to dwell in the dark corners of TeX :). My point was on comparing this corner case with \tl_if_head_is_space:nTF in the kernel: detecting a space is often needed, while the char code of a catcode-2 token is really not that important — Phelype Oleinik, Apr 18 '22 at 16:41