How to map arbitrary brace-balanced sequences of non-outer tokens expandably and unambiguously to numbers or to strings consisting exclusively of explicit character tokens of category 12, if possible only with macros/things that can be implemented in Knuthian-TeX?

At first I thought of stringifying all tokens in a loop and then calculating some sort of unambiguous checksum, but stringifying involves losing information about categories and therefore such an approach cannot distinguish all possible token sequences.
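The loss of category information can be reproduced with a minimal plain TeX sketch. The macro names `\letterA`, `\otherA`, `\strA` and `\strB` are made up for this illustration only.

```tex
% Plain TeX sketch; all macro names here are hypothetical.
\def\letterA{a}%          "a" with category code 11 (letter)
\edef\otherA{\string a}%  "a" with category code 12 (other)

% The two macros differ, because \ifx also compares category codes.
\ifx\letterA\otherA \message{[same tokens]}\else \message{[different tokens]}\fi

% After \string, both bodies consist of a single "a" of category 12.
\edef\strA{\expandafter\string\letterA}
\edef\strB{\expandafter\string\otherA}
\ifx\strA\strB \message{[same after string]}\else \message{[different after string]}\fi
\bye
```

Run through `tex`, this should report `[different tokens]` followed by `[same after string]`: two distinct token sequences collapse to the same string, so a checksum computed from the stringified form cannot separate them.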

I would be grateful for an outline of how to approach the matter. I can then think about the details of a concrete implementation myself.

However, I have my doubts:

If this could be done in a way which is reliable to one hundred percent, then this could be used as an expandable method to distinguish, for example,

  • an active character token let equal to its non-active counterpart from that counterpart.
  • frozen-\relax from the \relax-primitive.
  • the nameless control-sequence (producible via \csname\endcsname or via an escape-character (backslash) at the end of a line of .tex-input while \endlinechar has a negative value) from the control-sequence whose name is csname⟨escapechar⟩endcsname (producible via \csname csname\string\endcsname\endcsname) while those control-sequences have the same non-outer meaning.
  • an explicit (non-outer) character token from a one-letter control sequence let equal to that explicit character token, when the character code is that of the character which forms the name of the control sequence and \escapechar has a negative value.
  • frozen font control sequences obtained by applying \the to a font command from the original font command.
  • ...
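As an illustration of the doubts listed above, here is a minimal plain TeX sketch of the one-letter control sequence case. `\x` and `\y` are hypothetical scratch macros. The sketch only shows that two obvious expandable tests, `\ifx` and a comparison of the `\string` results, fail to tell the tokens apart in this setup. It does not prove that no expandable test exists.

```tex
% Plain TeX sketch; \x and \y are hypothetical scratch macros.
\let\a=a          % one-letter control sequence, let equal to the letter a
\escapechar=-1    % \string now omits the escape character

% \ifx compares meanings, and both tokens mean "the letter a".
\ifx\a a\message{[ifx: equal]}\else \message{[ifx: different]}\fi

\edef\x{\string\a}% yields "a" of category 12
\edef\y{\string a}% also yields "a" of category 12
\ifx\x\y \message{[string: equal]}\else \message{[string: different]}\fi
\bye
```

Both tests should report `equal`, although `\a` is a control sequence token and `a` is an explicit character token.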

Can I conclude that an expandable approach restricted to the means provided by Knuthian-TeX cannot be made both one hundred percent reliable and practical?

How should one approach the matter if neither expandability nor sticking to Knuthian-TeX is a requirement?

Ulrich Diez
  • Leaving aside the part of the checksum (I guess that's doable), expandably I think you can't 100% because you can't map expandably over a token list without normalising catcode-1 and 2 tokens to { and }, so there's some loss there (insignificant, for most reasonable applications, but still) – Phelype Oleinik Apr 17 '22 at 14:00
  • @PhelypeOleinik Except that you can https://tex.stackexchange.com/questions/628358/get-string-ification-of-first-opening-brace-in-argument-get-string-ification?noredirect=1&lq=1 – user202729 Apr 17 '22 at 14:31
  • @Ulr I thought you already know the solution? Do something like 256 fork to distinguish these things etc. – user202729 Apr 17 '22 at 14:32
  • Although I think for some tasks TeX is just terribly slow, and it could be slower if you restrict yourself to expansion-only. – user202729 Apr 17 '22 at 14:47
  • @PhelypeOleinik My answer to Get \string-ification of first opening brace in argument?/Get \string-ification of first opening brace's matching closing brace in argument? provides code for expandably iterating on a macro-argument, detecting explicit character tokens of catcode 1/2 and normalizing them. Based on this, a mechanism should be feasible which both normalizes and records that the things normalized were of category 1/2. – Ulrich Diez Apr 17 '22 at 14:58
  • @user202729 I think it is not feasible in the generality requested in my question. 256 fork = fork for each of the 256 characters possible with 8bit-encodings? That would fit the request of using Knuthian-TeX. But if being a nitpicker, this does not solve the problem with frozen font-control sequences. Besides this, when deviating from Knuthian-TeX, also considering usage of unicode-engines with 1114111 code-points for characters, this might be a problem... – Ulrich Diez Apr 17 '22 at 15:03
  • @user202729 Actually I asked the question so that I can reference it when answering other questions myself. E.g., recently somebody asked about creating destinations for hyperlinks by deriving the name of the destination from the tokens that in the pdf-output shall yield the visible material of the destination. You cannot unambiguously derive names for destinations from arbitrary sets of tokens. If interested in that question, see [How \string and \newcommand work?](https://tex.stackexchange.com/q/640934/118714) – Ulrich Diez Apr 17 '22 at 15:08
  • @UlrichDiez That one doesn't require expandability, does it? (in fact I think it would be very bad practice for any TeX API to require expandability) – user202729 Apr 17 '22 at 15:10
  • (Not that it doesn't happen in practice; package writers vary in their familiarity with TeX expansion issues) – user202729 Apr 17 '22 at 15:11
  • @user202729 It does not require expandability, but some mechanism for unambiguously mapping the tokens of a textual phrase to something usable as destination/name of target. I thought doing this in a time-consuming expandable way might be nicer than a bunch of memory-consuming temporary assignments per hyperlink. ;-)) – Ulrich Diez Apr 17 '22 at 15:15
  • To be fair, if you assign to the same target, the additional memory requirement is constant // alternatively you can be both time and memory efficient with LuaTeX // for non-expandable expl3 has \tl_analysis_* functions, which I don't know if it can handle frozen things. – user202729 Apr 17 '22 at 15:17
  • @user202729 To be even more fair, for my own use-cases I would not attempt to map token-sequences to destination-names at all but urge the user to provide the destination-name. Then you/user can easily access these destination-names, e.g., in case of linking a destination using a different phrase/a different set of tokens for the textual phrase of the link. After all, that question is about organizing the elements of a kind of database. This should not be over-automated. But the problem of mapping token sequences I find interesting - at least it distracts me from pain. ;-) – Ulrich Diez Apr 17 '22 at 15:24
  • @UlrichDiez Yeah, getting the opening catcode-1 token is easy, but you can't get the matching closing catcode-2 token without losing its charcode (at least not with the tricks I know). For example, suppose that ] is catcode-2, then in \dostuff{hello{braced]world} you can easily see the charcode of {, but you can't(?) see that it's closed by a ], right? – Phelype Oleinik Apr 17 '22 at 17:57
  • @PhelypeOleinik "Getting hold of" the matching explicit category-2-character-token by expandable methods only, seeing its character-code - that's what the macro \UD@ExtractFirstOpeningBracesMatchingClosingBraceStringified in my answer to Get \string-ification of first opening brace in argument?/Get \string-ification of first opening brace's matching closing brace in argument? is about. ;-) It is feasible. In some of the examples Y is of category 2, and you get the result of "hitting" that Y with \string. – Ulrich Diez Apr 17 '22 at 22:48
  • @UlrichDiez Oh, sorry, I misunderstood that answer then. I'll take some time to read it to understand how it works (it's rather lengthy :). But I guess it falls out of the scope of "practical" (couldn't probably be used in the l3 kernel for iterating over a token list without a significant performance hit) – Phelype Oleinik Apr 18 '22 at 00:50
  • Maybe I'll write some explanation for that expandafter mess later... – user202729 Apr 18 '22 at 02:32
  • @PhelypeOleinik But the \tl_if_head_is_space etc. are also quadratic time ? – user202729 Apr 18 '22 at 02:32
  • @user202729 Not sure, I haven't benchmarked much. But we use token-by-token iteration as little as possible (it's not a slow process per se, but if token lists are large, it tends to get out of hand). The point really is that checking if the next token is a space is important for operating on a token list, but getting the correct charcode for a catcode-2 token is mostly for academical purposes, so even if it were fast, it doesn't seem like a win – Phelype Oleinik Apr 18 '22 at 02:43
  • @PhelypeOleinik "but getting the correct charcode for a catcode-2 token is mostly for academical purposes" - I think so, too, but the question was asked, so... ;-) There perhaps probably maybe might be some point in ensuring that things with unusual #<some explicit category-1-character-token>-notation still work out, but even I refuse squeezing an obscure pseudo-exception out of my brain where "laying hands" on character codes of explicit category-2-tokens might be of practical use. ;-) – Ulrich Diez Apr 18 '22 at 16:07
  • @UlrichDiez Sure, I wasn't discussing the merit of your question (I too like to dwell in the dark corners of TeX :). My point was on comparing this corner case with \tl_if_head_is_space:nTF in the kernel: detecting a space is often needed, while the char code of a catcode-2 token is really not that important – Phelype Oleinik Apr 18 '22 at 16:41
