Trying to implement simple algorithms in the TeX language, I've tried to implement a string length function (potentially a token counter function).
The concept "string" is vague in TeX.
Right after the .tex input file is read and tokenized (in Donald E. Knuth's analogy of TeX as an organism with eyes and a digestive tract, this is done by TeX's eyes and mouth), everything in TeX is about tokens (explicit character tokens, control word tokens, control symbol tokens).
With the expansion of macros (this is done in TeX's gullet), everything is about macro arguments. A macro argument can be empty, consist of a single token, or consist of several tokens.
Your mechanism basically calls the macro \measurelen recursively: each call removes one undelimited macro argument and, in case the meaning of the first token of the argument does not equal the meaning of the token \endstr, increments a counter and calls itself again.
So your mechanism does not actually count tokens; it counts undelimited macro arguments. (A single macro argument, counted as one item, can consist of several character tokens.)
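To make the difference between counting arguments and counting tokens concrete, here is a minimal plain-TeX sketch of an argument counter in the spirit of your \measurelen. The names \countargs, \countargsloop, \argcount, \temp and \endstrtest are made up for this illustration, and the sketch assumes that the argument contains neither hashes nor \par.

```tex
% Hypothetical helper names; not taken from the question's code.
\newcount\argcount
\def\endstrtest{\endstr}% reference macro whose replacement text is just \endstr
\def\countargs#1{\argcount=0 \countargsloop#1\endstr}
\def\countargsloop#1{%
  \def\temp{#1}% store the undelimited argument that was just grabbed
  \ifx\temp\endstrtest
    \let\next\relax          % sentinel reached: stop
  \else
    \advance\argcount by 1   % one more undelimited argument
    \let\next\countargsloop
  \fi
  \next
}
\countargs{A {BCD{EFG}H } {}I J}
\message{Undelimited arguments: \the\argcount}% shows 5
\bye
```

The five items are A, {BCD{EFG}H }, {}, I and J: the spaces between them are skipped and each braced group counts as one item, whereas a count of tokens including braces would give 20.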
Some questions arise about how your routine should handle things:
- Assume the macro argument/the set of tokens which forms the "string" also contains balanced curly braces, i.e., explicit character tokens of category 1 (begin group) and category 2 (end group), such as { with category 1 and } with category 2. These tokens usually do not yield visible output, but they may affect grouping/scoping while TeX is running. Shall such tokens be counted by the \strlen routine anyway?
What result do you wish to obtain with \strlen{A {BCD{EFG}H } {}I J}? Shall it be 20 (braces counted)? Shall it be 14 (braces not counted)? The former might make sense when doing something like \strlen{#2}...\scantokens\expandafter{\string\verb*#1#2#1}, where #2 denotes a set of unexpandable explicit character tokens not of category 6 (parameter) and #1 denotes a verb delimiter which does not occur within #2.
- Assume the macro argument/the set of tokens which forms the "string" also contains explicit character tokens of category 6 (parameter), i.e., hashes (#). Shall each hash character token be counted on its own, or shall two consecutive hash character tokens be counted as a single one? The latter might make sense when defining a scratch macro from the argument directly via \def, so that at the time of expanding the scratch macro two consecutive hashes are collapsed into a single one.
- How to count tokens that don't yield single characters but yield, e.g., inclusion of a graphic, drawing of a rule, assigning to a macro/register, gobbling/removal of subsequent tokens?
- How to count characters which do not denote the drawing of glyphs/of instances of graphemes like digits or letters but denote, e.g., that a line break shall (not) occur (something like ASCII's CR and LF, or Unicode's "word joiner") or that some horizontal space shall be inserted (space, no-break space, en space, em space)?
- How to count token-sequences that yield a single accented letter?
- Shall the mechanism work both on 8-bit engines and on UTF-8 engines like XeTeX/LuaTeX? If so, how to handle multibyte UTF-8 characters in the 8-bit engine? With LaTeX on 8-bit engines, UTF-8-encoded files are processed using the inputenc package with the utf8 option, so that a multibyte UTF-8 character is represented by several tokens, each coming from tokenizing the .tex input file byte by byte instead of character by character.
A basic remark:
In TeX tokens can come into being in two ways:
- By having TeX read and tokenize stuff from the .tex-input-file.
- During expansion, by having TeX replace tokens by tokens that form replacement text (e.g., of macro tokens and their arguments, the result of expanding a \csname..\endcsname expression, or the result of carrying out a \the directive).
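A small plain-TeX illustration of the second route, as a sketch only: the name "foo bar" and the use of the scratch register \count255 are arbitrary choices for this example.

```tex
% A control sequence token whose name contains a space cannot be typed
% directly under the usual category codes, but \csname can create it
% during expansion.
\message{\expandafter\string\csname foo bar\endcsname}% shows \foo bar
% \the turns the value of a register into character tokens during expansion.
\count255=42
\message{\the\count255}% shows 42
\bye
```

Neither the token \foo bar nor the digit tokens 4 and 2 delivered by \the were produced by tokenizing those characters as such from the input file, so category code changes made for reading the file have no influence on them.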
Let's look at your code:
\def\strlen{\begingroup\catcode`\ =11\strlenmain}
- Don't switch space to category 11 (letter)! If, after doing so, the tokens that form the argument of \strlenmain come into being by having TeX read and tokenize the .tex input file, spaces in the .tex input file can be taken for components of names of control sequence tokens. This works in the same way in which \makeatletter is used for "telling" TeX that henceforth @ can be a component of the name of a control sequence token that comes into being by having TeX read and tokenize the .tex input file. E.g., with \strlen\frase %%% the argument of \strlenmain would not be the token \frase but the token \frase⟨space⟩.
- Switching the category code of the space character does not help with material which does not come from having TeX read and tokenize the .tex input file but which comes into being due to expansion or which is passed to \strlen by another macro.
E.g., with \strlen\frase the tokens that at the time of expanding \frase form the replacement text of \frase were read and tokenized from the .tex-input-file at the time when \frase was defined, i.e., at a time when \strlen's temporary change of the category code of the space character was not in effect and thus did not affect how things got tokenized.
E.g., if with \def\PassToSrlen#1{\strlen{#1}} you do \PassToSrlen{A B C}, the argument of \strlen will be A B C where spaces got tokenized as usual/as explicit space tokens of category 10(space) and character code 32 because the argument of \PassToSrlen, which is passed on to \strlen, is tokenized at a time when \strlen's temporary change of the category code of the space character is not in effect yet and thus does not affect how things get tokenized.
So you still need to consider the case of there being explicit space tokens of category 10(space) and character code 32.
- As there is nothing between 11 and \strlenmain that might indicate that the digit sequence forming the number is finished, TeX will expand \strlenmain to find out whether there are more digits (and if so, raise an error about the catcode not being valid). Thus \strlenmain will be top-level-expanded while TeX is still evaluating how the category code assignment shall be done, not after the assignment has been carried out. In this case it doesn't matter, as the first token coming from expanding \strlenmain is \countdef, which clearly is not a digit. But there are scenarios where having things expanded while the digit token sequence forming the ⟨number⟩ quantity is not yet terminated might bite you. So if a TeX ⟨number⟩ quantity is to be formed by a sequence of explicit digit character tokens, it is good practice to terminate that sequence with an explicit space token. The space token will be removed by TeX's routine for gathering sequences of explicit digit character tokens as TeX ⟨number⟩ quantities.
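The first and the last point of this list can be checked with a few lines of plain TeX. This is only a sketch: \digit is a made-up helper, \count255 is used as a scratch register, and \frase does not even need to be defined because \string is applied to it.

```tex
% (1) With the usual category codes the space after \frase is dropped
%     while tokenizing, so the token is \frase.
\message{[\string\frase ]}% shows [\frase]
\begingroup
\catcode`\ =11 % the space after "11" ends the number before the comment starts
% Do not indent the next line: a leading space would now be a letter token.
\message{[\string\frase ]}% shows [\frase ] (the name now ends with a space)
\endgroup
% (2) An unterminated number keeps expanding what follows it.
\def\digit{2}
\count255=1\digit \message{\the\count255}% shows 12
\count255=1 \message{\the\count255}% shows 1
\bye
```

In the second part the only difference between the two assignments is whether something terminates the digit sequence before the next expandable token is reached.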
\def\strlenmain#1{%
\countdef\len=0
\len=0
\edef\arg{#1}
\def\measurelen##1 [... etc etc]
}
- You probably don't need to define \len and \measurelen each time \strlenmain/\strlen is carried out. I suggest defining these things outside the group/outside \strlenmain.
- You use \edef for defining a scratch macro that shall deliver the expansion of the argument. If the argument contains single hash character tokens of category 6 (parameter), these will be taken for something that denotes a parameter of the macro \arg while \arg is being defined, although the \edef assignment has no parameter text. Thus you will get an error in this case. If the argument contains sequences of more than one such hash character token, then at the time of expanding \arg (\expandafter\measurelen\arg\endstr) every two consecutive hash character tokens are collapsed into a single hash character token. This collapsing of hash character tokens affects the counting of tokens. With recent TeX engines you can use \expanded{...} instead of defining a scratch macro. Then the hashes won't go into the replacement text of a macro definition, and the problems that come from hashes going into the replacement text of a macro definition do not arise.
Actually your mechanism counts undelimited arguments, not tokens.
What result do you wish to obtain with \strlen{A {BCD{EFG}H } {}I J}?
TeX discards explicit space tokens (character code 32, category 10) while scanning for the first token that belongs to an undelimited argument. E.g., after \def\processtwo#1#2{(#1)/(#2)}, the calls \processtwo{A} {B} and \processtwo{A}{B} and \processtwo A {B} and \processtwo A{B} and \processtwo{A} B and \processtwo{A}B and \processtwo A B and \processtwo AB all yield the same, namely (A)/(B). (Whether or not there is an explicit space token between the first and the second undelimited argument of \processtwo doesn't matter.)
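The space-skipping rule can be checked directly in plain TeX. The sketch below writes the result of three of the calls to the terminal and the log file via \message instead of typesetting it.

```tex
\def\processtwo#1#2{(#1)/(#2)}
\message{\processtwo{A} {B}}% shows (A)/(B)
\message{\processtwo A B}%    shows (A)/(B)
\message{\processtwo AB}%     shows (A)/(B)
\bye
```

For a counter built on undelimited arguments this means that explicit space tokens between items are silently dropped and never reach the counting macro.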
If your "string" contains unbalanced \else or \fi, these might erroneously be matched with the \ifx comparison done by \measurelen.
If you really want to count tokens, no matter
- that at some stage of processing other than the stage of expansion their processing yields characters in an amount differing from the amount of tokens (e.g., the .tex input \char65⟨space⟩ usually yields four tokens, namely \char, 6, 5 and an explicit space token, which in turn place a single character A into the output file/.pdf file), and
- that results of counting might differ depending on whether an 8-bit TeX engine or a UTF-8 TeX engine (like XeTeX or LuaTeX) is in use,
then, with e-TeX extensions (\numexpr) and \expanded available, you can try something like this:
\errorcontextlines=10000
\catcode`\@=11
%%=============================================================================
%% PARAPHERNALIA:
%% \UD@firstoftwo, \UD@secondoftwo, \UD@PassFirstToSecond, \UD@Exchange,
%% \UD@removespace, \UD@stopromannumeral, \UD@CheckWhetherNull,
%% \UD@CheckWhetherBrace, \UD@CheckWhetherLeadingExplicitSpace,
%% \UD@ExtractFirstArg
%%=============================================================================
\long\def\UD@firstoftwo#1#2{#1}%
\long\def\UD@secondoftwo#1#2{#2}%
\long\def\UD@PassFirstToSecond#1#2{#2{#1}}%
\long\def\UD@Exchange#1#2{#2#1}%
\UD@Exchange{ }{\def\UD@removespace}{}%
\chardef\UD@stopromannumeral=`\^^00%
%%-----------------------------------------------------------------------------
%% Check whether argument is empty:
%%.............................................................................
%% \UD@CheckWhetherNull{<Argument which is to be checked>}%
%% {<Tokens to be delivered in case that argument
%% which is to be checked is empty>}%
%% {<Tokens to be delivered in case that argument
%% which is to be checked is not empty>}%
%%
%% The gist of this macro comes from Robert R. Schneck's \ifempty-macro:
%% <https://groups.google.com/forum/#!original/comp.text.tex/kuOEIQIrElc/lUg37FmhA74J>
\long\def\UD@CheckWhetherNull#1{%
\romannumeral\expandafter\UD@secondoftwo\string{\expandafter
\UD@secondoftwo\expandafter{\expandafter{\string#1}\expandafter
\UD@secondoftwo\string}\expandafter\UD@firstoftwo\expandafter{\expandafter
\UD@secondoftwo\string}\expandafter\UD@stopromannumeral\UD@secondoftwo}{%
\expandafter\UD@stopromannumeral\UD@firstoftwo}%
}%
%%-----------------------------------------------------------------------------
%% Check whether argument's first token is a catcode-1-character
%%.............................................................................
%% \UD@CheckWhetherBrace{<Argument which is to be checked>}%
%% {<Tokens to be delivered in case that argument
%% which is to be checked has a leading
%% explicit catcode-1-character-token>}%
%% {<Tokens to be delivered in case that argument
%% which is to be checked does not have a
%% leading explicit catcode-1-character-token>}%
\long\def\UD@CheckWhetherBrace#1{%
\romannumeral\expandafter\UD@secondoftwo\expandafter{\expandafter{%
\string#1.}\expandafter\UD@firstoftwo\expandafter{\expandafter
\UD@secondoftwo\string}\expandafter\UD@stopromannumeral\UD@firstoftwo}{%
\expandafter\UD@stopromannumeral\UD@secondoftwo}%
}%
%%-----------------------------------------------------------------------------
%% Check whether brace-balanced argument starts with a space-token
%%.............................................................................
%% \UD@CheckWhetherLeadingExplicitSpace{<Argument which is to be checked>}%
%% {<Tokens to be delivered in case <argument
%% which is to be checked> does have a
%% leading explicit space-token>}%
%% {<Tokens to be delivered in case <argument
%% which is to be checked> does not have a
%% leading explicit space-token>}%
\long\def\UD@CheckWhetherLeadingExplicitSpace#1{%
\romannumeral\UD@CheckWhetherNull{#1}%
{\expandafter\UD@stopromannumeral\UD@secondoftwo}%
{%
% Let's nest things into \UD@firstoftwo{...}{} to make sure they are nested in braces
% and thus do not disturb when the test is carried out within \halign/\valign:
\expandafter\UD@firstoftwo\expandafter{%
\expandafter\expandafter\expandafter\UD@stopromannumeral
\romannumeral\expandafter\UD@secondoftwo
\string{\UD@CheckWhetherLeadingExplicitSpaceB.#1 }{}%
}{}%
}%
}%
\long\def\UD@CheckWhetherLeadingExplicitSpaceB#1 {%
\expandafter\UD@CheckWhetherNull\expandafter{\UD@firstoftwo{}#1}%
{\UD@Exchange{\UD@firstoftwo}}{\UD@Exchange{\UD@secondoftwo}}%
{\expandafter\expandafter\expandafter\UD@stopromannumeral
\expandafter\expandafter\expandafter}%
\expandafter\UD@secondoftwo\expandafter{\string}%
}%
%%-----------------------------------------------------------------------------
%% Extract first inner undelimited argument:
%%
%% \UD@ExtractFirstArg{ABCDE} yields {A}
%%
%% \UD@ExtractFirstArg{{AB}CDE} yields {AB}
%%
%% Due to \romannumeral-expansion the result is delivered after two
%% expansion-steps/after "hitting" \UD@ExtractFirstArg with \expandafter
%% twice.
%%
%% \UD@ExtractFirstArg's argument must not be blank.
%%
%% Use frozen-\relax as delimiter for speeding things up.
%% I chose frozen-\relax because David Carlisle pointed out in
%% <https://tex.stackexchange.com/a/578877>
%% that frozen-\relax cannot be (re)defined in terms of \outer and cannot be
%% affected by \uppercase/\lowercase.
%%
%% \UD@ExtractFirstArg's argument may contain frozen-\relax:
%% The only effect is that internally more iterations are needed for
%% obtaining the result.
%%
%%.............................................................................
\expandafter\expandafter\expandafter\UD@Exchange
\expandafter\expandafter\expandafter{%
\expandafter\expandafter\ifnum0=0\fi}%
{\long\def\UD@RemoveTillFrozenrelax#1#2}{{#1}}%
%
\expandafter\UD@PassFirstToSecond\expandafter{%
\romannumeral\expandafter
\UD@PassFirstToSecond\expandafter{\romannumeral
\expandafter\expandafter\expandafter\UD@Exchange
\expandafter\expandafter\expandafter{%
\expandafter\expandafter\ifnum0=0\fi}{\UD@stopromannumeral#1{}}%
}{%
\UD@stopromannumeral\romannumeral\UD@ExtractFirstArgLoop
}%
}{%
\long\def\UD@ExtractFirstArg#1%
}%
\long\def\UD@ExtractFirstArgLoop#1{%
\expandafter\UD@CheckWhetherNull\expandafter{\UD@firstoftwo{}#1}%
{\UD@stopromannumeral#1}%
{\expandafter\UD@ExtractFirstArgLoop\expandafter{\UD@RemoveTillFrozenrelax#1}}%
}%
%%=============================================================================
%% USER-MACRO:
%%=============================================================================
\long\def\strlen#1{%
%
\romannumeral\expandafter\tokencountloop\expanded{{#1}}{0}%
%
%{\edef\arg{#1}\expandafter}\expandafter
%\romannumeral\expandafter\tokencountloop\expandafter{\arg}{0}%
}%
%%=============================================================================
%% MAIN LOOP:
%%=============================================================================
\long\def\tokencountloop#1#2{%
\UD@CheckWhetherNull{#1}{\UD@stopromannumeral#2}{%
\expandafter\UD@PassFirstToSecond\expandafter{%
\number\numexpr#2+%
\UD@CheckWhetherBrace{#1}{%
2+% <-If you don't want curly braces to be counted, comment out.
\romannumeral\expandafter\expandafter\expandafter\tokencountloop\UD@ExtractFirstArg{#1}{0}%
}{1}%
\relax
}{%
\UD@CheckWhetherLeadingExplicitSpace{#1}{%
\expandafter\tokencountloop\expandafter{\UD@removespace#1}%
}{%
\expandafter\tokencountloop\expandafter{\UD@firstoftwo{}#1}%
}%
}%
}%
}%
\catcode`\@=12
O tamanho de "..." aberta é :
\strlen{Uma noite em Peron} %%%% outputs 18
\def\frase{Uma noite em Peron}
O tamanho de "\frase" é :
\strlen\frase %%%% outputs 18
\def\frase{Uma\ noite\ em\ Peron}
O tamanho de "\frase" é :
\strlen\frase %%%% outputs 18
O tamanho de "..." aberta é :
\strlen{Uma {{noite} }em Peron} %%%% outputs 22 which might make sense with things like
%%%% \strlen{#2}
%%%% \scantokens\expandafter{\string\verb*#1#2#1}
%%%% where #2 denotes a set of unexpandable explicit character tokens
%%%% and #1 denotes a verb-delimiter which does not occur within #2.
\end
Assuming e-TeX-extensions (\numexpr/\detokenize) and \expanded are not available, the job of counting tokens will not be trivial and cannot be done in terms of expandable methods/routines alone:
In order to expand the argument you need to use \edef for defining a scratch macro.
- Defining a scratch macro is itself not an expandable operation.
- Before defining the scratch macro, every hash (every explicit character token of category 6 (parameter)) in the argument needs to be doubled.
I don't see any reliable method for detecting an explicit character token of category 6 (parameter) that works without any TeX extension.
(If \detokenize were available, one could check whether applying \string to the token in question yields a single token while applying \detokenize (due to its doubling of hashes) yields two tokens; however, the edge case of an explicit character token of character code 32 and category 6 would require special attention.)
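The difference between \string and \detokenize that such a check would rely on can be observed as follows. This is a sketch that assumes an engine with e-TeX extensions enabled (for example pdftex with its default plain format); it only shows the doubling and is not a complete hash detector.

```tex
% Requires e-TeX's \detokenize.
\message{[\string#]}%        shows [#]
\message{[\detokenize{#}]}%  shows [##]
\message{[\detokenize{A}]}%  shows [A]
\bye
```

A test built on this would compare the number of character tokens delivered by the two primitives for the token in question.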
\str_count:n in expl3 but it looks like you aren't. – user202729 Nov 17 '22 at 05:33
\begingroup\catcode`\ =11 \strlenmain – Ulrike Fischer Nov 17 '22 at 09:28
"Furthermore, spaces are not ignored after control sequences inside a token list; the ignore-space rule applies only in an input file, during the time that strings of characters are being tokenized." Excerpt from a chapter before exercise 7.3 of TeX's Magna Carta. So I'm missing something once again, unfortunately. Could someone give me a hand? – Daniel Bandeira Nov 17 '22 at 09:40
pdflatex or (Lua|Xe)LaTeX? Do you expect to have macros (say \emph or similar) in the text you want to process? – egreg Nov 17 '22 at 09:53
I believed that TeX expands until it finds a number and, if none is found, it will recover the macro ("kill my self"). So, in the case "\catcode`\ =11 " (note the ending space), would the last space stay in category 10 or 11 (or would it even be eaten)? – Daniel Bandeira Nov 17 '22 at 10:40
\expandafter\strlen\frase. BTW, if you want to count characters, see https://tex.stackexchange.com/questions/233085/basics-of-parsing?r=SearchResults&s=1%7C39.3909 – John Kormylo Nov 17 '22 at 15:08