I've recently learned how to create a macro to iterate over a token sequence using \afterassignment and \let in TeX.

Currently, I'm trying to apply this to a tokenization macro (where "tokenization" means splitting a character sequence around a delimiting character or character sequence); that is, an algorithm stringTokenize which does the following:

  1. Iterate over each character in the string.
  2. If the character is not the delimiter (for example, ,), add it to the buffer; if the character is the delimiter, expand the buffer, do something with it (that is, pass it to a given macro) and clear the buffer.

This is not as simple as I thought, however, because apparently a character that has been \let to a command cannot be expanded inside the definition of another command (in order to build a "buffer" structure). For example,

\edef\buffer{\buffer\character}

will (after n iterations) expand to

{\character\character(...)\character}

which is n times the character last \let to \character.
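
A minimal plain TeX sketch of the effect, reusing the names \buffer and \character from above (the two \let assignments stand in for two loop iterations; this is only an illustration, not the code I'm actually running):

```tex
\def\buffer{}
% "iteration" 1: \character is an implicit character token meaning a
\let\character=a \edef\buffer{\buffer\character}
% "iteration" 2: \character now means b; the \edef stores the token
% \character itself, not the character it currently stands for
\let\character=b \edef\buffer{\buffer\character}
\message{\meaning\buffer}% log: macro:->\character \character
\buffer % typesets "bb", not "ab"
\bye
```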

How do I do this?


Here's what I have so far:

\documentclass{minimal}

\makeatletter

\def\structures@stringIterate#1#2{%

    \newcount\@@index%

    \def\@@stop{§}
    \def\@@next{\afterassignment\@@each\let\@@current= }
    \def\@@each{%
      \ifx\@@current\@@stop%
         \let\@@next=\relax
      \else%
        \advance\@@index1\relax
         #2{\the\@@index}{\@@current}\let\@@next=\@@next
      \fi%
      \@@next % macro expansion must be delayed until the condition has been evaluated
    }

    \@@next#1\@@stop

}

\def\structures@stringTokenize#1#2#3{%

    \newcount\@@index
    \@@index 0\relax

    \def\@@buffer{}

    \def\@@test##1##2{%
        \ifx#2##2
            \advance\@@index 1\relax
            #3{\the\@@index}{\@@buffer}
            \def\@@buffer{} % reset the buffer
        \else
            \edef\@@buffer{\@@buffer##2} % gather character
        \fi
    }

    \structures@stringIterate{#1}{\@@test}

}

\makeatother

\begin{document}

\makeatletter

    \def\@@print#1#2{token (#1): ``#2''\par}
    \structures@stringTokenize{Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, All the King's horses and all the King's men, Couldn't put Humpty together again.}{,}{\@@print}

\makeatother

\end{document}
FK82
  • Is there a reason you are doing things this way? The natural approach to grabbing a delimited argument would be \def\foo#1<token>{<do stuff>}. – Joseph Wright Jun 22 '14 at 09:46
  • Also, what are the conditions we need to take account of: can we use e-TeX, do you want an expandable iterator, ...? – Joseph Wright Jun 22 '14 at 09:50
  • Also, you seem to be looking for \futurelet! – Joseph Wright Jun 22 '14 at 09:55
  • @JosephWright No other real reason than learning the TeX way of things. The conditions would be as core TeX as possible. Can you expand on the use of \futurelet in this context? – FK82 Jun 22 '14 at 10:22
  • BTW, what you get out of this process is a series of tokens, not a string. In TeX terms, a 'string' is a series of tokens of category code 12, with the exception of spaces which have category code 10. – Joseph Wright Jun 22 '14 at 11:52
  • @JosephWright Yes, that's exactly the problem. The result of my code is a series of tokens \character\character(...)\character where the respective string would be Hu(...).. Since \character will (correctly) be assigned to the last character reached within the loop, the result is n times the expansion of \character (which is not the desired result). – FK82 Jun 22 '14 at 11:57
  • BTW, there are various issues with your code not related to the matter in hand. You shouldn't have blank lines in your definitions (mentioned in comments on http://tex.stackexchange.com/questions/184438/how-do-you-define-macros-to-toggle-a-character-active-and-define-a-macro-for-it), you shouldn't have \newcount\@@index twice (it really should be outside your macros), and you need to watch your line ends (some important % are missing). – Joseph Wright Jun 22 '14 at 15:18

1 Answer


The usual approach to grabbing tokens with a delimiter in TeX would be something like

\catcode`\@=11 %

\def\grabargs#1#2#3{%
  \def\grabargs@aux##1#2{%
    \ifx\relax##1\relax%
    \else
      #3{##1}\par
      \expandafter\grabargs@aux
    \fi
  }%
  \grabargs@aux#1#2\relax#2
}
\def\print#1{`{\tt #1}'}
\catcode`\@=12 %

\grabargs
  {Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, All the King's horses and all the King's men, Couldn't put Humpty together again.}
  {,}
  {\print}

\bye

where, depending on how complex things need to be, we can use a fancier approach to stopping the loop.
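
As one possible "fancier" stop test, here is a sketch that compares each item against a dedicated end marker instead of testing for \relax. The names \grabitems, \grabitems@aux, \grab@tmp, \q@stop and \showitem are made up for this illustration. Unlike the \ifx\relax##1\relax test, it does not stop early on an empty item; it assumes the items contain no \par and no # tokens.

```tex
\catcode`\@=11 %
% \q@stop is a hypothetical end marker: a macro that expands to itself
% and is never meant to be executed.
\def\q@stop{\q@stop}
\def\grabitems#1#2#3{%
  \def\grabitems@aux##1#2{%
    \def\grab@tmp{##1}% store the item so it can be compared as a macro
    \ifx\grab@tmp\q@stop % true only when the item is exactly \q@stop
    \else
      #3{##1}%
      \expandafter\grabitems@aux
    \fi
  }%
  \grabitems@aux#1#2\q@stop#2%
}
\def\showitem#1{\message{[#1]}}
\catcode`\@=12 %

\grabitems{a,,b}{,}{\showitem}% writes [a], [] and [b] to the log

\bye
```

The test works because \ifx on two macros compares their replacement texts: \grab@tmp and \q@stop are equal only when the grabbed item is the single token \q@stop.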

What you can't do is use \let here (at least not in any particularly easy way): this creates implicit character tokens which, as you've found, are not expanded in, for example, an \edef. What you might be looking to do is pick up space tokens: they are tricky to grab as a normal argument (#1), so you might use \futurelet in something like

\catcode`\@=11 %

\newcount\mycount
\def\grabargs#1#2#3{%
  \def\grabbed@tokens{}%
  \mycount\z@
  \def\grabargs@auxii{%
    \ifx\grab@token#2%
      \grabargs@auxiv
      \def\grabbed@tokens{}%
      \expandafter\grabargs@auxv
    \else
      \ifx\grab@token\@sptoken
        \expandafter\expandafter\expandafter\grabargs@space
      \else
        \ifx\grab@token\relax
        \else
          \expandafter\expandafter\expandafter\expandafter\expandafter
            \expandafter\expandafter\grabargs@auxiii
        \fi
      \fi
    \fi
  }%
  \def\grabargs@auxiv{%
    #3{\the\mycount}\grabbed@tokens
    \mycount\z@
  }%
  \grabargs@auxi#1#2\relax
}
\def\grabargs@auxi{%
  \futurelet\grab@token\grabargs@auxii
}
\def\grabargs@auxiii#1{%
  \edef\grabbed@tokens{%
    \grabbed@tokens
    #1%
  }%
  \advance\mycount\@ne
  \grabargs@auxi
}
\def\grabargs@auxv#1{\grabargs@auxi}
\expandafter\def\expandafter\grabargs@space\space{%
  \grabargs@auxiii\space
}
\def\:{\let\@sptoken= } \:  %
\def\print#1#2{Tokens: #1 `{\tt #2}'\par}
\catcode`\@=12 %

\grabargs
  {Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, All the King's horses and all the King's men, Couldn't put Humpty together again.}
  {,}
  {\print}

\bye

The idea here is that \futurelet is very similar to your \afterassignment\@@each\let\@@current= except that it doesn't remove the token from the input stream. We peek at a token using \futurelet and can then examine it. If it's the delimiter, we pass the tokens grabbed so far (and their count) to the given macro, then loop. If it's the end marker (\relax here), we stop, and if it's a 'normal' token we grab it as an argument and store it. If it's a space, we use a special version of the 'grab' macro which can handle a space. What I've not added in the above is code to deal with brace groups: that can be done, but the approach depends on what you want to do about such things.
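
To see the "peek without removing" behaviour in isolation, here is a small sketch. The names \peek, \peek@tok and \peek@report are hypothetical and exist only for this demonstration.

```tex
\catcode`\@=11 %
% \futurelet makes \peek@tok an alias of the token that follows
% \peek@report, then runs \peek@report with that token still in place.
\def\peek{\futurelet\peek@tok\peek@report}
\def\peek@report{\message{next token: \meaning\peek@tok}}
\catcode`\@=12 %

\peek a% log: "next token: the letter a"; the a is still typeset afterwards
\expandafter\peek\space b% log: "next token: blank space"; \space supplies a real space token

\bye
```

The second call uses \expandafter so that a genuine space token follows \peek; a space typed directly after a control word would be skipped when the input is tokenized.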


Note that in both of the above implementations we are storing tokens, which could mean issues with category code 6 tokens (usually #) and how you want to manage line breaks, etc. This can all be sorted out, but without a tighter spec I'm leaving it alone. Also, note that it's easy to get caught out by \futurelet in an \halign: see for example Where do I find \futurelet's nasty behaviour documented?.

Joseph Wright
  • \futurelet will always expand the second token it receives, right? That's undesirable in this case, because I want to leave the decision whether or not to print the tokenized sub-string to the document to the macro passed as the third argument to \structures@stringTokenize. Probably should have stated that more clearly. Sorry! – FK82 Jun 22 '14 at 12:09
  • @FK82 No, \futurelet does an assignment to its first argument then executes its second argument (it's one of the most complex primitives in TeX!). The second arg of \futurelet is much like your \@@each in the question. As your demo uses a printing operation I have too: change \print to a no-op to see this. – Joseph Wright Jun 22 '14 at 12:15
  • @FK82 Also, as I've commented on the question you don't have a string here (and it's already tokenized): you've got a token list. – Joseph Wright Jun 22 '14 at 12:18
  • Where do I not have a string? – FK82 Jun 22 '14 at 14:17
  • @FK82 Everywhere: you've not detokenized anything. In TeX we deal with tokens which have category codes: for example a is usually a 'letter' (catcode 11) while # is a 'macro parameter' (catcode 6) and { is an 'open group' token (catcode 1). Note that something like \relax is one token, again assuming 'usual' circumstances. TeX 'strings' are generated by, for example, \meaning or \string: they have category code 12 ('other') for all characters except for " " (space), which has category code 10 ('space'). – Joseph Wright Jun 22 '14 at 14:22
  • Yeah, well, this amounts to saying there are no strings because everything is a token (of differing categories). – FK82 Jun 22 '14 at 16:05
  • @FK82 Not exactly: one can treat things without using catcode (see for example \if versus \ifx). I'm just trying to get across that in TeX a 'string' is a particular thing which is not the same as a 'token list', and it's important to understand the difference. – Joseph Wright Jun 22 '14 at 16:08
  • I see, seems like I have to learn a bit more about basic concepts of TeX. – FK82 Jun 22 '14 at 17:31
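
The distinction drawn in the comments above between a 'string' character and an ordinary letter token, and between \if and \ifx, can be checked with a short plain TeX sketch. The macro names \othera and \lettera are made up for this illustration.

```tex
\edef\othera{\string a}% "a" with category code 12 (a TeX 'string' character)
\def\lettera{a}%         "a" with category code 11 (a letter token)

% \ifx on two macros compares their replacement texts token by token,
% including category codes, so these two differ.
\ifx\othera\lettera \message{ifx: same}\else \message{ifx: different}\fi

% \if compares character codes only and ignores category codes.
\expandafter\if\othera a\message{if: same character}\else \message{if: different}\fi

\bye
```

Under the usual category codes the first test reports "ifx: different" and the second "if: same character".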