How can I parse the first word in a token stream, token by token?

Question

Some programming editors and IDEs highlight legal and illegal variable names in different ways. I wish listings easily allowed for such syntax checking and highlighting, but it doesn't, at least at the moment. My longer-term objective is to implement a solution compatible with listings but, for now, I'm looking at a very simplified version of the problem.

I want to parse the first word in a token stream composed only of character and space tokens. This has to be done token by token, though, because I have to perform some checks on each token. For that, I use a recursive macro that parses the stream, saving it in a \Word macro, until a . token is encountered. At that stage, I process \Word in a certain way, such as printing it in red. I define a word as a sequence of character tokens uninterrupted by any space token.

Problem: I only use a . token here because I can't figure out what I should change in my code in order to stop the recursion when the next space token, not a . token, is encountered in the stream. Substituting a control space (\ ) for . doesn't seem to do the trick. What should I change in my code?

I favour a solution using low-level TeX commands for the parsing, but LaTeX2e and LaTeX3 alternatives are also of interest to me.

Edit: I apologise if I seem to be moving the goalposts, but I had to clarify my question quite a bit by adding the "token by token" requirement, which may invalidate some of the answers already posted.


\documentclass{article}

\usepackage{xcolor}

\def\ParseWordandPrint#1{%
    \if #1.%
        \textcolor{red}{\Word}\ %
    \else
        \edef\Word{\Word#1}
        \expandafter\ParseWordandPrint
    \fi%
}

\def\InitParseWordandPrint#1{%
    \def\Word{}
    \ParseWordandPrint#1%
}

\begin{document}

\InitParseWordandPrint Hello.World

\end{document}

You can surely add the token by token processing when you have the various items stored in a sequence. It mostly depends on what you want to do during this processing. — egreg, Dec 21 '13 at 21:43

egreg · Answer 1 · 2013-12-21T22:30:20.540

It's really easy with LaTeX3:

\documentclass{article}
\usepackage{xparse,xcolor}

\ExplSyntaxOn
\NewDocumentCommand{\printfirstwordincolor}{ O{red} m }
 {
  \jubobs_pfwr:nn { #1 } { #2 }
 }

\seq_new:N \l_jubobs_words_seq
\tl_new:N \l_jubobs_first_word_tl

\cs_new_protected:Npn \jubobs_pfwr:nn #1 #2
 {
  % split the input at spaces
  \seq_set_split:Nnn \l_jubobs_words_seq { ~ } { #2 }
  % pop off the leftmost item
  \seq_pop_left:NN \l_jubobs_words_seq \l_jubobs_first_word_tl
  % print the first item in the chosen color
  \textcolor{#1}{ \l_jubobs_first_word_tl } ~ %
  % print the other items adding spaces between them
  \seq_use:Nn \l_jubobs_words_seq { ~ }
 }

\ExplSyntaxOff

\begin{document}

\printfirstwordincolor{Hello World}

\printfirstwordincolor[green]{Addio mondo crudele}

\end{document}


If you also want to process the input token by token, you can do a mapping on the saved items. Let's say you want to capitalize every ‘d’:

\documentclass{article}
\usepackage{xparse,xcolor}

\ExplSyntaxOn
\NewDocumentCommand{\printfirstwordincolor}{ O{red} m }
 {
  \jubobs_pfwc:nn { #1 } { #2 }
 }

\seq_new:N \l_jubobs_words_seq
\tl_new:N \l_jubobs_first_word_tl
\bool_new:N \l_jubobs_first_item_bool

\cs_new_protected:Npn \jubobs_pfwc:nn #1 #2
 {
  \seq_set_split:Nnn \l_jubobs_words_seq { ~ } { #2 }
  \seq_pop_left:NN \l_jubobs_words_seq \l_jubobs_first_word_tl
  \textcolor{#1}{ \l_jubobs_first_word_tl } ~ %
  \seq_use:Nn \l_jubobs_words_seq { ~ }
 }

\NewDocumentCommand{\printfirstwordincolorandcapitalizeD} { O{red} m }
 {
  \jubobs_pfwcacd:nn { #1 } { #2 }
 }

\cs_new_protected:Npn \jubobs_pfwcacd:nn #1 #2
 {
  \seq_set_split:Nnn \l_jubobs_words_seq { ~ } { #2 }
  \leavevmode
  \bool_set_true:N \l_jubobs_first_item_bool
  \seq_map_inline:Nn \l_jubobs_words_seq
   {
    \bool_if:NT \l_jubobs_first_item_bool
     { \c_group_begin_token \color{#1} }
    \tl_map_inline:nn { ##1 }
     {
      \peek_charcode_remove:NT d { D } ####1
     }
    \bool_if:NT \l_jubobs_first_item_bool
     { \c_group_end_token \bool_set_false:N \l_jubobs_first_item_bool }
    \c_space_tl
   }
  \unskip
 }
\ExplSyntaxOff

\begin{document}

\printfirstwordincolor{Hello World}

\printfirstwordincolorandcapitalizeD[blue]{Addio mondo crudele}

\end{document}
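The same mapping idea can be used for the per-token check the question asks about. The following is only a sketch under stated assumptions: the command \checkfirstword and the variables \l_jubobs_check_words_seq and \l_jubobs_check_first_tl are names made up for this illustration, and the input is assumed to contain only character and space tokens (no brace groups or control sequences). It maps over the first word and reports every token whose category code is not "letter".

```latex
\documentclass{article}
\usepackage{xparse,xcolor}

\ExplSyntaxOn
% hypothetical names, chosen for this sketch only
\seq_new:N \l_jubobs_check_words_seq
\tl_new:N \l_jubobs_check_first_tl

\NewDocumentCommand{\checkfirstword}{ O{red} m }
 {
  % split the input at spaces and pop off the first word
  \seq_set_split:Nnn \l_jubobs_check_words_seq { ~ } { #2 }
  \seq_pop_left:NN \l_jubobs_check_words_seq \l_jubobs_check_first_tl
  % visit the first word one token at a time
  \tl_map_inline:Nn \l_jubobs_check_first_tl
   {
    % report any token that does not have the category code of a letter
    \token_if_letter:NF ##1
     { \iow_term:n { illegal~character~##1 } }
   }
  % print the first word in color, then the remaining words
  \textcolor{#1}{ \l_jubobs_check_first_tl } ~
  \seq_use:Nn \l_jubobs_check_words_seq { ~ }
 }
\ExplSyntaxOff

\begin{document}

\checkfirstword{W0r1d is here}

\end{document}
```

The test inside the mapping is the place to put a stricter rule, for example allowing digits after the first token.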


I'm still not very familiar with LaTeX3 syntax, but the comments you left in your code will be helpful. Thanks for this LaTeX3 alternative. — jub0bs, Dec 21 '13 at 18:48

score 6 · Answer 2 · answered Dec 21 '13 at 19:44

TeX's \def supports delimited parameters, so in this case you can use the space as the delimiter:

\documentclass{article}

\usepackage{xcolor}

\def\ParseWordandPrint#1{%
    \if #1.%
        \textcolor{red}{\Word}\ %
    \else%
        \edef\Word{\Word#1}%
        \expandafter\ParseWordandPrint%
    \fi%
}

\def\InitParseWordandPrint#1{%
    \def\Word{}%
    \ParseWordandPrint#1%
}
\def\highlightfirst#1 {\textcolor{red}{#1} }

\begin{document}

\InitParseWordandPrint Hello.World

\highlightfirst Hello World.

\end{document}
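As background for why the space has to be a delimiter here: when TeX looks for an undelimited argument it skips space tokens, so #1 in the recursive \ParseWordandPrint can never be a space, whatever it is compared against. A minimal sketch of the difference follows; the macro names \pair and \upto are made up for this illustration.

```latex
\documentclass{article}

% Undelimited parameters: TeX skips space tokens while looking for each
% argument, so the space between "a" and "b" is never passed as #2.
\def\pair#1#2{(#1)(#2)}

% Space-delimited parameter: #1 is everything up to the first space token,
% and that delimiting space is consumed by the macro.
\def\upto#1 {[#1]}

\begin{document}

\pair a b          % typesets (a)(b)

\upto Hello World  % typesets [Hello]World

\end{document}
```

This is also why \highlightfirst above puts a space back after the closing brace of \textcolor: the space that ended the argument has already been removed.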

Your approach solves my simplified problem. However, I really need to parse a word token by token in order to check whether it be a legal variable name or not, and I don't think I can use your approach for that. I've clarified my question. — jub0bs, Dec 21 '13 at 20:51

David Carlisle · Accepted Answer · 2013-12-22T00:21:46.470

Undeleting this in light of your edited question.

It is easier to grab the word in one go with a space-delimited argument and then iterate over it letter by letter. As an example, I check here that the characters are letters; the last example writes the following to the terminal and log:

illegal character 0
illegal character 1

To iterate character by character, stopping at a space, you tend to have to use \futurelet (or equivalently \@ifnextchar), but that's not so good in a word you are going to typeset, as it's hard not to break inter-letter kerns and ligatures. So it is easier to grab the word first.

\documentclass{article}

\usepackage{xcolor}

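% Grab the first word as a space-delimited argument, check it, then print it.
\def\InitParseWordandPrint#1 {\check#1\relax\textcolor{red}{#1} }

% Step through the word one token at a time; \relax marks the end.
\def\check#1{%
\ifx\relax#1%
\else
\ifcat a#1% same category code as a letter?
\else
\typeout{illegal character #1}%
\fi
\expandafter\check
\fi}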

\begin{document}

\InitParseWordandPrint Hello World


\InitParseWordandPrint  World

Hello World

\InitParseWordandPrint  W0r1d 

\end{document}
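For comparison, the \futurelet route mentioned above could look roughly like the following. This is a sketch, not a drop-in solution: the helper names \next@tok, \ParseWord@peek, \ParseWord@grab and \ParseWord@done are made up for this illustration, and the input is assumed to contain only character and space tokens, with a space somewhere after the word. The word is accumulated in \Word and typeset in one piece at the end, so kerns and ligatures inside it are not broken.

```latex
\documentclass{article}
\usepackage{xcolor}

\makeatletter
% Start with an empty accumulator and peek at the first token.
\def\InitParseWordandPrint{%
  \def\Word{}%
  \futurelet\next@tok\ParseWord@peek}

% \next@tok now has the meaning of the upcoming token, which is still
% in the input. \@sptoken is the LaTeX kernel's implicit space token.
\def\ParseWord@peek{%
  \ifx\next@tok\@sptoken
    \expandafter\ParseWord@done
  \else
    \expandafter\ParseWord@grab
  \fi}

% Not a space: absorb one token, check it, append it, and peek again.
\def\ParseWord@grab#1{%
  \ifcat a#1\else\typeout{illegal character #1}\fi
  \edef\Word{\Word#1}%
  \futurelet\next@tok\ParseWord@peek}

% Space reached: print the word; the space itself stays in the input.
\def\ParseWord@done{\textcolor{red}{\Word}}
\makeatother

\begin{document}

\InitParseWordandPrint Hello World

\InitParseWordandPrint W0r1d again

\end{document}
```

Peeking is needed because grabbing an undelimited #1 skips space tokens, so the space can only be detected before the token is absorbed.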

Thanks, David. I think you deserve the checkmark because I was more in favour of a TeX solution. — jub0bs, Dec 22 '13 at 13:29
