More efficient string extraction of words

Question

I want to execute some code if some words are present in a latex3 string. I came up with my own implementation, basically doing a loop using \str_map_inline and keeping track of the last part of the current word using \str_put_right, but it turns out to be like 500x slower than what I would expect (comparing with \str_if_in:NnTF that should roughly do the same amount of operations), which makes my overall library 20% slower for one tiny operation. Any idea what I did wrong?

MWE:

\documentclass{article}
\usepackage{l3benchmark}
\begin{document}
Test
\ExplSyntaxOn
%%%%%%%%%%%%%% Library to make more efficient
% \__robExt_auto_forward_words:NN \commandToRunOnEachWord \stringToSearchOn
\cs_set:Nn \__robExt_auto_forward_words:NN {
  % \l_tmpa_str will contain the current word read so far
  \str_set:Nn \l_tmpa_str {}%
  \str_map_inline:Nn #2 {
    % \token_case_charcode:NnTF ##1 {} {} {}
    \__robExt_if_letter:nTF {##1} {
      \str_put_right:Nn \l_tmpa_str {##1}
    }{
      \str_if_empty:NTF \l_tmpa_str { } {
        % if the current word is non-empty, we run the command on it
        #1 \l_tmpa_str%
        \str_set:Nn \l_tmpa_str {}% we reset its value
      }
    }
  }
}
%% \__robExt_if_letter:nTF {char} {true} {false} tests if an element is a letter
%% https://tex.stackexchange.com/a/700864/116348
\prg_new_conditional:Npnn \__robExt_if_letter:n #1 { TF }
{
  \bool_lazy_or:nnTF
  {
    \bool_lazy_and_p:nn
    { \int_compare_p:nNn { `#1 } > { `a - 1 } }
    { \int_compare_p:nNn { `#1 } < { `z + 1 } }
  }
  {
    \bool_lazy_and_p:nn
    { \int_compare_p:nNn { `#1 } > { `A - 1 } }
    { \int_compare_p:nNn { `#1 } < { `Z + 1 } }
  }
  \prg_return_true:
  \prg_return_false:
}
% \robExt_register_match_word {namespace that defaults to empty} {word} {code to run if word is present} 
\cs_set:Nn \robExt_register_match_word:nnn {
  \cs_set:cn {l__robExt_execute_if_word_present_#1_#2:} {#3}
}
% \robExt_try_to_execute_if_match_word:nn {namespace} {word}
\cs_set:Nn \robExt_try_to_execute_if_match_word:nn {
  \cs_if_exist:cTF {l__robExt_execute_if_word_present_#1_#2:} {%
    \cs_if_exist:cTF {l__robExt_execute_if_word_present_#1_#2__already_forwarded:}{\message{Already forwarded}}{
      \use:c {l__robExt_execute_if_word_present_#1_#2:}%
      % define it so that we do not import twice next time
      \cs_set:cx {l__robExt_execute_if_word_present_#1_#2__already_forwarded:} {}
    }
  } { }
}
\cs_generate_variant:Nn \robExt_try_to_execute_if_match_word:nn { nV }
%%%%%%%%%%%%%% Usage
\robExt_register_match_word:nnn {} {grapes} {I~like~grapes.\\}
\robExt_register_match_word:nnn {} {grapefruits} {I~hate~grapefruits.\\}
%% This string is already created for other reasons, so you can safely assume it exists
\str_new:N \l_my_str
\str_set:Nn \l_my_str {In~the~market~you~can~find~some~grapes~and~grapefruits.}
My~string~is~''\l_my_str''.\newline
\NewDocumentCommand{\testAutoForward}{}{
  \cs_set:Nn \__robExt_tmp_fct:N {
    \message{I will try to run ##1}
    \robExt_try_to_execute_if_match_word:nV {} ##1
  }
  \__robExt_auto_forward_words:NN \__robExt_tmp_fct:N \l_my_str
}
\cs_new:Nn \robExt_benchmark_me:n {
  \benchmark:n {#1}
  Number~of~operations~taken~by:\par\texttt{\detokenize{#1}}\par~is~\fp_to_scientific:N\g_benchmark_ops_fp.
  Time~taken~by:\par\texttt{\detokenize{#1}}\par is~\fp_to_scientific:N\g_benchmark_time_fp.
}
\fp_new:N \l_robExt_fp
\fp_set_eq:NN \l_robExt_fp \g_benchmark_time_fp
\robExt_benchmark_me:n {\testAutoForward}
\par Second test (reference time I'd like to reach):\par
\robExt_benchmark_me:n {
  \str_if_in:NnTF \l_my_str {grapes}{%
    % Not sure why I cannot print this without getting "TeX capacity exceeded"; I guess because it is repeated many times?
    % I~like~grapes.
  }{}
  \str_if_in:NnTF \l_my_str {grapefruits}{}{}
}
% Not sure why this prints "ERROR: Use of ??? doesn't match its definition."
% The~reference~implementation~is~\fp_eval:n{(\g_benchmark_time_fp) / (\l_robExt_fp)}~times~faster.
\ExplSyntaxOff
\end{document}

EDIT

To answer the comments more precisely: I have a string (LaTeX3, i.e. every character should have catcode other or space, I think) \mystring, and I want to extract all words ([a-zA-Z]+) and run the corresponding code that might have been registered beforehand via \registerWord{myWord}{some code}. So, if \mystring contains:

In the market you can find some grapes, apples, and grapefruits.

and if I ran \registerWord{grapes}{\message{I like grapes}}, then running \extractAndExecuteWords \mystring should run \message{I like grapes}.

My first try with plain LaTeX (it has multiple issues: spaces are removed from the string, and I can't find how to insert braces in the macro, so I insert \bgroup tokens instead, which is not equivalent; the answers to "How can I add a single curly brace to a macro?" give me weird errors):

\documentclass{article}
\begin{document}
\ExplSyntaxOn
\str_new:N \l_my_str
\str_set:Nn \l_my_str {In~the~market~you~can~find~some~grapes, apples,~and~grapefruits.}
\let\myString\l_my_str
\ExplSyntaxOff
\makeatletter
% \autoForwardWords \commandToRunOnEachWord \stringToSearchOn
\def\autoForwardWords#1#2{%
  \def\robExt@tmp@word{}%
  \let\robExt@cmd@to@run#1%
  \message{AAAAAAAAA #2}%
  \edef\robExt@list@of@commands{%
    \noexpand\robExt@cmd@to@run\noexpand\bgroup%
    \expandafter\autoForwardWords@aux#2\robExt@end@of@string% \robExt@end@of@string marks the end of the string
  }%
  %% This shows the command to run, with two issues:
  %% 1) it removed spaces in the string
  %% 2) I can't find how to add braces instead of bgroups.
  %%    I tried https://tex.stackexchange.com/questions/506613/how-can-i-add-a-single-curly-brace-to-a-macro
  %%    but I was getting errors.
  %%\show\robExt@list@of@commands
  \robExt@list@of@commands
}
\def\autoForwardWords@aux#1{%
  \ifx#1\robExt@end@of@string% We arrived at the end of the string
    \noexpand\bgroup%
  \else%
    \ifnum`#1>\numexpr`a-1\relax%
      \ifnum`#1<\numexpr`z+1\relax%
        #1%
      \else%
        \noexpand\egroup\noexpand\robExt@cmd@to@run\noexpand\bgroup%
      \fi%
    \else%
      \ifnum`#1>\numexpr`A-1\relax%
        \ifnum`#1<\numexpr`Z+1\relax%
          #1%
        \else%
          \noexpand\egroup\noexpand\robExt@cmd@to@run\noexpand\bgroup%
        \fi%
      \else%
        \noexpand\egroup\noexpand\robExt@cmd@to@run\noexpand\bgroup%
      \fi%

    \fi%
    \expandafter\autoForwardWords@aux% let it grab the next character
  \fi%
}
\def\robExt@end@of@string{}
\def\printWord#1{I saw --((#1))--.}
\autoForwardWords\printWord\myString
\makeatother
\end{document}

You want to do two things here: split based on spaces (which tell you if things are words), and use a f-type expansion approach to avoid assignments. I'll see if I can write something later, but you might want to look at the implementation of \text_lowercase:n (in general terms). — Joseph Wright, Nov 15 '23 at 16:53
Actually my notion of a word is any group of a-zA-Z letters, so words might be separated by commas, braces etc., not just spaces. I'll give it a look; I did try to read the implementation of \str_if_in, but it was quite mysterious to me. — tobiasBora, Nov 15 '23 at 18:40
If you will accept a solution without using expl3 (which will be probably more efficient) then you should exactly describe, what do you want to do. The mention of macro names from expl3 isn't sufficient because I don't understand your expl3 code and I don't want to study expl3 documentation. — wipet, Nov 15 '23 at 18:53
@wipet sure, I’m fine with non-expl3 code. I gave it a try following the spirit of your previous answer but with some artistic tries to make it work on strings… but I got stuck by spaces in string and braces in command, see my edit. — tobiasBora, Nov 16 '23 at 01:27
If you run \documentclass{article} \usepackage{listofitems} \ignoreemptyitems \begin{document} \def\mystring{In the market you can find some grapes, apples, and grapefruits.} \setsepchar[+]{ ||,||.||?||!||;||:||'||/}% define punctuations% \readlist\myterms\mystring \foreachitem\z\in\myterms[]{``\z'' } \end{document}, you will see that the listofitems package can parse input, using defined punctuation and spaces as delimiters. The array \myterms[<n>] gives the n'th word in your input. The extraction should, I believe, be fast. — Steven B. Segletes, Nov 16 '23 at 01:58

wipet · Accepted Answer · 2023-11-17T05:58:23.627

You can try this code:

\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup \settocomma { };:."?!@+=\{\}\relax
   \lowercase\ea{\ea\gdef\ea#1\ea{#1}}%
   \edef#1{\detokenize\ea{#1}}%
%  \message{\string#1: \meaning#1} % prints the modified format of the scanned macro
   \ea\egroup
   \ea\wordscan#1,\relax,%
}
\def\settocomma #1{\ifx\relax#1\else \lccode`#1=`, \ea\settocomma\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else 
%  \message{{#1}}  % prints each scanned "word"
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}
\regword {grapes}  {\message{I like grapes.}}
\regword {find}    {\message{We are searching somewhat.}}
\def\mystring{In the {market} you can find some grapes, apples? and grapefruits.}
\scanmacro\mystring % runs \message{We are searching somewhat.}
                    % and \message{I like grapes.}

We replace every non-letter character with a comma using \lowercase, then use \detokenize to reset the catcodes of those commas to that of a normal comma. So, the macro

In the {market} you can find some grapes, apples? and grapefruits.

looks like this after its modification:

in,the,,market,,you,can,find,some,grapes,,apples,,and,grapefruits,
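
As a minimal plain TeX sketch of the \lowercase remapping on its own (the macro name \demo and the sample text are made up for illustration, they are not part of the answer's code): setting the \lccode of a character to the code of a comma makes \lowercase turn that character into a comma, while letters are lowercased as usual.

```latex
% Plain TeX sketch; \demo is a hypothetical macro name used only here.
\begingroup
  \lccode`\?=`,  % "?" will be mapped to a comma by \lowercase
  \lccode`\!=`,  % "!" likewise
  % \lowercase converts the character tokens of its argument first;
  % \endgroup then restores the \lccode table before \def is executed
  \lowercase{\endgroup \def\demo{Apples?Pears!Plums}}
\message{\meaning\demo}% shows: macro:->apples,pears,plums
\bye
```

Here ? and ! already have catcode "other", so the resulting commas are ordinary ones. The \detokenize step in \scanmacro matters for spaces and braces, which keep their original catcodes after \lowercase.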

Then we scan the macro with \wordscan, whose parameter #1 is delimited by a comma, and process each scanned word individually. Note that there are several "empty words". This is not a problem, because the empty word is never registered. Removing the occurrences of ,, before \wordscan would only add useless computation time.

You must list every non-letter character that you expect to appear in scanned macros after \settocomma, terminated by \relax. Note that the first { } means a space and the final \{\} means { and }, so these are replaced by commas too.

Only control sequences inside the scanned macro (\mystring here) are not handled by this code. We suppose that none are present. If that is not true, then you have to add a second

\edef#1{\detokenize\ea{#1}}%

just before \lowercase. You can then decide whether \word should be interpreted like word (add \\ to the list of "to comma" characters) or kept distinct from it (don't add \\ to the list). In this second case you can register \word separately from word.
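
A sketch of that variant, assuming plain TeX and assuming the first option is wanted (\\ is added to the list, so \word is read as word). The sample string with the undefined control sequence \find is invented for the illustration; everything else follows the code above.

```latex
% Plain TeX sketch of the variant with an extra \detokenize pass.
\let\ea=\expandafter
\def\scanmacro#1{%
   % "\\" is appended to the list, so a backslash also becomes a comma
   \bgroup \settocomma { };:."?!@+=\{\}\\\relax
   % extra first pass: control sequences become plain characters
   \edef#1{\detokenize\ea{#1}}%
   \lowercase\ea{\ea\gdef\ea#1\ea{#1}}%
   \edef#1{\detokenize\ea{#1}}%
   \ea\egroup
   \ea\wordscan#1,\relax,%
}
\def\settocomma #1{\ifx\relax#1\else \lccode`#1=`, \ea\settocomma\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}
\regword {grapes}  {\message{I like grapes.}}
\regword {find}    {\message{We are searching somewhat.}}
% \find is deliberately undefined: it is never executed, only detokenized
\def\mystring{In the {market} you can \find some grapes.}
\scanmacro\mystring % should run both registered messages
\bye
```

If \\ is left out of the list instead, the backslash stays attached to the word, so only a word registered as \regword{\find}{...} would match it.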

EDIT

Following your comment about keeping uppercase letters, I created another approach which doesn't use \lowercase but instead runs a macro token by token to replace non-letter characters with commas. The advantage of this approach is that you needn't run a macro over a list of "other" characters (which can be very large), nor over a list of all uppercase letters (which can also be very large in the Unicode set). The disadvantage is that token-by-token macro processing is probably less efficient than \lowercase.

\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup 
   \edef#1{\detokenize\ea{#1}}%
   \edef#1{\ea\replspace#1 \relax}%  replaces spaces to comma
   \edef#1{\ea\replothers#1\relax}%  replaces other characters to comma
%   \message{\string#1: \meaning#1} % prints the modified format
   \ea\egroup \ea\wordscan#1,\relax,%
}
\def\replspace#1 #2{#1\ifx#2\relax \else ,#2\ea\replspace\fi}
\def\replothers#1{\ifx#1\relax\else \ifnum\lccode`#1=0 ,\else #1\fi \ea\replothers\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else 
%  \message{{#1}}  % prints each scanned "word"
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}
\regword {grapes}  {\message{I like grapes}}
\regword {find}    {\message{We are searching somewhat}}
\def\mystring{In the {market} you can find some grapes, apples? and grapefruits.}
\scanmacro\mystring % runs \message{We are searching somewhat}
                    % and \message{I like grapes}
\bye

The main concept is the same: we replace spaces and non-letter characters with commas and run \wordscan. We recognize non-letter characters as those whose \lccode is zero.
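
To see which characters the \lccode test treats as letters, a small check like the following may help. It is a sketch for plain TeX, and \showlccode is a hypothetical helper introduced only for this illustration.

```latex
% Plain TeX sketch; \showlccode is a hypothetical helper, not part of the answer.
\def\showlccode#1{\message{[\string#1: \the\lccode`#1]}}
% With the plain TeX defaults this reports 97 for a, 122 for Z, and 0 for ? and ,
\showlccode a  \showlccode Z  \showlccode ?  \showlccode ,
\bye
```

It is worth running this in your own format as well, since a format or language package could assign a nonzero \lccode to a non-letter, in which case \replothers would keep that character as part of a word.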

Thanks a lot, this is way more efficient, barely noticeable in my library! (maybe 20% less efficient than \str_if_in, which is the good order of magnitude I’d expect) And this is a really smart idea to let the macro system automatically grab all text until the next ,, and using \lowercase is a nice clever trick. If I want to preserve the uppercase/lower case stuff of letters, I would need to write \lccode`L=`L for all letters? (possibly with a loop maybe) — tobiasBora, Nov 16 '23 at 23:25
@tobiasBora Yes, you have to set \lccode for all uppercase letters, but that can be a huge set if we support all of Unicode, so this approach isn't very effective. So, I added another example in an edit to my answer. — wipet, Nov 17 '23 at 06:01
If I understand (from its name) what \str_if_in does: it searches an individual "word" from a given token list. If you have more registered words then you have to repeat \str_if_in for each such a word. So, we cannot compare efficiency of single run of \str_if_in with searching all registered words using macro presented here. — wipet, Nov 17 '23 at 06:08
Thanks a lot, it's very helpful to read that code! So I was comparing with a single run of \str_if_in just to get an order of magnitude of the time that such a macro should take to run, since \str_if_in must also loop over all chars of the string. But of course to be usable I would need to repeat it for each word I want to replace, yet with my old solution it was still interesting when replacing less than 500 words. Now it starts to be less interesting after only 2 words. — tobiasBora, Nov 17 '23 at 07:36
