
I have a string that I want to parse into numbers and non-numbers.

For my purposes:

A Number is EITHER a sequence of consecutive digits OR a sequence of consecutive digits followed by a . and another sequence of consecutive digits.

A Non-Number is anything that is not a Number.

For example

ljksadflh23898129hfafh0324.22234

should be parsed into something like

ljksadflh, 23898129, hfafh, 0324.22234

or

ljksadflh/23898129/hfafh/0324.22234

or whatever floats your boat as long as the list retains the same ordering.

Marco Daniel
  • 95,681
Uiy
  • 6,132
  • I believe one could just insert a delimiter at the first digit of every number and then split the string (if a . occurs before a digit, no delimiter would be inserted). I would then need to convert the stringified numbers into usable numbers. – Uiy Mar 16 '12 at 13:11
  • Do you care about edge conditions like 123.456.789? It seems it could parse to either 123.456 / . / 789 or 123 / . / 456.789 – StrongBad Mar 16 '12 at 13:42
  • no, not really. – Uiy Mar 16 '12 at 18:57

4 Answers


This one works purely by expansion, so it is safe to use inside \write and similar expansion-only contexts:

$ tex split
This is TeX, Version 3.1415926 (Web2C 2010)
(./split.tex
ljksadflh, 23898129, hfafh, 0324.22234
6, ljksadflh, 23898129, hfafh, 0324.22234
 )
No pages of output.
Transcript written on split.log.




\def\xytest#1{?%
\ifnum`#1<46?\else
     \ifnum`#1>57?\else
     \ifnum`#1=47?\else!\fi\fi\fi}

\def\x#1{%
 \ifx\relax#1%
 \else
 \if\xytest#1%
 #1\expandafter\expandafter\expandafter\xx
 \else
 #1\expandafter\expandafter\expandafter\y
 \fi
 \fi}

\def\xx#1{%
 \ifx\relax#1%
 \else
 \if\xytest#1%
 #1\expandafter\expandafter\expandafter\xx
 \else
 , #1\expandafter\expandafter\expandafter\y
 \fi
 \fi}


\def\y#1{%
 \ifx\relax#1%
 \else
 \if\xytest#1%
 , #1\expandafter\expandafter\expandafter\xx
 \else
 #1\expandafter\expandafter\expandafter\y
 \fi
 \fi}


\immediate\write20{\x ljksadflh23898129hfafh0324.22234\relax}

\immediate\write20{\x 6ljksadflh23898129hfafh0324.22234\relax}


\bye
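Because the macros work purely by expansion, the same call can also be captured in a macro with \edef rather than written straight to the terminal. The fragment below is only a sketch: it assumes the definitions of \x, \xx, \y and \xytest above are already loaded and that it is placed in split.tex before \bye. The name \result is a hypothetical choice, not part of the answer.

```latex
% Sketch: store the split string in a macro by full expansion.
% Assumes \x, \xx, \y and \xytest are defined as above;
% \result is a hypothetical name used only for this illustration.
\edef\result{\x ljksadflh23898129hfafh0324.22234\relax}

% Show what was stored (appears on the terminal and in the log).
\message{\meaning\result}
```

The stored text is the comma-separated list, so any comma-list loop can then walk over the pieces.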
David Carlisle
  • 757,742

With the experimental (but pretty much ready for release) package l3regex (found in the l3experimental bundle on CTAN), this task is a piece of cake.

\documentclass{article}
\usepackage{l3regex,xparse}
\ExplSyntaxOn
\seq_new:N \l_uiy_result_seq
\NewDocumentCommand { \UiySplit } { m }
  {
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_result_seq
    \seq_map_inline:Nn \l_uiy_result_seq { item:~##1\par }
  }
\ExplSyntaxOff
\begin{document}
  \UiySplit{ljksadflh23898129hfafh0324.22234}
\end{document}

The \regex_extract_all:nnN line splits the user input #1 into pieces, each of which is either one or more (+) non-digits (\D) or (|) one or more digits (\d), optionally followed by a dot (\., escaped because an unescaped dot has a special meaning) and zero or more digits (\d*). The fractional part is made optional by ? acting on the group (...), which is written (?:...) so that it is non-capturing. The next line maps over all the matches found, with ##1 being a single match. Of course, you can do whatever you want with the items of the sequence \l_uiy_result_seq.
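Since the matches end up in a sequence, a single piece can be picked out by position. The following is a minimal sketch, assuming a LaTeX release from 2020-10 or later, where \NewDocumentCommand and the regex module are available without loading l3regex or xparse. The names \UiyItem and \l_uiy_parts_seq are hypothetical and only serve this illustration.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names for this sketch: \l_uiy_parts_seq, \UiyItem
\seq_new:N \l_uiy_parts_seq
% #1 = position (counted from 1), #2 = string to split
\NewDocumentCommand { \UiyItem } { m m }
  {
    % Same regex as above: runs of non-digits, or a number
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#2} \l_uiy_parts_seq
    % Leave the requested item in the input stream
    \seq_item:Nn \l_uiy_parts_seq {#1}
  }
\ExplSyntaxOff
\begin{document}
Second piece: \UiyItem{2}{ljksadflh23898129hfafh0324.22234}
\end{document}
```

With the example string, position 2 should select 23898129, because \seq_item:Nn counts items from 1.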

Edit: The module also provides regular expression replacements. If I remember the syntax correctly, the following should work.

\ExplSyntaxOn
\tl_new:N \l_uiy_result_tl
\NewDocumentCommand { \UiySplit } { m }
  {
    \tl_set:Nn \l_uiy_result_tl {#1}
    \regex_replace_all:nnN
        { (\D+) (\d+(?:\.\d*)?) }
        { \c{uiy_do:nn} \cB{\1\cE} \cB{\2\cE} }
        \l_uiy_result_tl
    \tl_use:N \l_uiy_result_tl
  }
\cs_new_protected:Npn \uiy_do:nn #1#2 { \use:c {#1} {#2} }
\ExplSyntaxOff

This time, I catch both the sequence of non-digits and the number as captured groups, \1 and \2. Each such occurrence is replaced by the macro \uiy_do:nn (the \c escape in this case means "build a command"), then a begin-group (\cB) character { (this time, \c indicates the category code), then the non-digits (\1), then an end-group (\cE) character }, then another \cB{, the number (\2), and a closing \cE}.

After that, the token list consists of calls of the form \uiy_do:nn {<non-digits>} {<number>}, for example \uiy_do:nn {hfafh} {0324.22234}. We then simply use its contents with \tl_use:N. The final step is to actually define \uiy_do:nn. Here, I defined it as simply building a command from #1 and giving it the argument #2. This very simple action could be done at the replacement step using \c{\1} for "build a command from the contents of group \1", and technically it would be slightly better, producing an "undefined control sequence" error if the relevant command is not defined. Another way to get that error detection is to replace \use:c {#1} {#2} by \cs_if_exist_use:cF {#1} { \msg_error:nnx { uiy } { undefined-command } } {#2}, with an appropriately defined error message.
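The \c{\1} variant mentioned above could look roughly as follows. This is a sketch under assumptions, not a tested drop-in: it assumes a LaTeX release from 2020-10 or later (regex module and \NewDocumentCommand in the kernel) and writes the brace characters in the replacement as \cB\{ and \cE\}. The names \UiyApply, \l_uiy_pairs_tl, \mywidth and \myheight are hypothetical.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names for this sketch: \l_uiy_pairs_tl, \UiyApply
\tl_new:N \l_uiy_pairs_tl
\NewDocumentCommand { \UiyApply } { m }
  {
    \tl_set:Nn \l_uiy_pairs_tl {#1}
    % Turn each <non-digits><number> pair into \<non-digits>{<number>}
    \regex_replace_all:nnN
      { (\D+) (\d+(?:\.\d*)?) }
      { \c{\1} \cB\{ \2 \cE\} }
      \l_uiy_pairs_tl
    \tl_use:N \l_uiy_pairs_tl
  }
\ExplSyntaxOff
% Hypothetical target macros; the names in the string must match these
\newcommand{\mywidth}[1]{width=#1; }
\newcommand{\myheight}[1]{height=#1; }
\begin{document}
\UiyApply{mywidth12.5myheight3}
\end{document}
```

The call is intended to behave like \mywidth{12.5}\myheight{3}; a name in the string with no matching macro gives the usual "undefined control sequence" error.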

  • This tastes like Black Forest cake :) Thanks for the regex. Great package. – yannisl Mar 17 '12 at 02:30
  • I like this method as it is much easier to extend by just changing the regexp. I see that latex3 has the ability to map tokens to functions... well, only one function. Is it possible to map a token list to a "function list"? I would like to extend your macro to map non-numbers to macros. Basically my strings are pairs. – Uiy Mar 17 '12 at 05:18
  • @Uiy: So you want to replace nondigits123.456moreletters34.56 by \nondigits{123.456}\moreletters{34.56}? – Bruno Le Floch Mar 17 '12 at 05:20
  • @YiannisLazarides Thanks. That (and some string conversion functions) was 4 months of my life :). – Bruno Le Floch Mar 17 '12 at 05:21
  • @BrunoLeFloch Yep! I may want to expand the names first. Instead of nondigits123.456moreletters34.56 I might want to use nd123.45ml3456 but still expand to what you have. (So first I'll map the condensed names to full names, then to actual macro commands.) – Uiy Mar 17 '12 at 05:33
  • @BrunoLeFloch This is the good part of life, enjoy it! Better than working on some obscure thesis :) – yannisl Mar 17 '12 at 06:54
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Document commands
% \NewDocumentCommand\parsestring { m } { \uiy_parse_string:n { #1 } }

\NewDocumentCommand\MyMacro { s O{0} m }
  {
   \IfBooleanTF { #1 }
     {
      \uiy_parse_string:n { #3 }
      \uiy_my_macro_parsed:n { #2 }
     }
     {
      \clist_set:Nn \l_uiy_parsed_string_clist { #3 }
      \uiy_my_macro_parsed:n { #2 }
     }
  }

% Inner commands
\tl_const:Nn \c_uiy_numbers_tl {0123456789.}
\tl_new:N \l_uiy_parsed_string_tl
\clist_new:N \l_uiy_parsed_string_clist
\seq_new:N \l_uiy_main_seq
\bool_new:N \l_uiy_previous_is_number_bool
\cs_generate_variant:Nn \tl_if_in:NnTF {NV}

\cs_new:Npn \uiy_parse_string:n #1
  {
   \tl_clear:N \l_uiy_parsed_string_tl
   \seq_set_split:Nnn \l_uiy_main_seq {} { #1 }
   \seq_pop_left:NN \l_uiy_main_seq \l_uiy_parsed_string_tl
   \tl_if_in:NVTF \c_uiy_numbers_tl \l_uiy_parsed_string_tl
     { \bool_set_true:N \l_uiy_previous_is_number_bool }
     { \bool_set_false:N \l_uiy_previous_is_number_bool }
   \seq_map_inline:Nn \l_uiy_main_seq
     {
      \tl_if_in:NnTF \c_uiy_numbers_tl { ##1 }
        {
         \bool_if:NF \l_uiy_previous_is_number_bool
           { \tl_put_right:Nn \l_uiy_parsed_string_tl { , } }
         \tl_put_right:Nn \l_uiy_parsed_string_tl { ##1 }
         \bool_set_true:N \l_uiy_previous_is_number_bool
        }
        {
         \bool_if:NT \l_uiy_previous_is_number_bool   
           { \tl_put_right:Nn \l_uiy_parsed_string_tl { , } }
         \tl_put_right:Nn \l_uiy_parsed_string_tl { ##1 } 
         \bool_set_false:N \l_uiy_previous_is_number_bool
        }
     }
   \clist_set:NV \l_uiy_parsed_string_clist \l_uiy_parsed_string_tl
  }

\cs_new:Npn \uiy_my_macro_parsed:n #1
  {
   \int_compare:nTF { #1 = 0 }
     {
      \clist_map_inline:Nn \l_uiy_parsed_string_clist
        {
         % here you do something with the successive items
         item: ~  ##1 \par
        }
     }
     {
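      % here you do something with the #1-th item
      % (#1 - 1 assumes 0-based \clist_item:Nn, as when this was written;
      % current expl3 counts items from 1, where { #1 } would be used instead)
      \clist_item:Nn \l_uiy_parsed_string_clist { #1 - 1 } \par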
     }
  }
\ExplSyntaxOff

\begin{document}

1. \MyMacro*{ljksadflh23898129hfafh0324.22234}

\bigskip

2. \MyMacro{ljksadflh,23898129,hfafh,0324.22234}

\bigskip

\bigskip

3. \MyMacro*[2]{ljksadflh23898129hfafh0324.22234}

\bigskip

4. \MyMacro[2]{ljksadflh,23898129,hfafh,0324.22234}

\end{document}

As the examples show, an unparsed string is passed with the *-variant, while the optional argument selects a single item of the list instead of printing all of them. Of course, it's not possible to say more without knowing the intended usage of the data.
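For array-like access to an already comma-separated list, one could loop over the positions and fetch each item by index. The following is a minimal sketch, assuming a LaTeX release from 2020-10 or later and the current 1-based indexing of \clist_item:Nn. The names \UiyShowIndexed and \l_uiy_demo_clist are hypothetical.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical names for this sketch: \l_uiy_demo_clist, \UiyShowIndexed
\clist_new:N \l_uiy_demo_clist
\NewDocumentCommand { \UiyShowIndexed } { m }
  {
    \clist_set:Nn \l_uiy_demo_clist {#1}
    % ##1 runs from 1 to the number of items in the list
    \int_step_inline:nn { \clist_count:N \l_uiy_demo_clist }
      { [##1]~=~\clist_item:Nn \l_uiy_demo_clist {##1} \par }
  }
\ExplSyntaxOff
\begin{document}
\UiyShowIndexed{ljksadflh,23898129,hfafh,0324.22234}
\end{document}
```

Each output line pairs a position with its item, starting at [1] for ljksadflh.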


egreg
  • 1,121,712
  • Can you show me how to pass this comma-delimited string as an array to a macro? So \MyMacro{\parsestring{ljksadflh23898129hfafh0324.22234}} will be the same as \MyMacro{ljksadflh,23898129,hfafh,0324.22234}? – Uiy Mar 16 '12 at 19:50
  • Also, is there a way to simply get at the elements? I want to access elements in the list analogously to an array in most procedural languages, e.g., myarray[0] = ljksadflh, myarray[1] = 23898129, etc... – Uiy Mar 16 '12 at 20:00
  • 1. It all depends on what your \MyMacro is supposed to do; if you want to pass it the unparsed string, then define it in a suitable way (I can't say how, of course, since I don't know what it is to be doing). 2. Yes, when the parsed list is in a token list variable it's very easy to get at the n-th item. But in this case it would be better to make it into a "clist" variable. – egreg Mar 16 '12 at 20:29
  • It doesn't matter what \MyMacro is doing. I simply want to pass each element of the list you created to it so that #1 = ljksadflh, #2 = 23898129, etc... – Uiy Mar 16 '12 at 20:32
  • I simply want \MyMacro{\parsestring{ljksadflh23898129hfafh0324.22234}} to be functionally equivalent to \MyMacro{ljksadflh,23898129,hfafh,0324.22234} – Uiy Mar 16 '12 at 20:33
  • @Uiy I don't see any advantage in passing a string like the first when the second is more easily manageable. It might seem more compact, but why choose to entangle the input when you have to disentangle it anyway? – egreg Mar 16 '12 at 20:55
  • It's because you are not taking into account the fact that I am using it in an application. Just because I gave you a simple example and you do not see any reason for it doesn't mean there isn't a reason. – Uiy Mar 16 '12 at 20:58
  • It's like asking "What is the use of a new born baby"? By itself a baby is absolutely useless but does that mean it is useless in the proper context? – Uiy Mar 16 '12 at 20:59