5

Say I have a string that contains numbers and words, such as this one:

string = "there are 1234 words and numbers 5678 in here $999";

How would I separate the string into an ordered list containing sublists populated with words and numbers? The ideal list would look like this:

idealList = {{there are}, {1234}, {words and numbers}, {5678}, {in here}, {$999}}

I know how to extract all words and all numbers, but I can't create a list like the previous one.


Here's an example of what I tried to extract words and its output:

StringCases[string, RegularExpression["\\w(?<!\\d)[\\w'-]*"]]
{there, are, words, and, numbers, in, here}

I can also do this with pattern-matching instead of RegEx, but it doesn't get me closer to my goal.

Is my regex simply wrong, or does this problem require a slightly more involved solution?

CHM
  • In my view, using regular expressions is not good, in the sense that regular expressions degrade performance; they take more time. –  Jul 05 '12 at 04:14
  • @gaurab: in the absence of concrete evidence (e.g. timing results) to back up your claim, your words are far from being an answer. I have thus turned your "answer" into a comment. – J. M.'s missing motivation Jul 05 '12 at 04:23

4 Answers

12

What about this?

StringSplit[string, i : NumberString :> i]
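As a rough illustration of why this keeps the numbers: when the delimiter is given as a rule, StringSplit inserts the right-hand side at the position of each delimiter instead of discarding it. The sketch below reuses the string from the question and only contrasts the two forms.

```mathematica
string = "there are 1234 words and numbers 5678 in here $999";

(* Plain pattern: each NumberString is treated as a delimiter and dropped *)
StringSplit[string, NumberString]

(* Rule pattern: the matched number i is put back where the delimiter was *)
StringSplit[string, i : NumberString :> i]
```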

OK, everyone's giving answers that actually work with the $, so here's an edit, as @kguler and @MrWizard suggested:

StringSplit[string, i : ("" | "$" ~~ NumberString) :> i] // StringTrim
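The grouping in that pattern depends on operator precedence: | (Alternatives) binds more tightly than ~~ (StringExpression). A small sketch for checking how a pattern is parsed, using HoldForm so nothing evaluates before FullForm displays it:

```mathematica
(* Groups as ("" | "$") ~~ NumberString: an optional "$", then a number *)
FullForm[HoldForm["" | "$" ~~ NumberString]]

(* Without parentheses this groups as (NumberString | "$") ~~ NumberString,
   which is not "a number, or a dollar sign followed by a number" *)
FullForm[HoldForm[NumberString | "$" ~~ NumberString]]

(* Explicit parentheses give the intended alternatives *)
FullForm[HoldForm[NumberString | ("$" ~~ NumberString)]]
```

Comparing the three displayed forms shows which subexpressions end up inside Alternatives and which inside StringExpression.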
Rojo
  • This works, but can you explain how it works please? – CHM Jul 05 '12 at 00:51
  • @CHM, it splits the strings where it finds a NumberString, keeping the NumberString – Rojo Jul 05 '12 at 00:52
  • You can also modify the pattern to use Alternatives as in StringSplit[string, i : NumberString | StringExpression["$" ~~ NumberString] :> i] or, StringSplit[string, i : NumberString | ("$" ~~ NumberString) :> i] (+1). – kglr Jul 05 '12 at 04:22
  • +1 -- however, I think you shouldn't need StringTrim after using StringSplit[string]. – Mr.Wizard Jul 05 '12 at 09:02
  • @kguler, I edited. I had tried that at first, but in my hurry I was bitten by the precedence of ~~ vs | and didn't take a second to realise that was the issue – Rojo Jul 05 '12 at 12:25
  • @Mr.Wizard, you're right, Riffle only puts stuff in between – Rojo Jul 05 '12 at 12:26
  • Yes, but after your edit you need StringTrim again! :^) Try this: StringSplit[string, i : ("" | "$" ~~ NumberString) :> i] // StringTrim – Mr.Wizard Jul 05 '12 at 12:31
  • Wanna go to work in my place today @MrWizard? It's not my day :) – Rojo Jul 05 '12 at 12:34
  • @Rojo lol -- not if you want a job to return to. =:-O – Mr.Wizard Jul 05 '12 at 13:05
6

Note that Rojo's solution splits the expression containing the dollar sign as well:

StringSplit["there are 1234 words and numbers 5678 in here $999", i : NumberString :> i]
{"there are ", "1234", " words and numbers ", "5678", " in here $", "999"}

If you don't want that splitting to happen, here's one way, using a regex:

StringSplit["there are 1234 words and numbers 5678 in here $999",
            s : RegularExpression[".(\\d+)."] :> s]
{"there are", " 1234 ", "words and numbers", " 5678 ", "in here ", "$999"}

If the spaces in the ends of the strings are bothersome, you can use StringTrim[] to get rid of them:

StringSplit["there are 1234 words and numbers 5678 in here $999",
            s : RegularExpression[".(\\d+)."] :> s] // StringTrim
{"there are", "1234", "words and numbers", "5678", "in here", "$999"}

As another example:

str1 = "At 50x magnification, they'd better be paying me $1080 in 9 installments!";

StringSplit[str1, s : RegularExpression[".(\\d+)."] :> s] // StringTrim
{"At", "50x", "magnification, they'd better be paying me", "$1080", "in",
 "9", "installments!"}

The other methods presented would perform a splitting like

{"At", "50", "x magnification, they'd better be paying me", "$1080", "in",
 "9", "installments!"}

which may or may not be the desired behavior...

J. M.'s missing motivation
  • One could of course use a StringExpression[] instead of a RegularExpression[]: s : (_ ~~ DigitCharacter .. ~~ _) :> s. – J. M.'s missing motivation Jul 05 '12 at 04:06
  • StringSplit["there are 1234 words and numbers 5678 in here $999", RegularExpression["\\$?(\\d)+"] -> "$0"] // StringTrim is a little bit more compact – Murta Jan 22 '13 at 00:22
3

I prefer the solutions given by Rojo and J.M. to the following one. But if you want to see a working version of your original approach with StringCases and RegularExpression, here is one possibility:

StringCases[string, RegularExpression["([A-Za-z]|\\s)+|(\\$|\\d)+"]]

It returns

{"there are ", "1234", " words and numbers ", "5678", " in here ", "$999"}

and as J.M. suggests above, apply StringTrim if desired. Handling decimals could also easily be added.
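A minimal sketch of one way decimals might be handled, assuming numbers have the form of digits optionally followed by a dot and more digits, with an optional leading dollar sign. The sample string is made up for illustration, and the character class is tightened slightly compared with the regex above.

```mathematica
(* Hypothetical sample string containing decimal numbers *)
sample = "costs $12.50 or 3.75 each";

(* Either a run of letters and whitespace, or an optional "$" followed by
   digits and an optional fractional part *)
StringCases[sample,
  RegularExpression["[A-Za-z\\s]+|\\$?\\d+(?:\\.\\d+)?"]] // StringTrim
```

Characters that match neither alternative (punctuation, for instance) are simply skipped by StringCases, so the character class would need extending for text like the second example in J.M.'s answer.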

Robert Miller
3

Further variations:

using StringReplace:

List @@ StringTrim /@ StringReplace[string, 
    a : Except[{"$", DigitCharacter}] .. | NumberString | ("$" ~~ NumberString) :> {a}]

or, using the same replacement rule in StringCases:

 StringTrim /@ StringCases[string, 
   a : Except[{"$", DigitCharacter}] .. | NumberString | ("$" ~~ NumberString) :> {a}]

both yield:

[Image: output of the two expressions above]

kglr