13

I have strings of the following form:

string = "ABC123DEFG456HI89UZXX1";

Letter keys of variable lengths are followed by (positive) integers.

I want to get this transformation:

{{"ABC", 123}, {"DEFG", 456}, {"HI", 89}, {"UZXX", 1}}

I have written:

chars = Characters @ string;

runs = Length /@ Split[IntegerQ @ ToExpression @ # & /@ chars]

{3, 3, 4, 3, 2, 2, 4, 1}

takes = Transpose[{# - runs + 1, #}] & [Accumulate @ runs]

{{1, 3}, {4, 6}, {7, 10}, {11, 13}, {14, 15}, {16, 17}, {18, 21}, {22, 22}}

 result =
    Partition[StringJoin /@ Map[Take[chars, #] &, takes], 2] /.
       {a_String, b_String} :> {a, ToExpression @ b}

{{"ABC", 123}, {"DEFG", 456}, {"HI", 89}, {"UZXX", 1}}

I have two questions:

(1) How could the above code be shortened / improved? (It seems too long for such a trivial problem.)

(2) What would a "direct" method (StringCases, StringSplit, ...) look like?

eldo

7 Answers

16

You asked for a shorter, improved version, so here is one using RegularExpression:

StringCases[string, RegularExpression["(\\D+)(\\d+)"] :> {"$1", ToExpression["$2"]}]
{{"ABC", 123}, {"DEFG", 456}, {"HI", 89}, {"UZXX", 1}}

Here's a version using StringSplit:

Partition[StringSplit[string, RegularExpression["(\\d+)"] :> FromDigits @ "$1"], 2]
RunnyKine
8
string = "ABC123DEFG456HI89UZXX1";

Partition[StringSplit[string, x : NumberString :> ToExpression@x], 2]

or

Partition[StringSplit[string, x : NumberString :> FromDigits@x], 2]

or

Split[StringSplit[string, x : NumberString :> FromDigits@x], Head@#2 === Integer &]
(* {{"ABC",123},{"DEFG",456},{"HI",89},{"UZXX",1}} *)
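To see why the Partition variants work, it may help to look at the intermediate list first. This is a minimal sketch (the name flat is only for illustration): StringSplit with a rule does not discard a delimiter but replaces it with the rule's right-hand side, so letters and numbers come back interleaved.

```mathematica
string = "ABC123DEFG456HI89UZXX1";

(* Each digit run is a delimiter; the rule keeps it, converted to an integer *)
flat = StringSplit[string, x : DigitCharacter .. :> FromDigits[x]]
(* {"ABC", 123, "DEFG", 456, "HI", 89, "UZXX", 1} *)

(* Pair up consecutive elements *)
Partition[flat, 2]
(* {{"ABC", 123}, {"DEFG", 456}, {"HI", 89}, {"UZXX", 1}} *)
```

This relies on the string starting with letters and ending with digits; otherwise the pairs would be misaligned.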
kglr
  • 2
    Somehow amazing that your short poem works. But inspecting it with Trace it became clearer to me :) – eldo Sep 14 '14 at 20:05
7

With all the answers we need a timing comparison.

Functions:

bel[string_] := 
  StringSplit[string, 
    PatternSequence[x : Longest[LetterCharacter ..], 
      y : Longest[DigitCharacter ..]] :> {x, ToExpression@y}][[;; ;; 2]];
RK1[string_] := 
  StringCases[string, RegularExpression["(\\D+)(\\d+)"] :> {"$1", ToExpression["$2"]}];
RK2[string_] := 
  Partition[StringSplit[string, RegularExpression["(\\d+)"] :> FromDigits@"$1"], 2];
ybk[string_] := 
  Partition[#, 2] &@StringSplit[string, x : DigitCharacter .. :> ToExpression@x];
kg1[string_] := Partition[StringSplit[string, x : NumberString :> ToExpression@x], 2];
kg2[string_] := Partition[StringSplit[string, x : NumberString :> FromDigits@x], 2];
kuba[string_] := 
  StringCases[string, x : LetterCharacter .. ~~ y : NumberString :> {x, ToExpression@y}];
al1[string_] := Module[{nu, le},
   nu = ToExpression[StringCases[string, DigitCharacter ..]];
   le = StringCases[string, LetterCharacter ..];
   Transpose[{le, nu}]
   ];
al2[string_] := Module[{nu, le},
   nu = ToExpression[StringSplit[string, __?LetterQ]];
   le = StringSplit[string, __?DigitQ];
   Transpose[{le, nu}]
   ];

Generator:

g = "a" <> RandomChoice[
     Join @@ ConstantArray @@@ {{2, 10}, {1, 26}} -> 
      CharacterRange["0", "Z"]~Drop~{11, 17}, #] <> "1" &;
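The generator is quite compact, so here is an unrolled sketch of what I understand it to do. The names alphabet, weights and gen are introduced only for illustration: it draws n random characters from the digits and capital letters, with each digit weighted twice as heavily as each letter, and wraps the result in "a" ... "1" so that the string always starts with a letter and ends with a digit.

```mathematica
(* "0".."Z" contains 7 punctuation characters at positions 11-17; drop them *)
alphabet = Drop[CharacterRange["0", "Z"], {11, 17}];

(* Weight 2 for each of the 10 digits, weight 1 for each of the 26 letters *)
weights = Join[ConstantArray[2, 10], ConstantArray[1, 26]];

(* Random test string of n + 2 characters: letter first, digit last *)
gen[n_Integer] := "a" <> StringJoin[RandomChoice[weights -> alphabet, n]] <> "1";

gen[20]  (* output is random, so none is shown here *)
```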

Benchmark Plot:

Needs["GeneralUtilities`"]

BenchmarkPlot[{bel, RK1, RK2, ybk, kg1, kg2, kuba, al1, al2}, g, "IncludeFits" -> True]

(Image: BenchmarkPlot output comparing the nine functions)

All methods have similar complexity except Algohi's second method, which incurs a heavy penalty.

The winning method is kguler's second function, followed closely by RunnyKine's second function which is nearly the same thing except for the use of regular expressions.
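If BenchmarkPlot is not available (as far as I know it is not among the documented system functions), a single timing can be spot-checked by hand. This is a rough sketch, assuming a hypothetical test string built by repeating one key/number pair; it is not a replacement for the comparison above.

```mathematica
(* kguler's second function, repeated here so the block is self-contained *)
kg2[s_String] := Partition[StringSplit[s, x : NumberString :> FromDigits[x]], 2];

(* Hypothetical input: 10^4 repetitions of "ABC123" *)
testString = StringJoin[ConstantArray["ABC123", 10^4]];

(* Average time in seconds over several runs; the value depends on the machine *)
First @ RepeatedTiming[kg2[testString];]
```

Replacing kg2 with any of the other functions gives a like-for-like number on the same input.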

Mr.Wizard
6

Without regular expressions:

string = "ABC123DEFG456HI89UZXX1";
StringSplit[string, PatternSequence[x : LetterCharacter .., y : DigitCharacter ..] :>
                                                   {x, ToExpression@y}][[;; ;; 2]]
(* {{"ABC", 123}, {"DEFG", 456}, {"HI", 89}, {"UZXX", 1}} *)
Dr. belisarius
  • 1
    @RunnyKine Yup. Thanks – Dr. belisarius Sep 14 '14 at 19:19
  • 1
    You come to StringCases with these revisions :) – ybeltukov Sep 14 '14 at 19:21
  • This code is currently broken due to the use of -> rather than :> in the rule. This means that ToExpression@y directly evaluates to y, therefore its existence is pointless and the output is in the wrong format. (Also x and y are not localized.) I corrected this but the edit was reverted. – Mr.Wizard Sep 15 '14 at 07:09
6

StringSplit with Partition works fine for this particular case:

Partition[#, 2] &@StringSplit[string, x : DigitCharacter .. :> ToExpression@x]
(* {{"ABC", 123}, {"DEFG", 456}, {"HI", 89}, {"UZXX", 1}} *)
ybeltukov
6
StringCases[string, 
            x : LetterCharacter .. ~~ y : NumberString :> {x, ToExpression@y}]
Kuba
4
nu = ToExpression[StringCases[string, DigitCharacter ..]];
le = StringCases[string, LetterCharacter ..];
Transpose[{le, nu}]

using StringSplit:

nu = ToExpression[StringSplit[string, __?LetterQ]];
le = StringSplit[string, __?DigitQ];
Transpose[{le, nu}]
Basheer Algohi