32

This seems like it should be trivial, but how do I partition a string into length n substrings? I can of course write something like

chunk[s_, n_] := StringJoin[#] & /@ Partition[Characters[s], n]

so that chunk["ABCDEF",2] -> {"AB","CD","EF"} but this appears unnecessarily cumbersome.

István Zachar
David G

9 Answers

25

Try this:

StringCases["ABCDEFGHIJK", LetterCharacter ~~ LetterCharacter]

{"AB", "CD", "EF", "GH", "IJ"}

or for more general cases (i.e. not just for letters, but any characters, and for any partition size):

stringPartition1[s_String, n_Integer] := StringCases[s, StringExpression @@ Table[_, {n}]];

It is more elegant though to use Repeated (thanks rcollyer):

stringPartition2[s_String, n_Integer] := StringCases[s, Repeated[_, {n}]];

stringPartition2["longteststring", 4]

{"long", "test", "stri"}
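
Both definitions above drop a trailing remainder shorter than n ("ng" in the last example). A minimal sketch of a variant that keeps it, assuming Repeated matches greedily inside StringCases; the name stringPartitionAll is hypothetical and not part of the original answer:

```mathematica
(* hypothetical helper: like stringPartition2, but keeps a short final piece *)
stringPartitionAll[s_String, n_Integer?Positive] :=
  StringCases[s, Repeated[_, {1, n}]];

stringPartitionAll["longteststring", 4]
(* {"long", "test", "stri", "ng"} *)
```

The only change is the repetition range {1, n} instead of {n}, so full-length pieces are still preferred and only the last piece can be shorter.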

István Zachar
20

Here is the regular-expression way:

chunk[s_, n_] := 
 StringCases[s, RegularExpression[".{1," <> ToString[n] <> "}"]]

chunk["Hello this is a test string", 2]

{"He", "ll", "o ", "th", "is", " i", "s ", "a ", "te", "st", " s", "tr", "in", "g"}

chunk["Hello this is a test string", 4]

{"Hell", "o th", "is i", "s a ", "test", " str", "ing"}

Note that the final substring in each result is shorter than the chunk size but is still included.

If you don't want to include them, change the regular expression from ".{1," <> ToString[n] <> "}" to ".{" <> ToString[n] <> "}".
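
The two regular expressions can be folded into one definition. The sketch below is an illustration, not the original answer's code: chunkRegex and its keepRemainder argument are hypothetical names, and the (?s) prefix is an added assumption so that "." also matches newline characters, which it does not do by default.

```mathematica
(* hypothetical variant; keepRemainder chooses between ".{1,n}" and ".{n}" *)
chunkRegex[s_String, n_Integer?Positive, keepRemainder_ : True] :=
  StringCases[s,
   RegularExpression[
    "(?s).{" <> If[TrueQ[keepRemainder], "1,", ""] <> ToString[n] <> "}"]];

chunkRegex["Hello this is a test string", 4, False]
(* {"Hell", "o th", "is i", "s a ", "test", " str"} *)
```

Calling chunkRegex with two arguments keeps the short final piece, matching the behavior of the first pattern above.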

Jens
15

Another possibility:

StringTake[#, 
   Partition[Range@StringLength@#, 2, 2, 1, {}]] &@"abcdefghi"

giving

(*  {"ab", "cd", "ef", "gh", "i"} *)
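
The code above relies on the block length being 2: StringTake reads each pair {i, j} as a character range, whereas a longer index list such as {1, 2, 3} would be read as {start, end, step}. A sketch of one way to generalize to any block length by passing only the start and end positions; chunkTake is a hypothetical name:

```mathematica
(* hypothetical generalization: build {start, end} ranges of width n *)
chunkTake[s_String, n_Integer?Positive] :=
  With[{len = StringLength[s]},
   StringTake[s, {#, Min[# + n - 1, len]} & /@ Range[1, len, n]]];

chunkTake["abcdefghi", 2]
(* {"ab", "cd", "ef", "gh", "i"} *)

chunkTake["abcdefghi", 4]
(* {"abcd", "efgh", "i"} *)
```
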
user1066
12

Not completely original, but very compact.

chunk[s_, n_] := StringJoin@@@Partition[Characters[s], n, n, 1, {}]

Update for V10.1

The new built-in StringPartition does exactly this:

StringPartition["ABCDEF",2]

{"AB", "CD", "EF"}

J. M.'s missing motivation
Murta
8

This will give better performance (3 times faster in my test, partitioning into length-two strings) than your original code:

chunk[s_, n_] := FromCharacterCode@Partition[ToCharacterCode[s], n]

The reason is that the first few steps of the computation are done with packed arrays.

It will still be slower than the regex-based approaches (István's and Jens's), on my machine by a factor of 2.

The StringTake approach is much slower than all the others on my machine.
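
The packed-array claim can be checked directly with Developer`PackedArrayQ. This is only a sketch on a short illustrative string; no outputs are shown for the packing checks because they may depend on the version and on the input:

```mathematica
s = "ABCDEFGH";
codes = ToCharacterCode[s];

(* each call reports whether that intermediate result is a packed array *)
Developer`PackedArrayQ[codes]
Developer`PackedArrayQ[Partition[codes, 2]]

(* the final step turns each row of character codes back into a string *)
FromCharacterCode[Partition[codes, 2]]
(* {"AB", "CD", "EF", "GH"} *)
```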

Benchmarks

Function definitions:

(* original *)
chunk1[s_, n_] := StringJoin[#] & /@ Partition[Characters[s], n]

(* István *)
chunk2[s_, n_] := StringCases[s, Repeated[_, {n}]]

(* Jens *)
chunk3[s_, n_] := StringCases[s, RegularExpression[".{1," <> ToString[n] <> "}"]]

(* TomD *)
chunk4 = StringTake[#, Partition[Range@StringLength@#, #2, #2, 1, {}]] &;

(* mine *)
chunk5[s_, n_] := FromCharacterCode@Partition[ToCharacterCode[s], n]

text = ExampleData[{"Text", "Hamlet"}];
testString = StringJoin[ConstantArray[text, 20]];

StringLength[testString] (* 3438740 *)

Timings:

(* original *)
In[10]:= Timing[chunk1[testString, 2];]
         Timing[chunk1[testString, 100];]

Out[10]= {5.968, Null}
Out[11]= {1.703, Null}

(* István - fastest *)
In[12]:= Timing[chunk2[testString, 2];]
         Timing[chunk2[testString, 100];]

Out[12]= {1.25, Null}
Out[13]= {0.11, Null}

(* Jens - fastest *)
In[14]:= Timing[chunk3[testString, 2];]
         Timing[chunk3[testString, 100];]

Out[14]= {1.313, Null}
Out[15]= {0.125, Null}

(* TomD *)
In[16]:= Timing[chunk4[testString, 2];]
         Timing[chunk4[testString, 100];]

(* More than a few minutes. Didn't wait for it to finish ... *)

(* mine *)
In[18]:= Timing[chunk5[testString, 2];]
         Timing[chunk5[testString, 100];]

Out[18]= {2.25, Null}
Out[19]= {0.266, Null}

Conclusion: use regex-based methods. The built-in string patterns also use a regex library internally, I believe, but they are easier to construct programmatically because they are represented as expressions.

Szabolcs
  • Scrollbar ate your timing results... First I thought it was so lightning fast you did not bother to write it out in numbers :) – István Zachar May 17 '12 at 09:54
  • 2
    Thanks, Szabolcs, and everyone who answered. As ever, more than one way to skin a cat with Mathematica. I'm a bit surprised that there isn't a StringPartition function taking the same arguments as Partition as a built-in, analogous to StringTake vs. Take. – David G May 17 '12 at 13:23
  • +1, I did not know about that behavior of ToCharacterCode and FromCharacterCode. It makes splitting a string into individual characters a lot simpler. – rcollyer May 17 '12 at 13:38
  • 3
    Your code can be sped up further by using Developer`PartitionMap to apply FromCharacterCode to each term in the list. This does not unpack the list. The speedup is marginal, though: on my machine 1.25793 -> 1.12617 and 0.135588 -> 0.095252. So, still not a contender for the fastest method. – rcollyer May 17 '12 at 16:38
  • 1
    @rcollyer Good point! I never used Developer`PartitionMap before. (I've seen it in the docs, but I didn't realize its significance: not unpacking.) – Szabolcs May 17 '12 at 16:40
  • I really would like to see the internals of the Developer` package. Mostly to see how one goes about avoiding unpacking. That, and I'd like a variant of that that works with Internal`PartitionRagged. – rcollyer May 17 '12 at 16:43
  • @rcollyer It should be possible to write something like PartitionMap using LibraryLink. I have to admit, though, that I never manipulated general expressions with it (only numerical tensors). The idea would be not to map first and evaluate later (i.e. Sqrt /@ {4,5} -> {Sqrt[4], Sqrt[5]} -> {2, Sqrt[5]}), but to evaluate first, then insert into the (preallocated) array. – Szabolcs May 17 '12 at 16:48
  • I had not thought of using LibraryLink to do that sort of processing ... – rcollyer May 17 '12 at 16:51
  • @rcollyer Actually, what I described could be implemented efficiently in Mathematica, I think. The difficulty would be to keep the result array packed as well. So I started experimenting, and found something I don't really understand: if Map maps first, and leaves the result to be evaluated later, as my example above was meant to demonstrate, then why is Developer`PackedArrayQ[Sqrt /@ N@Range[100]] === True? – Szabolcs May 17 '12 at 16:55
  • @rcollyer I get the same behaviour with Developer`PackedArrayQ[Identity /@ N@Range[100]]---no unpacking. But I can't manage to create a function myself that I could use in place of Identity or Sqrt in this example and it would not unpack. Does Map simply special case all these? I understand they would special-case Sqrt, but why Identity? – Szabolcs May 17 '12 at 16:58
  • It is length based. Try, using Range[50] instead. – rcollyer May 17 '12 at 17:07
  • 1
    @rcollyer I solved the 'mystery': it was auto-compilation. It avoids unpacking only if the list is above the auto-compilation length. – Szabolcs May 17 '12 at 17:08
  • @rcollyer Your comment was not yet there when I started writing mine, but it appeared when I finished :-) – Szabolcs May 17 '12 at 17:08
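
The length dependence discussed in these comments can be probed with a short experiment. This is a sketch under assumptions: packedAfterMapQ is a hypothetical helper, and "MapCompileLength" is assumed to be the system option that sets the auto-compilation threshold for Map. Outputs are not shown because they may differ between versions.

```mathematica
(* assumed to be the list length at which Map starts auto-compiling *)
SystemOptions["CompileOptions" -> "MapCompileLength"]

(* hypothetical helper: is the result of mapping over a packed list still packed? *)
packedAfterMapQ[len_Integer?Positive] :=
  Developer`PackedArrayQ[Identity /@ N @ Range[len]];

(* compare a list below the threshold with one at or above it *)
packedAfterMapQ[50]
packedAfterMapQ[100]
```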
6

I wanted a solution using StringSplit[]:

chunk[s_, n_] := 
 StringSplit[s, RegularExpression["(.{" <> ToString@n <> "})"] -> "$1"] ~DeleteCases~ ""

Dr. belisarius
  • Cleaner, IMO: StringSplit[s, x : Repeated[_, {n}] :> x][[;; ;; 2]]. I don't see an advantage over StringCases however. – Mr.Wizard May 04 '14 at 11:05
  • @Mr.Wizard No advantage. I just wanted a way to use StringSplit[] as said. But I'm too old to remember why I wanted that. – Dr. belisarius May 05 '14 at 03:49
4

Historical note:

WolframLanguageData["StringPartition"
 , {"VersionIntroduced", "DateIntroduced"}]

{10.1, DateObject[{2015, 3, 30}, "Day", "Gregorian", 5.]}


The command supports UpTo as well as offset specs, similar to Partition.

str = "ABCDEFGHIJK";

StringPartition[str, UpTo[2]]

{"AB", "CD", "EF", "GH", "IJ", "K"}

StringPartition[str, 2, 1]

{"AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK"}

StringPartition[str, UpTo[7]]

{"ABCDEFG", "HIJK"}

Syed
4

Using SequenceCases (new in 10.1)

str = "ABCDEFGHIJK";

SequenceCases[Characters[str], x : {Repeated[_, {2}]} :> StringJoin[x]]

{"AB", "CD", "EF", "GH", "IJ"}

SequenceCases[Characters[str], x : {Repeated[_, {1, 2}]} :> StringJoin[x]]

{"AB", "CD", "EF", "GH", "IJ", "K"}

SequenceCases[Characters[str], x : {Repeated[_, {1, 7}]} :> StringJoin[x]]

{"ABCDEFG", "HIJK"}

SequenceCases[
 Characters[str], 
 x : {Repeated[_, {2}]} :> StringJoin[x], 
 Overlaps -> True]

{"AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK"}

SequenceCases[
 Characters[str], 
 x : {Repeated[_, {1, 2}]} :> StringJoin[x], 
 Overlaps -> True]

{"AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "K"}

eldo
1
str = "ABCDEFGHIJK";

Using MovingMap:

MovingMap["" <> # &, Characters[str], 1][[1 ;; -1 ;; 2]]

{"AB", "CD", "EF", "GH", "IJ"}

E. Chan-López