Partition string into chunks

Question

This seems like it should be trivial, but how do I partition a string into length n substrings? I can of course write something like

chunk[s_, n_] := StringJoin[#] & /@ Partition[Characters[s], n]

so that chunk["ABCDEF",2] -> {"AB","CD","EF"} but this appears unnecessarily cumbersome.

Welcome to Mathematica.SE! This is a good question, as there doesn't seem to be a direct built-in way (I bumped into this before). Please consider filling out the name field in your profile, so it will show as something easier to remember than 'user1268'. – Szabolcs May 17 '12 at 09:19

István Zachar · Accepted Answer · 2012-05-17T09:50:02.877

25

Try this:

StringCases["ABCDEFGHIJK", LetterCharacter ~~ LetterCharacter]

{"AB", "CD", "EF", "GH", "IJ"}

or for more general cases (i.e. not just for letters, but any characters, and for any partition size):

stringPartition1[s_String, n_Integer] := StringCases[s, StringExpression @@ Table[_, {n}]];

It is more elegant though to use Repeated (thanks rcollyer):

stringPartition2[s_String, n_Integer] := StringCases[s, Repeated[_, {n}]];

stringPartition2["longteststring", 4]

{"long", "test", "stri"}
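Note that stringPartition2 drops the leftover characters ("ng" in the example). If the short final chunk should be kept, one possible variant is to let Repeated match between 1 and n characters; string patterns match greedily, so every chunk except possibly the last has full length. This is only a sketch, and the name stringPartitionAll is made up for illustration:

```mathematica
(* stringPartitionAll is a hypothetical name, not part of the answer above *)
stringPartitionAll[s_String, n_Integer?Positive] :=
  StringCases[s, Repeated[_, {1, n}]];

stringPartitionAll["longteststring", 4]
(* {"long", "test", "stri", "ng"} *)
```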

edited May 17 '12 at 09:50

answered May 16 '12 at 22:52

István Zachar

Instead of Table[_,{n}], I'd consider using Repeated[_, {n}], instead. – rcollyer May 17 '12 at 01:34
Thanks @rcollyer, I felt I missed something, but it was too late. Incorporated now. – István Zachar May 17 '12 at 09:50

Jens · Answer 2 · 2012-05-17T00:16:36.017

Here is the regular-expression way:

chunk[s_, n_] := 
 StringCases[s, RegularExpression[".{1," <> ToString[n] <> "}"]]

chunk["Hello this is a test string", 2]

{"He", "ll", "o ", "th", "is", " i", "s ", "a ", "te", "st", " s", "tr", "in", "g"}

chunk["Hello this is a test string", 4]

{"Hell", "o th", "is i", "s a ", "test", " str", "ing"}

Note that the last substrings didn't fit the chunk size but were still included.

If you don't want to include them, change the regular expression from ".{1," <> ToString[n] <> "}" to ".{" <> ToString[n] <> "}".
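One caveat that may matter for multi-line input: in these regular expressions the dot does not match a newline by default, so the pattern above skips line breaks and starts a fresh chunk after each one. A minimal sketch of a workaround, assuming the inline (?s) flag suits your use case (chunkDotAll is a hypothetical name):

```mathematica
(* chunkDotAll is a hypothetical name; (?s) lets "." match newlines as well *)
chunkDotAll[s_String, n_Integer?Positive] :=
  StringCases[s, RegularExpression["(?s).{1," <> ToString[n] <> "}"]];

chunkDotAll["ab\ncdef", 3]
(* expected: {"ab\n", "cde", "f"}, with the newline kept inside the first chunk *)
```

The pattern-based versions that use _ should not need this, since _ matches any character.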

score 15 · Answer 3 · answered May 17 '12 at 00:20

Another possibility:

StringTake[#, 
   Partition[Range@StringLength@#, 2, 2, 1, {}]] &@"abcdefghi"

giving

(*  {"ab", "cd", "ef", "gh", "i"} *)
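As far as I can tell, this idiom relies on the chunk length being 2: StringTake reads a two-element list as a {from, to} range, but a three-element list as {from, to, step}, and longer lists are not valid position specifications. A sketch of a variant for arbitrary n that builds explicit {first, last} pairs instead (chunkTake is a hypothetical name, and this is not the answerer's code):

```mathematica
(* chunkTake is a hypothetical name; it builds explicit {first, last} index pairs *)
chunkTake[s_String, n_Integer?Positive] :=
  StringTake[s,
    Table[{i, Min[i + n - 1, StringLength[s]]}, {i, 1, StringLength[s], n}]];

chunkTake["abcdefghi", 2]   (* {"ab", "cd", "ef", "gh", "i"} *)
chunkTake["abcdefghi", 4]   (* {"abcd", "efgh", "i"} *)
```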

answered May 17 '12 at 00:20

user1066

score 12 · Answer 4 · edited Nov 24 '15 at 07:20

Not completely original, but very compact.

chunk[s_, n_] := StringJoin@@@Partition[Characters[s], n, n, 1, {}]

Update for V10.1

This new built-in function does exactly that:

StringPartition["ABCDEF",2]

{"AB", "CD", "EF"}

edited Nov 24 '15 at 07:20

J. M.'s missing motivation

answered Jan 22 '13 at 00:00

Murta

score 8 · Answer 5 · edited Jun 16 '20 at 09:23

This will give better performance (3 times faster in my test, partitioning into length-two strings) than your original code:

chunk[s_, n_] := FromCharacterCode@Partition[ToCharacterCode[s], n]

The reason is that the first few steps of the computation are done with packed arrays.
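To see this for yourself, you can inspect the intermediate results with Developer`PackedArrayQ. The following is only an illustrative check on a short string; whether each step reports a packed array may depend on your version, so no results are claimed for those two lines:

```mathematica
(* Illustrative check only; packing of each intermediate may depend on version *)
codes = ToCharacterCode["ABCDEFGH"];
Developer`PackedArrayQ[codes]                 (* is the list of character codes packed? *)
Developer`PackedArrayQ[Partition[codes, 2]]   (* does Partition keep it packed? *)
FromCharacterCode[Partition[codes, 2]]        (* {"AB", "CD", "EF", "GH"} *)
```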

It will still be slower than the regex-based approaches (István's and Jens's), on my machine by a factor of 2.

The StringTake approach is much slower than all the others on my machine.

Benchmarks

Function definitions:

(* original *)
chunk1[s_, n_] := StringJoin[#] & /@ Partition[Characters[s], n]
(* István *)
chunk2[s_, n_] := StringCases[s, Repeated[_, {n}]]
(* Jens *)
chunk3[s_, n_] := StringCases[s, RegularExpression[".{1," <> ToString[n] <> "}"]]
(* TomD *)
chunk4 = StringTake[#, Partition[Range@StringLength@#, #2, #2, 1, {}]] &;
(* mine *)
chunk5[s_, n_] := FromCharacterCode@Partition[ToCharacterCode[s], n]
text = ExampleData[{"Text", "Hamlet"}];
testString = StringJoin[ConstantArray[text, 20]];
StringLength[testString] (* 3438740 *)

Timings:

(* original *)
In[10]:= Timing[chunk1[testString, 2];]
         Timing[chunk1[testString, 100];]
Out[10]= {5.968, Null}
Out[11]= {1.703, Null}
(* István - fastest *)
In[12]:= Timing[chunk2[testString, 2];]
         Timing[chunk2[testString, 100];]
Out[12]= {1.25, Null}
Out[13]= {0.11, Null}
(* Jens - fastest *)
In[14]:= Timing[chunk3[testString, 2];]
         Timing[chunk3[testString, 100];]
Out[14]= {1.313, Null}
Out[15]= {0.125, Null}
(* TomD *)
In[16]:= Timing[chunk4[testString, 2];]
         Timing[chunk4[testString, 100];]
(* More than a few minutes. Didn't wait for it to finish ... *)
(* mine *)
In[18]:= Timing[chunk5[testString, 2];]
         Timing[chunk5[testString, 100];]
Out[18]= {2.25, Null}
Out[19]= {0.266, Null}

Conclusion: use regex-based methods. The built-in string patterns also use a regex library internally, I believe, but they are easier to construct programmatically because they are represented as expressions.

edited Jun 16 '20 at 09:23

Community

answered May 17 '12 at 09:36

Szabolcs

Scrollbar ate your timing results... First I thought it was so lightning fast you did not bother to write it out in numbers :) – István Zachar May 17 '12 at 09:54
Thanks, Szabolcs, and everyone who answered. As ever, more than one way to skin a cat with Mathematica. I'm a bit surprised that there isn't a StringPartition function taking the same arguments as Partition as a built-in, analogous to StringTake vs. Take. – David G May 17 '12 at 13:23
+1, I did not know about that behavior of ToCharacterCode and FromCharacterCode. It makes splitting a string into individual characters a lot simpler. – rcollyer May 17 '12 at 13:38
Your code can be further sped up by using Developer`PartitionMap to apply FromCharacterCode to each term in the list. This does not unpack the list. The speed up is marginal, though, on my machine 1.25793 -> 1.12617 and 0.135588 -> 0.095252. So, still not a contender for the fastest method. – rcollyer May 17 '12 at 16:38
@rcollyer Good point! I never used Developer`PartitionMap before. (I've seen it in the docs, but I didn't realize its significance: not unpacking.) – Szabolcs May 17 '12 at 16:40
I really would like to see the internals of the Developer` package. Mostly to see how one goes about avoiding unpacking. That, and I'd like a variant of that that works with Internal`PartitionRagged. – rcollyer May 17 '12 at 16:43
@rcollyer It should be possible to write something like PartitionMap using LibraryLink. I have to admit though I never manipulated general expressions with it (only numerical tensors). The idea would be not to map first, evaluate later (i.e. Sqrt /@ {4,5} -> {Sqrt[4], Sqrt[5]} -> {2, Sqrt[5]}), but evaluate first, then insert into the (preallocated) array. – Szabolcs May 17 '12 at 16:48
I had not thought of using LibraryLink to do that sort of processing ... – rcollyer May 17 '12 at 16:51
@rcollyer Actually, what I described could be implemented efficiently in Mathematica, I think. The difficulty would be to keep the result array packed as well. So I started experimenting, and found something I don't really understand: if Map maps first, and leaves the result to be evaluated later, as my example was meant to demonstrate above, then why is Developer`PackedArrayQ[Sqrt /@ N@Range[100]] === True? – Szabolcs May 17 '12 at 16:55
@rcollyer I get the same behaviour with Developer`PackedArrayQ[Identity /@ N@Range[100]]---no unpacking. But I can't manage to create a function myself that I could use in place of Identity or Sqrt in this example and it would not unpack. Does Map simply special case all these? I understand they would special-case Sqrt, but why Identity? – Szabolcs May 17 '12 at 16:58
It is length based. Try using Range[50] instead. – rcollyer May 17 '12 at 17:07
@rcollyer I solved the 'mystery': it was auto-compilation. It avoids unpacking only if the list is above the auto-compilation length. – Szabolcs May 17 '12 at 17:08
@rcollyer Your comment was not yet there when I started writing mine, but it appeared when I finished :-) – Szabolcs May 17 '12 at 17:08

Dr. belisarius · Answer 6 · 2015-11-24T12:59:33.677

6

I wanted a solution using StringSplit[]

chunk[s_, n_] := DeleteCases[
  StringSplit[s, RegularExpression["(.{" <> ToString@n <> "})"] -> "$1"], ""]

edited Nov 24 '15 at 12:59

answered Oct 03 '12 at 19:10

Dr. belisarius

Cleaner, IMO: StringSplit[s, x : Repeated[_, {n}] :> x][[;; ;; 2]]. I don't see an advantage over StringCases however. – Mr.Wizard May 04 '14 at 11:05
@Mr.Wizard No advantage. I just wanted a way to use StringSplit[] as said. But I'm too old to remember why I wanted that. – Dr. belisarius May 05 '14 at 03:49

score 4 · Answer 7 · answered Mar 23 '24 at 14:19

Historical note:

WolframLanguageData["StringPartition"
 , {"VersionIntroduced", "DateIntroduced"}]

{10.1, DateObject[{2015, 3, 30}, "Day", "Gregorian", 5.]}

The command supports UpTo as well as offset specs, similar to Partition.

str = "ABCDEFGHIJK";
StringPartition[str, UpTo[2]]

{"AB", "CD", "EF", "GH", "IJ", "K"}

StringPartition[str, 2, 1]

{"AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK"}

StringPartition[str, UpTo[7]]

{"ABCDEFG", "HIJK"}

score 4 · Answer 8 · answered Mar 23 '24 at 15:39

Using SequenceCases (new in 10.1)

str = "ABCDEFGHIJK";

SequenceCases[Characters[str], x : {Repeated[_, {2}]} :> StringJoin[x]]

{"AB", "CD", "EF", "GH", "IJ"}

SequenceCases[Characters[str], x : {Repeated[_, {1, 2}]} :> StringJoin[x]]

{"AB", "CD", "EF", "GH", "IJ", "K"}

SequenceCases[Characters[str], x : {Repeated[_, {1, 7}]} :> StringJoin[x]]

{"ABCDEFG", "HIJK"}

SequenceCases[
 Characters[str], 
 x : {Repeated[_, {2}]} :> StringJoin[x], 
 Overlaps -> True]

{"AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK"}

SequenceCases[
 Characters[str], 
 x : {Repeated[_, {1, 2}]} :> StringJoin[x], 
 Overlaps -> True]

{"AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "K"}

score 1 · Answer 9 · answered Mar 24 '24 at 00:23

str = "ABCDEFGHIJK";

Using MovingMap:

MovingMap["" <> # &, Characters[str], 1][[1 ;; -1 ;; 2]]

{"AB", "CD", "EF", "GH", "IJ"}

answered Mar 24 '24 at 00:23

E. Chan-López
