Extract all unique characters and number of their appearance from string?

Question

Consider a string containing ASCII characters (letters, digits, symbols). For example:

mystr="AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

How do I most conveniently break this string into a list of the unique characters it contains, along with the number of appearances of each character?

How about StringPartition and Tally? – bill s Feb 13 '17 at 02:40

WReach · Accepted Answer · 2017-02-14T22:21:26.133

Given:

mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

From version 10 onward, we can get a count of occurrences for each character like this:

count = mystr // Characters // Counts

(* <| "A" -> 1, "H" -> 3, "j" -> 2, "k" -> 3, "s" -> 3, "l" -> 1,
      "n" -> 2, "K" -> 2, "J" -> 2, "B" -> 1, "S" -> 1, "E" -> 2,
      "W" -> 1, "-" -> 1, "Q" -> 1, "L" -> 1, "M" -> 1, " " -> 1,
      "+" -> 1, "d" -> 1, "f" -> 1, "b" -> 1, "G" -> 1, "m" -> 1 |> *)

If we want these ordered by count, we can sort (Sort orders an association by its values):

count // Sort

(* <| "-" -> 1, "+" -> 1, " " -> 1, "A" -> 1, "b" -> 1, "B" -> 1,
      "d" -> 1, "f" -> 1, "G" -> 1, "l" -> 1, "L" -> 1, "m" -> 1,
      "M" -> 1, "Q" -> 1, "S" -> 1, "W" -> 1, "E" -> 2, "j" -> 2,
      "J" -> 2, "K" -> 2, "n" -> 2, "H" -> 3, "k" -> 3, "s" -> 3 |> *)

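Other orderings are sometimes wanted as well. The following is a minimal sketch, not part of the original answer, of how the same association could be reordered or reduced to just the unique characters; the names uniqueChars, byCharacter and byCountDescending are illustrative only.

```wolfram
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";
count = Counts[Characters[mystr]];

(* illustrative name: the unique characters only, in order of first appearance *)
uniqueChars = Keys[count];

(* illustrative name: order by character (key) instead of by count *)
byCharacter = KeySort[count];

(* illustrative name: order by count, largest first *)
byCountDescending = Sort[count, Greater];
```

Because Counts returns an association, each of these keeps the character-to-count pairing intact, so lookups such as byCharacter["H"] still work.
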
We can also use count to obtain the counts of individual characters:

count["A"]
(* 1 *)

count["H"]
(* 3 *)
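
One detail worth knowing when querying the association: a character that does not occur in the string is not reported as 0. A small sketch, assuming the same count as above, of how Lookup with a default value could be used instead:

```wolfram
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";
count = Counts[Characters[mystr]];

(* a character that does not occur gives Missing[...] rather than 0 *)
count["z"]
(* Missing["KeyAbsent", "z"] *)

(* Lookup with a default returns 0 for absent characters *)
Lookup[count, "z", 0]
(* 0 *)

(* Lookup also accepts a list of keys *)
Lookup[count, {"H", "z", "E"}, 0]
(* {3, 0, 2} *)
```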

Prior to Version 10

Prior to version 10, we can use Tally and Rule to get the same effect:

count2 = mystr // Characters // Tally  // Rule @@@ # &

(* { "A" -> 1, "H" -> 3, "j" -> 2, "k" -> 3, "s" -> 3, "l" -> 1,
     "n" -> 2, "K" -> 2, "J" -> 2, "B" -> 1, "S" -> 1, "E" -> 2,
     "W" -> 1, "-" -> 1, "Q" -> 1, "L" -> 1, "M" -> 1, " " -> 1,
     "+" -> 1, "d" -> 1, "f" -> 1, "b" -> 1, "G" -> 1, "m" -> 1 } *)

"E" /. count2
(* 2 *)

CharacterCounts (Version 10.1+)

As @yode points out in a comment, version 10.1 introduced the helper function LetterCounts. It also introduced CharacterCounts:

mystr // CharacterCounts

(* <| "s" -> 3, "k" -> 3, "H" -> 3, "n" -> 2, "K" -> 2, "J" -> 2,
      "j" -> 2, "E" -> 2, "W" -> 1, "S" -> 1, "Q" -> 1, "M" -> 1,
      "m" -> 1, "L" -> 1, "l" -> 1, "G" -> 1, "f" -> 1, "d" -> 1,
      "B" -> 1, "b" -> 1, "A" -> 1, " " -> 1, "+" -> 1, "-" -> 1 |> *)

Note that the result is sorted in descending order by frequency. The documentation does not mention this, but inspecting the definition of CharacterCounts shows that it is intentional. CharacterCounts also supports the IgnoreCase option and can count n-grams:

CharacterCounts[mystr, 2, IgnoreCase -> True]

(* <| "ah" -> 1, "hj" -> 1, "jk" -> 1, "ks" -> 1, "sh" -> 1,
      "hs" -> 1, "sl" -> 1, "lk" -> 1, "kn" -> 1, "nh" -> 1,
      "hk" -> 1, "kj" -> 3, "jb" -> 2, "bs" -> 1, "se" -> 1,
      "ew" -> 1, "w-" -> 1, "-q" -> 1, "ql" -> 1, "lm" -> 1,
      "m " -> 1, " +" -> 1, "+s" -> 1, "sd" -> 1, "df" -> 1,
      "fk" -> 1, "bk" -> 1, "jg" -> 1, "ge" -> 1, "em" -> 1,
      "mn" -> 1 |> *)

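To make the n-gram counting concrete, here is a rough sketch of the same 2-gram tally built from Characters, Partition and Counts. It is an illustration of the idea, not the actual definition of CharacterCounts; pairs and bigramCounts are illustrative names.

```wolfram
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

(* overlapping windows of 2 characters with offset 1, after lowercasing *)
pairs = StringJoin /@ Partition[Characters[ToLowerCase[mystr]], 2, 1];

(* Counts keeps keys in order of first occurrence *)
bigramCounts = Counts[pairs];

bigramCounts["kj"]
(* 3 *)

bigramCounts["jb"]
(* 2 *)
```

This version also runs on version 10.0, where CharacterCounts is not yet available.
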
@yode No, I hadn't. Its sibling function CharacterCounts is particularly relevant. I have added a section that discusses it. Thanks. — WReach, Feb 14 '17 at 22:22

score 6 · Answer 2 · edited Apr 13 '17 at 12:55

An efficient and straightforward approach is to use Characters in combination with Tally:

mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmnm";
counts = Tally@Characters[mystr]

{{"A", 1}, {"H", 3}, {"j", 2}, {"k", 3}, {"s", 3}, {"l", 1}, {"n", 2}, {"K", 2}, {"J", 2}, 
 {"B", 1}, {"S", 1}, {"E", 2}, {"W", 1}, {"-", 1}, {"Q", 1}, {"L", 1}, {"M", 1}, {" ", 1}, 
 {"+", 1}, {"d", 1}, {"f", 1}, {"b", 1}, {"G", 1}, {"m", 2}}
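
The Tally result is a plain list of {character, count} pairs, so ordinary list functions apply to it. A short sketch of some typical follow-up steps; chars and freqs are illustrative names, not part of the original answer:

```wolfram
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmnm";
counts = Tally[Characters[mystr]];

(* split the pairs into a list of characters and a list of counts *)
{chars, freqs} = Transpose[counts];

(* pairs ordered by character *)
Sort[counts]

(* pairs ordered by count, largest first *)
SortBy[counts, -Last[#] &]

(* the most frequent characters *)
Select[counts, Last[#] == Max[freqs] &]
(* {{"H", 3}, {"k", 3}, {"s", 3}} *)
```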

Timing comparison (version 11.0.0):

counts1 = Tally@Characters[#] &;
counts2 = Tally@StringSplit[#, ""] &;
counts3 = Counts@Characters[#] &;
counts4 = Module[{t = Tally[ToCharacterCode[#]]}, 
    Transpose[{FromCharacterCode@Transpose[{t[[All, 1]]}], t[[All, 2]]}]] &;
counts5 = MapAt[FromCharacterCode, Tally[ToCharacterCode[#]], {All, 1}] &;

mystr[n_] := StringJoin@RandomChoice[CharacterRange[" ", "~"], n];

Needs["GeneralUtilities`"]

benchmarks = Benchmark[#, mystr, Array[2^# &, 22, 6], TimeConstraint -> 100] & /@ 
   {counts1, counts2, counts3, counts4, counts5};

For better readability, I use plot markers from my PolygonPlotMarkers package, which tolerate overlapping:

Needs["PolygonPlotMarkers`"]

ListLogLogPlot[benchmarks, ImageSize -> 600, PlotMarkers -> {
   Graphics[{FaceForm[Magenta], EdgeForm[], PolygonMarker["Circle", Offset[7]]}], 
   Graphics[{FaceForm[], EdgeForm[{Brown, Opacity[1], AbsoluteThickness[2]}], 
     PolygonMarker["Square", Offset[8]]}], 
   Graphics[{FaceForm[RGBColor[0.34, 0.4, 0.62]], EdgeForm[], 
     PolygonMarker["TripleCross", Offset[7]]}], 
   Graphics[{FaceForm[Red], EdgeForm[], PolygonMarker["FourPointedStar", Offset[9]]}], 
   Graphics[{FaceForm[Blue], EdgeForm[None], PolygonMarker["DiagonalCross", Offset[7]]}]},
  PlotTheme -> "Detailed", BaseStyle -> FontSize -> 16,
  PlotLegends -> Placed[PointLegend[{"counts1", "counts2", "counts3", "counts4", "counts5"}, 
    LegendMarkerSize -> 15,
    LegendFunction -> (Framed[#, Background -> White, RoundingRadius -> 7] &)], Scaled[{.85, .3}]]]

Nasser · Answer 3 · score 4 · 2017-02-13T03:14:06.207

Is this what you mean?

mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmnm"; 
lst = StringSplit[mystr, ""];
Tally@lst

edited Feb 13 '17 at 03:14

answered Feb 13 '17 at 02:43

If you Union prior to Tally the count will always be one for each character. Try lst = Union@Tally@StringSplit[mystr, ""] – Bob Hanlon Feb 13 '17 at 02:58
@BobHanlon thank you, I was making this clarification while you were writing the comment. I added 2 cases. But the question asked for list of unique characters contained in the string along with the number of appearance of each character and so I first did exactly as it asked :) – Nasser Feb 13 '17 at 03:01
I still think you want Union@Tally@lst or more simply just Sort@Tally@lst since the Union is just sorting because the first part of the elements of the Tally are distinct. – Bob Hanlon Feb 13 '17 at 03:07

score 4 · Answer 4 · answered Feb 13 '17 at 04:55

This should be quite a bit faster than solutions posted so far, and I'd guess order(s) of magnitude faster for large strings:

counts=Module[{t = Tally[ToCharacterCode[#]]}, 
              Transpose[{FromCharacterCode@Transpose[{t[[All, 1]]}],t[[All, 2]]}]]&;

Use:

results=counts@mystr
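
Before comparing speed, it is worth confirming that the character-code route returns the same pairs as Tally on Characters. The following is a hedged sketch of such a check followed by a rough timing. The names viaCodes, viaCharacters and big are illustrative, the test string size is an arbitrary assumption, and no particular timing result is implied.

```wolfram
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

(* illustrative names for the two approaches being compared *)
viaCodes = Module[{t = Tally[ToCharacterCode[#]]},
    Transpose[{FromCharacterCode@Transpose[{t[[All, 1]]}], t[[All, 2]]}]] &;
viaCharacters = Tally@Characters[#] &;

(* both approaches should give identical {character, count} pairs *)
viaCodes[mystr] === viaCharacters[mystr]
(* True *)

(* rough timing on a longer random string of printable ASCII; size chosen arbitrarily *)
big = StringJoin@RandomChoice[CharacterRange[" ", "~"], 10^6];
First@AbsoluteTiming[viaCodes[big];]
First@AbsoluteTiming[viaCharacters[big];]
```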

answered Feb 13 '17 at 04:55 · ciao

MapAt[FromCharacterCode, Tally[ToCharacterCode[#]], {All, 1}] & gives almost the same performance (version 11.0.0). – Alexey Popkov Feb 13 '17 at 05:11
@AlexeyPopkov: Yep, pretty close. – ciao Feb 13 '17 at 05:28
