4

Consider a string containing ASCII characters (letters, digits, symbols). For example:

mystr="AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

How can I most conveniently break this string into a list of the unique characters it contains, along with the number of occurrences of each character?

Kagaratsch

4 Answers

11

Given:

mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

From version 10 onward, we can get a count of occurrences for each character like this:

count = mystr // Characters // Counts

(* <| "A" -> 1, "H" -> 3, "j" -> 2, "k" -> 3, "s" -> 3, "l" -> 1,
      "n" -> 2, "K" -> 2, "J" -> 2, "B" -> 1, "S" -> 1, "E" -> 2,
      "W" -> 1, "-" -> 1, "Q" -> 1, "L" -> 1, "M" -> 1, " " -> 1,
      "+" -> 1, "d" -> 1, "f" -> 1, "b" -> 1, "G" -> 1, "m" -> 1 |> *)

If we want these ordered by count, we can sort (Sort orders an association by its values):

count // Sort

(* <| "-" -> 1, "+" -> 1, " " -> 1, "A" -> 1, "b" -> 1, "B" -> 1,
      "d" -> 1, "f" -> 1, "G" -> 1, "l" -> 1, "L" -> 1, "m" -> 1,
      "M" -> 1, "Q" -> 1, "S" -> 1, "W" -> 1, "E" -> 2, "j" -> 2,
      "J" -> 2, "K" -> 2, "n" -> 2, "H" -> 3, "k" -> 3, "s" -> 3 |> *)

We can also use count to obtain the counts of individual characters:

count["A"]
(* 1 *)

count["H"]
(* 3 *)
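
One caveat with this lookup form: a character that does not occur in the string gives Missing rather than 0. A minimal sketch, assuming count is built as above, uses Lookup with a default value instead:

```mathematica
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";
count = Counts[Characters[mystr]];

(* direct lookup of an absent character *)
count["z"]
(* Missing["KeyAbsent", "z"] *)

(* Lookup with an explicit default of 0 *)
Lookup[count, "z", 0]
(* 0 *)

(* several characters at once *)
Lookup[count, {"H", "z"}, 0]
(* {3, 0} *)
```

Whether 0 is the right default depends on what the counts are used for; it is only a convenient choice here.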

Prior to Version 10

Prior to version 10, we can use Tally and Rule to get the same effect:

count2 = mystr // Characters // Tally // Rule @@@ # &

(* { "A" -> 1, "H" -> 3, "j" -> 2, "k" -> 3, "s" -> 3, "l" -> 1,
     "n" -> 2, "K" -> 2, "J" -> 2, "B" -> 1, "S" -> 1, "E" -> 2,
     "W" -> 1, "-" -> 1, "Q" -> 1, "L" -> 1, "M" -> 1, " " -> 1,
     "+" -> 1, "d" -> 1, "f" -> 1, "b" -> 1, "G" -> 1, "m" -> 1 } *)

"E" /. count2
(* 2 *)

CharacterCounts (Version 10.1+)

As @yode points out in a comment, version 10.1 introduced the helper function LetterCounts. It also introduced CharacterCounts:

mystr // CharacterCounts

(* <| "s" -> 3, "k" -> 3, "H" -> 3, "n" -> 2, "K" -> 2, "J" -> 2,
      "j" -> 2, "E" -> 2, "W" -> 1, "S" -> 1, "Q" -> 1, "M" -> 1,
      "m" -> 1, "L" -> 1, "l" -> 1, "G" -> 1, "f" -> 1, "d" -> 1,
      "B" -> 1, "b" -> 1, "A" -> 1, " " -> 1, "+" -> 1, "-" -> 1 |> *)

Note that the result is sorted in descending order by frequency. The documentation does not mention this fact, but an inspection of the definition of CharacterCounts reveals that it is intentional. CharacterCounts also supports the IgnoreCase option and counting n-grams:

CharacterCounts[mystr, 2, IgnoreCase -> True]

(* <| "ah" -> 1, "hj" -> 1, "jk" -> 1, "ks" -> 1, "sh" -> 1,
      "hs" -> 1, "sl" -> 1, "lk" -> 1, "kn" -> 1, "nh" -> 1,
      "hk" -> 1, "kj" -> 3, "jb" -> 2, "bs" -> 1, "se" -> 1,
      "ew" -> 1, "w-" -> 1, "-q" -> 1, "ql" -> 1, "lm" -> 1,
      "m " -> 1, " +" -> 1, "+s" -> 1, "sd" -> 1, "df" -> 1,
      "fk" -> 1, "bk" -> 1, "jg" -> 1, "ge" -> 1, "em" -> 1,
      "mn" -> 1 |> *)
WReach
6

An efficient and straightforward approach is to use Characters in combination with Tally:

mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmnm";
counts = Tally@Characters[mystr]
{{"A", 1}, {"H", 3}, {"j", 2}, {"k", 3}, {"s", 3}, {"l", 1}, {"n", 2}, {"K", 2}, {"J", 2}, 
 {"B", 1}, {"S", 1}, {"E", 2}, {"W", 1}, {"-", 1}, {"Q", 1}, {"L", 1}, {"M", 1}, {" ", 1}, 
 {"+", 1}, {"d", 1}, {"f", 1}, {"b", 1}, {"G", 1}, {"m", 2}}
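
As a sanity check on the tally, and to reorder it if needed, something like the following may be useful. This is only a sketch, assuming mystr and counts as defined just above:

```mathematica
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmnm";
counts = Tally@Characters[mystr];

(* the first column holds the unique characters, in order of first appearance *)
counts[[All, 1]] === DeleteDuplicates[Characters[mystr]]
(* True *)

(* the counts add up to the length of the string *)
Total[counts[[All, 2]]] == StringLength[mystr]
(* True *)

(* reorder by character, or by descending count *)
Sort[counts]
SortBy[counts, -Last[#] &]
```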

Timing comparison (version 11.0.0):

counts1 = Tally@Characters[#] &;
counts2 = Tally@StringSplit[#, ""] &;
counts3 = Counts@Characters[#] &;
counts4 = Module[{t = Tally[ToCharacterCode[#]]}, 
    Transpose[{FromCharacterCode@Transpose[{t[[All, 1]]}], t[[All, 2]]}]] &;
counts5 = MapAt[FromCharacterCode, Tally[ToCharacterCode[#]], {All, 1}] &;

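ClearAll[mystr]; (* mystr was assigned a string above; clear it so that mystr[n_] can be defined *)
mystr[n_] := StringJoin@RandomChoice[CharacterRange[" ", "~"], n];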

Needs["GeneralUtilities`"]

benchmarks = Benchmark[#, mystr, Array[2^# &, 22, 6], TimeConstraint -> 100] & /@ 
   {counts1, counts2, counts3, counts4, counts5};

For better readability, I use plot markers from my PolygonPlotMarkers package, which tolerate overlapping:

Needs["PolygonPlotMarkers`"]

ListLogLogPlot[benchmarks, ImageSize -> 600, PlotMarkers -> {
   Graphics[{FaceForm[Magenta], EdgeForm[], PolygonMarker["Circle", Offset[7]]}], 
   Graphics[{FaceForm[], EdgeForm[{Brown, Opacity[1], AbsoluteThickness[2]}], 
     PolygonMarker["Square", Offset[8]]}], 
   Graphics[{FaceForm[RGBColor[0.34, 0.4, 0.62]], EdgeForm[], 
     PolygonMarker["TripleCross", Offset[7]]}], 
   Graphics[{FaceForm[Red], EdgeForm[], PolygonMarker["FourPointedStar", Offset[9]]}], 
   Graphics[{FaceForm[Blue], EdgeForm[None], PolygonMarker["DiagonalCross", Offset[7]]}]},
  PlotTheme -> "Detailed", BaseStyle -> FontSize -> 16,
  PlotLegends -> Placed[PointLegend[{"counts1", "counts2", "counts3", "counts4", "counts5"}, 
    LegendMarkerSize -> 15,
    LegendFunction -> (Framed[#, Background -> White, RoundingRadius -> 7] &)], Scaled[{.85, .3}]]]

[Figure: log-log plot of the benchmark timings for counts1 through counts5]

Alexey Popkov
4

Is this what you mean?

mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmnm"; 
lst = StringSplit[mystr, ""];
Tally@lst


Nasser
  • If you apply Union before Tally, the count will always be one for each character. Try lst = Union@Tally@StringSplit[mystr, ""] – Bob Hanlon Feb 13 '17 at 02:58
  • @BobHanlon thank you, I was making this clarification while you were writing the comment. I added 2 cases. But the question asked for list of unique characters contained in the string along with the number of appearance of each character and so I first did exactly as it asked :) – Nasser Feb 13 '17 at 03:01
  • I still think you want Union@Tally@lst or, more simply, Sort@Tally@lst: Union only sorts here, because the first parts of the Tally elements are distinct. – Bob Hanlon Feb 13 '17 at 03:07
4

This should be quite a bit faster than solutions posted so far, and I'd guess order(s) of magnitude faster for large strings:

counts = Module[{t = Tally[ToCharacterCode[#]]},
    Transpose[{FromCharacterCode@Transpose[{t[[All, 1]]}], t[[All, 2]]}]] &;

Use:

results = counts@mystr
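
To check that this character-code route returns the same result as a plain Tally of Characters, the two can be compared directly. A sketch, assuming the string from the question; the comparison is expected to give True, since both approaches keep the characters in order of first appearance:

```mathematica
mystr = "AHjksHslknHKJBSEW-QLM +sdfkjbKJGEmn";

(* tally the integer character codes, then convert the codes back to strings *)
counts = Module[{t = Tally[ToCharacterCode[#]]},
    Transpose[{FromCharacterCode@Transpose[{t[[All, 1]]}], t[[All, 2]]}]] &;

(* compare against the straightforward version *)
counts[mystr] === Tally[Characters[mystr]]
```

Any speed advantage would come from tallying integers rather than strings; how large it is depends on the version and the input.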
ciao