4

Let $s_a$ and $s_b$ be two strings over a $q$-ary alphabet. For each character $(c_1,...c_q)$ in the alphabet, I'd like to return a pair of values $(r,k)$ where $r$ represents the count for the character in $s_a$ and $k$ represents the number of instances where this character is in the same position in $s_a$ and $s_b$.

For example, if we have two binary strings:

sa="00011001";
sb="11001001";

For the character $0$, we'd return the value $(r_0,k_0) = (5,3)$. For the character $1$, we'd return $(r_1,k_1) = (3,2)$.

What is the fastest way to carry out these string operations? Should we be using arrays instead of strings?

user11875
  • 43
  • 2

4 Answers4

3

This does the trick:

s1 = {0, 0, 0, 1, 1, 0, 0, 1};
s2 = {1, 1, 0, 0, 1, 0, 0, 1};
alphabet = {0, 1};

With[{ss = Transpose[{#1, #2}], s1 = #1, s2 = #2, alp = #3},
{#, {Count[s1, #], Count[ss, {#, #}]}} & /@ 
alp] &[s1, s2, alphabet]

(* {{0, {5, 3}}, {1, {3, 2}}} *)
ciao
  • 25,774
  • 2
  • 58
  • 139
  • @user11875: Thanks for accept, might want to wait for other ideas in the future. Also, updated my post to use lists matching your example (I chose lists as a better generalization) and to fix a gotcha with numeric strings (equal threads, so matches get collapsed), and made it into a function, so just fill in the blanks for s1,s2, and alphabet at the end. – ciao Jan 21 '14 at 03:53
3

For one-at-a-time queries :-

sa = "00011001";
sb = "11001001";

pairs = Transpose[{a, b} =
    ToCharacterCode /@ {sa, sb}];

f = Function[x, {Count[a, x],
      Count[pairs, {x, x}]}]@
    First@ToCharacterCode@# &;

f["0"]

{5, 3}

Chris Degnen
  • 30,927
  • 2
  • 54
  • 108
  • Clean. I'd way over-thought this, ended up with something similar to do all-at-once. +1 – ciao Jan 21 '14 at 04:16
3

I would do it by converting the strings to numbers, then using fast numeric functions BitXor, Unitize, etc.

A definition for one-at-a-time applications:

rkfn[a_String, b_String, x_String] /; Equal @@ StringLength /@ {a, b} :=
  Module[{na, nb, nx, f},
    {na, nb, {nx}} = ToCharacterCode[{a, b, x}];
    f[z_, c_: 1] := Subtract[1, Unitize @ BitXor[z, c na]];
    Tr /@ {f[nx], f[nx, f[nb]]}
  ]

And one for all counts at once with somewhat better efficiency:

rkfn[a_String, b_String] /; Equal @@ StringLength /@ {a, b} :=
  Module[{na, nb, ut, xab},
    {na, nb} = ToCharacterCode[{a, b}];
    ut = Subtract[1, Unitize @ BitXor[##]] &;
    xab = na * ut[na, nb];
    Table[
      {FromCharacterCode @ nx, Tr @ ut[nx, na], Tr @ ut[nx, xab]},
      {nx, Union[na, nb]}
    ]
  ]

Test:

{sa, sb} = {"abbabdaaac", "ababbaaccc"};

rkfn[sa, sb, "b"]
{3, 2}
rkfn[sa, sb]
{{"a", 5, 2}, {"b", 3, 2}, {"c", 1, 1}, {"d", 1, 0}}

This method is very fast; comparing two strings of 15 million characters each:

{sa, sb} = StringJoin /@ RandomChoice[CharacterRange["a", "z"], {2, 15*^6}];

rkfn[sa, sb, "j"] // Timing
{0.312, {576135, 22210}}
rkfn[sa, sb] // Timing // First  (* timing for full alphabet *)
6.162
Mr.Wizard
  • 271,378
  • 34
  • 587
  • 1,371
0
s1 = {0, 0, 0, 1, 1, 0, 0, 1};
s2 = {1, 1, 0, 0, 1, 0, 0, 1};

MapThread[
  Append,
   {
    Tally @ s1,
    Lookup[Counts @ Transpose[{s1, s2}], {{0, 0}, {1, 1}}]
   }]

{{0, 5, 3}, {1, 3, 2}}

eldo
  • 67,911
  • 5
  • 60
  • 168