13

I have the following two lists:

ab = {1, -1, -1, -1, 1, 1};
ac = {1, -1, -1,  1, 1, 1};

How can I find the difference (more precisely, the edit distance) between them? In this case the result should be 1, since ab and ac differ in exactly one element.

Note: in my case the list elements only take the values 1 and -1, and both lists are one-dimensional and of the same length, but it is always nice to see more general solutions (elements are Reals, lists are matrices, etc.).

Thank you.

Artes
sky-light

6 Answers

22

There is an appropriate metric for this:

HammingDistance[ab, ac]
1

One could also use DamerauLevenshteinDistance, but in general it yields different results because it also counts transpositions, insertions, and deletions:

DamerauLevenshteinDistance[ab, ac]
1
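
To see where the two measures part ways, here is a small illustrative sketch. The lists u and v are made up for this example and do not come from the question: swapping two adjacent elements counts as two mismatches for HammingDistance but as a single transposition for DamerauLevenshteinDistance.

```mathematica
(* Hypothetical example lists: v is u with its first two elements swapped *)
u = {1, 2, 3};
v = {2, 1, 3};

HammingDistance[u, v]             (* 2: positions 1 and 2 differ *)
DamerauLevenshteinDistance[u, v]  (* 1: one adjacent transposition *)
```

For equal-length lists that differ only by substitutions, as in the question, both functions agree.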
Artes
  • Strangely, HammingDistance doesn't work for mixed data, i.e. lists containing both Integers and Symbols. – Pankaj Sejwal Sep 06 '13 at 16:20
  • @Blackbird What problems have you encountered? What should HammingDistance[{3, 2, a, 10, b, 5, 11, c}, {3, 2, a, 11, b, Pi, 11, 3}] return? 3, shouldn't it? – Artes Sep 06 '13 at 16:50
  • I am sorry, I read the definition the wrong way. My bad, you are correct. – Pankaj Sejwal Sep 06 '13 at 18:05
  • @Blackbird Note that HammingDistance also works for strings (where the IgnoreCase option is useful), unlike the other methods. – Artes Sep 06 '13 at 18:31
16

For Integer data we could also write:

 Tr @ Unitize @ BitXor[ab, ac]
1

For Real data we can use the slightly slower but also shorter:

Tr @ Unitize[ab - ac]

Blackbird challenged me to provide a method that works on all input types. My approach is to select between methods depending on data.

diff[a__?(VectorQ[#, IntegerQ] &)] := Tr @ Unitize @ BitXor @ a
diff[a__?(VectorQ[#, NumericQ] &)] := Tr @ Unitize @ Subtract @ a
diff[a_, b_] := HammingDistance[a, b]
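
As a rough illustration of how the dispatch plays out, the three rules can be tried on one input of each kind. The definitions are repeated so the snippet runs on its own; the real-valued and string lists are invented for this example.

```mathematica
(* Definitions repeated from above so the snippet is self-contained *)
ClearAll[diff]
diff[a__?(VectorQ[#, IntegerQ] &)] := Tr @ Unitize @ BitXor @ a
diff[a__?(VectorQ[#, NumericQ] &)] := Tr @ Unitize @ Subtract @ a
diff[a_, b_] := HammingDistance[a, b]

diff[{1, -1, -1, -1, 1, 1}, {1, -1, -1, 1, 1, 1}]  (* all integers: BitXor rule, gives 1 *)
diff[{1., 2.5, 3.}, {1., 2., 3.}]                  (* reals: Subtract rule, gives 1 *)
diff[{"a", "b", "c"}, {"a", "x", "c"}]             (* not numeric: HammingDistance rule, gives 1 *)
```

The pattern test in a__?(...) is applied to each argument separately, so a rule fires only when both lists pass the check.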

Timings for some of the methods posted so far (search the site for timeAvg):

{ab, ac} = List @@ RandomInteger[2, {2, 250000}];  (* List @@ to prevent unpacking *)

HammingDistance[ab, ac]                  // timeAvg
Count[MapThread[Equal, {ab, ac}], False] // timeAvg
Tr @ Unitize @ BitXor[ab, ac]            // timeAvg
diff[ab, ac]                             // timeAvg

0.009984

0.05428

0.0005488

0.0005744

Now with Real data:

{ab, ac} = N /@ {ab, ac};

HammingDistance[ab, ac]                  // timeAvg
Count[MapThread[Equal, {ab, ac}], False] // timeAvg
Tr @ Unitize[ab - ac]                    // timeAvg
diff[ab, ac]                             // timeAvg

0.01872

0.0748

0.00312

0.0021728

(I learned something from this test: Subtract[a,b] is faster than a-b on packed reals.)

Now something unpackable:

{ab, ac} = RandomChoice[CharacterRange["a", "z"], {2, 250000}];

HammingDistance[ab, ac]                  // timeAvg
Count[MapThread[Equal, {ab, ac}], False] // timeAvg
diff[ab, ac]                             // timeAvg

0.005488

0.0524

0.005496
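
timeAvg is not defined in this post. For readers who want to reproduce the comparison without looking it up, one possible stand-in is the built-in RepeatedTiming. This is only a sketch: it assumes Mathematica 10 or later, the name timeAvgSketch is made up here, and it is not the helper that produced the numbers above, so absolute timings will differ.

```mathematica
(* Hypothetical stand-in for timeAvg; assumes RepeatedTiming is available (v10+) *)
ClearAll[timeAvgSketch]
SetAttributes[timeAvgSketch, HoldFirst]  (* keep the expression unevaluated until it is timed *)
timeAvgSketch[expr_] := First @ RepeatedTiming[expr]

{ab, ac} = RandomInteger[2, {2, 250000}];

HammingDistance[ab, ac]       // timeAvgSketch
Tr @ Unitize @ BitXor[ab, ac] // timeAvgSketch
```

Each call returns an average time in seconds for the held expression.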

Mr.Wizard
9

You could do this:

ab = {1, -1, -1, -1, 1, 1};
ac = {1, -1, -1, 1, 1, 1};

EditDistance[ab, ac]

which gives a result even if the lists have different lengths.

The documentation says:

EditDistance[u, v] gives the number of one-element deletions, insertions, and substitutions required to transform u to v.
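
A small sketch of how this differs from a position-by-position count. The lists below are invented for illustration: a rotation changes every position, yet needs only two edit operations.

```mathematica
(* Hypothetical lists: v is u rotated left by one position *)
u = {1, 2, 3, 4};
v = {2, 3, 4, 1};

HammingDistance[u, v]  (* 4: every position differs *)
EditDistance[u, v]     (* 2: delete the leading 1, insert 1 at the end *)

(* Lists of unequal length are accepted by EditDistance *)
EditDistance[{1, -1, -1}, {1, -1, -1, 1}]  (* 1: one insertion *)
```

For the two lists in the question, which differ by a single substitution, both measures give 1.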

Mr.Wizard
Stephen Luttrell
  • I'm sorry, I have to down-vote this. EditDistance is not the same thing. Incidentally it is a far more complex measure and will be unusably slow on long vectors. – Mr.Wizard Sep 06 '13 at 17:03
  • The OP said “How I can find the difference between them? Now in our case the result should be 1, since there is one item difference between ab and ac.”, which is such a loose definition of “difference” that EditDistance is a feasible candidate. Though I agree with your comment about complexity. – Stephen Luttrell Sep 07 '13 at 11:40
  • Okay, I have to admit you're right. My impression of the question didn't allow for that but I was wrong. I removed my down-vote (I had to edit to do this). – Mr.Wizard Sep 07 '13 at 11:45
8

A very basic approach:

Count[Equal @@@ Thread[{ab, ac}], False]

1

or perhaps:

Count[MapThread[Equal, {ab, ac}], False]

1

Now if there is only 1 and -1 to watch out for, this will also do (thanks to Aky for pointing out a glaring error):

Plus @@ Abs[ab - ac]/2
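
The division by 2 works because a mismatched pair is either (1, -1) or (-1, 1), so the absolute difference is 2 at each mismatch and 0 elsewhere. A quick check on the lists from the question, with Total used here in place of Plus @@ (both sum the list):

```mathematica
ab = {1, -1, -1, -1, 1, 1};
ac = {1, -1, -1, 1, 1, 1};

Abs[ab - ac]           (* {0, 0, 0, 2, 0, 0} *)
Total[Abs[ab - ac]]/2  (* 1 *)
```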
Yves Klett
7

These results are specific to the case where the data are 1|-1, and may be specific to v6.

<<Developer`
n = 10^7
PackedArrayQ[ a = RandomInteger[1,n]*2 - 1 ]
PackedArrayQ[ b = RandomInteger[1,n]*2 - 1 ]
(* 10000000
   True
   True *)

AbsoluteTiming[ (Length@a - a.b)/2  ]
(* {0.435134, 4998582} *)

AbsoluteTiming[ Tr@Unitize@Subtract[a,b] ]
(* {0.792662, 4998582} *)

AbsoluteTiming[ Tr@Unitize@BitXor[a,b] ]
(* {0.883002, 4998582} *)

aa = FromPackedArray@a;
bb = FromPackedArray@b;

AbsoluteTiming[ (Length@aa - aa.bb)/2  ]
(* {1.373384, 4998582} *)

AbsoluteTiming[ Tr@Unitize@Subtract[aa,bb] ]
(* {1.366143, 4998582} *)

AbsoluteTiming[ Tr@Unitize@BitXor[aa,bb] ]
(* {2.590419, 4998582} *)
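
A short sketch of why the dot-product expression counts mismatches, assuming both vectors have length n and entries in {-1, 1}. Here m and d are names introduced only for this derivation: m is the number of positions where the vectors agree and d the number where they differ.

```latex
\begin{align}
a_i b_i &= \begin{cases} +1 & \text{if } a_i = b_i \\ -1 & \text{if } a_i \neq b_i \end{cases} \\
a \cdot b &= \sum_{i=1}^{n} a_i b_i = m - d \\
n &= m + d \\
d &= \frac{n - a \cdot b}{2}
\end{align}
```

For the lists in the question, n = 6 and ab.ac = 4, so d = (6 - 4)/2 = 1.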
Ray Koopman
  • I get the same rankings in v7. I hadn't realized that Subtract was faster than BitXor. – Mr.Wizard Sep 07 '13 at 11:50
  • Ah, I see now that it's only faster on this data, not general Integer data. Nevertheless +1 well earned for teaching me something. – Mr.Wizard Sep 07 '13 at 11:58
6

I will abuse the fact that the OP hasn't said whether this is a more general question.

This is my solution for the 1/-1 case:

Count[ab ac, -1]
Kuba