Algorithm to compare two large sets

Question

I am a novice in the world of algorithms and ignorant of the terminology, so please pardon me. I have two large sets of numbers, A and B, with $A \subseteq \{x \mid 0 < x < 9999999999\}$ and $B \subseteq \{y \mid 0 < y < 9999999999\}$. Each set has more than a million elements. What is the best algorithm to compute the set differences $A \setminus B$ and $B \setminus A$? The sets are sorted (or should I say ordered?). With my pea-sized brain, the only approach I arrived at is comparing every element of A against every element of B, one after another. Can someone point me to the best way? I would also appreciate answers of the "go and read this textbook first" or "pick up such-and-such textbook first" kind.

When you say A-B and B-A, are you just referring to element-wise differences? — Godric Seer, Nov 13 '13 at 15:25
Yes, element-wise differences. To be unambiguous: with a = {2,4,5,6} and b = {8,9,0,2,3}, what I meant by A-B is {4,5,6}. Just curious to learn: are there any other kinds of difference that I am missing? — user917279, Nov 13 '13 at 15:42
That is very different than an elementwise difference. Elementwise difference is A = {1,2,3}, B = {7,6,5}, then A-B = {6,4,2}. You are looking for the set operations $A \setminus B$ — Godric Seer, Nov 13 '13 at 15:46
So sets A and B are not necessarily of equal length? You seem to be looking for something akin to what Matlab's setdiff does. — horchler, Nov 13 '13 at 16:33
A linear search is the best possible complexity, since an item belonging to $A\backslash B$ (resp. to $B\backslash A$) might appear anywhere in lists $A,B$. As you apparently note, the linear search can be realized if $A,B$ are sorted. (I assume that by referring to them as "sets" you imply there are no repeated elements.) — hardmath, Nov 14 '13 at 14:22
Also, to add to the answers below, have a look at this SO question asking the same thing: http://stackoverflow.com/questions/3252667/how-to-calculate-difference-between-two-sets-in-c — Godric Seer, Nov 14 '13 at 16:03

score 5 · Answer 1 · answered Nov 13 '13 at 21:52

Python's built-in sets are hash tables (CPython implements them with open addressing). Each value is stored at a position derived from its hash rather than in sorted order, so a membership lookup takes constant time on average. As a result, set operations in Python such as the following run, on average, in time roughly proportional to the sizes of the sets:

a = set([1, 2, 3, 4])
b = set([3, 4, 5, 6])  # etc.
a & b  # intersection: {3, 4}
a | b  # union: {1, 2, 3, 4, 5, 6}
a ^ b  # symmetric difference: {1, 2, 5, 6}
a - b  # difference A \ B: {1, 2}
b - a  # difference B \ A: {5, 6}
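
As a minimal sketch of how this might apply to the question, assume the two inputs arrive as Python lists. The names `both_differences`, `a_list`, and `b_list` are hypothetical and chosen only for illustration. Because sets do not keep their elements in sorted order, the results are sorted again at the end:

```python
def both_differences(a_list, b_list):
    """Return (A minus B, B minus A) as sorted lists, using hash-based sets."""
    a, b = set(a_list), set(b_list)
    # Each difference is computed by hash lookups; sorted() restores order.
    return sorted(a - b), sorted(b - a)


a_only, b_only = both_differences([2, 4, 5, 6], [0, 2, 3, 8, 9])
print(a_only)  # [4, 5, 6]
print(b_only)  # [0, 3, 8, 9]
```

This approach does not need sorted input, but it builds two hash tables in memory and pays for a sort of each result.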

Back2Basics

Wasn't sure what you meant by a\b. – Back2Basics Nov 13 '13 at 21:53

rchilton1980 · Answer 2 · 2013-11-13T17:03:40.530

If you're using C++, the standard library supports this directly via std::set_difference:

http://www.cplusplus.com/reference/algorithm/set_difference/

That documentation even includes equivalent "pseudocode" (really, just more C++) that you can use to port the idea to other languages. The algorithm is close to the "merging" step of mergesort. Each call produces one difference, so computing both $A \setminus B$ and $B \setminus A$ takes two calls with the arguments swapped. Note that std::set_difference operates on sorted ranges, not on std::set containers specifically. This is a good thing: it means sorted std::vectors are adequate.
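
To illustrate porting the idea to another language, here is a rough Python sketch of the same merge-style pass over two sorted lists. It is not the library's code; the function name `sorted_difference` and its parameters are hypothetical, and it assumes both inputs are sorted in ascending order.

```python
def sorted_difference(first, second):
    """Elements of sorted list `first` that do not occur in sorted list `second`."""
    result = []
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] < second[j]:
            # first[i] cannot occur later in `second`, so it belongs to the output.
            result.append(first[i])
            i += 1
        else:
            if not (second[j] < first[i]):
                # Neither is smaller, so they are equal: skip the matched element.
                i += 1
            j += 1
    # Anything left in `first` has no counterpart in `second`.
    result.extend(first[i:])
    return result


print(sorted_difference([2, 4, 5, 6], [0, 2, 3, 8, 9]))  # [4, 5, 6]
print(sorted_difference([0, 2, 3, 8, 9], [2, 4, 5, 6]))  # [0, 3, 8, 9]
```

Like the C++ function, the sketch only uses the `<` comparison and visits each element once, so it runs in time linear in the combined length of the inputs.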

hardmath · Answer 3 · 2013-11-14T23:29:35.273

One cannot find the set differences with less than a linear search, as an entry appearing anywhere in either set might belong to $A\backslash B$ or to $B\backslash A$ (or neither).

An algorithm which achieves this limit (on sorted lists of data) is as follows:

Given input sets A,B, each a strictly ascending list. 
Initialize A\B and B\A as empty lists.

Until A or B is empty, compare the heads of both lists
    If the heads are equal, remove them.
    If the head of A precedes the head of B, remove the head from A
       and include it in A\B.
    If the head of B precedes the head of A, remove the head from B
       and include it in B\A.

Now A or B is empty.
Transfer entries left in A to A\B or entries left in B to B\A.
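
The steps above can be sketched in Python as follows. This is an illustrative version under the same assumption (both inputs are strictly ascending lists); the name `two_way_difference` is hypothetical. Instead of physically removing list heads, which is slow for Python lists, it advances two indices.

```python
def two_way_difference(a, b):
    """Return (A minus B, B minus A) for strictly ascending lists a and b."""
    a_minus_b, b_minus_a = [], []
    i = j = 0  # positions of the current "heads" of a and b
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Common element: drop it from both sides.
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Head of a precedes head of b, so it cannot be in b.
            a_minus_b.append(a[i])
            i += 1
        else:
            # Head of b precedes head of a, so it cannot be in a.
            b_minus_a.append(b[j])
            j += 1
    # One list is exhausted; whatever remains in the other is unmatched.
    a_minus_b.extend(a[i:])
    b_minus_a.extend(b[j:])
    return a_minus_b, b_minus_a


print(two_way_difference([2, 4, 5, 6], [0, 2, 3, 8, 9]))
# ([4, 5, 6], [0, 3, 8, 9])
```

Each loop iteration advances at least one index, so the number of comparisons is at most the combined length of the two lists.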

In Prolog, assuming sorted lists A,B of distinct numeric values:

/*  set_differ(A,B,AminusB,BminusA)  */
set_differ([ ],B,[ ],B) :- !.
set_differ(A,[ ],A,[ ]) :- !.
set_differ([H|A],[H|B],AminusB,BminusA) :-
    !,
    set_differ(A,B,AminusB,BminusA).
set_differ([Ha|A],[Hb|B],AminusB,BminusA) :-
    ( Ha < Hb )
      -> ( AminusB = [Ha|A_B], set_differ(A,[Hb|B],A_B,BminusA) )
      ;  ( BminusA = [Hb|B_A], set_differ([Ha|A],B,AminusB,B_A) ) .

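Because every recursive call is the last goal of its clause, a Prolog system that performs last-call optimization can run this without growing the stack, which makes it memory efficient.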
