20

If I do

ClearAll[a, d]
lsts = {{a, d}, {a, d}};
Union[lsts]

I get the expected answer

{{a, d}}

but if I do

lsts = {{a, d}, {d, a}};
Union[lsts]

I get

{{a, d}, {d, a}}

Since I am using Union, I thought the order of the elements within the lists would not matter. To get around this, I now always apply Sort first, like this:

lsts = {{a, d}, {d, a}};
Union[Sort /@ lsts]

and now I get the expected answer

{{a, d}}

Question: Is this the right way to approach this, or do you recommend a better way?

Brett Champion
Nasser

7 Answers

14

Sorting the sub-lists seems unavoidable, since this is what brings them to a "canonical form" in this problem. If you don't care about the order of your resulting sub-lists, you could use DeleteDuplicates in place of Union, though. This should be faster for large lists.
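As a minimal sketch of the difference (the extra symbol b and the sample data are made up for illustration), both calls below remove duplicates after canonicalizing each sub-list, but only Union also reorders the result:

```mathematica
ClearAll[a, b, d];
lsts = {{d, a}, {b, a}, {a, d}, {a, b}};

(* Canonicalize each sub-list, then drop repeats, keeping first-seen order *)
DeleteDuplicates[Sort /@ lsts]   (* {{a, d}, {a, b}} *)

(* Same canonicalization, but the outer list comes back sorted as well *)
Union[Sort /@ lsts]              (* {{a, b}, {a, d}} *)
```

Which of the two is faster on a given data set is best checked with AbsoluteTiming on representative input.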

Leonid Shifrin
  • Good call, I didn't think of that. And in fact DeleteDuplicates allows a custom equality test, so you could only sort the list in the test if you wanted your results in unsorted form. – David Z Jan 17 '12 at 23:06
  • @David Be careful with these equality tests, they often lead to slowdowns which are not immediately obvious. For Union such custom-function-induced slowdowns will be much more severe though, see e.g. this discussion: http://www.mathprogramming-intro.org/book/node290.html – Leonid Shifrin Jan 17 '12 at 23:08
  • True, I should have mentioned that a custom equality test could make the command significantly slower. – David Z Jan 17 '12 at 23:10
  • Hmmm ... the example on mathprogramming-intro.org isn't even an equivalence relation. I'd expect all sorts of strange things to happen in such a case, and especially I'd not expect timings to be in any way representative for the timing you get with functions having the correct semantics. – celtschk Jan 18 '12 at 12:00
  • @celtschk My bad - the first example doesn't hold water indeed. You are actually the first person to point it out - thanks. The other examples further down that page are OK though, I believe, and still illustrate my main point. I also discussed this problem here: http://forums.wolfram.com/mathgroup/archive/2009/Jul/msg00057.html, and in that thread there are other answers with good points. – Leonid Shifrin Jan 18 '12 at 12:12
  • Actually, the best example in that post is the one by J. Siehler where he uses Equal: Here as far as I can tell the exact same operation is done as without SameTest, using directly the built-in function, therefore the difference cannot be in any inefficiency of the test itself. But the run time is a factor of 200 larger. OK, thinking again, without the test it might infer equality by the ordering relation (i.e. consider elements equivalent if neither is smaller than the other), just like the C++ STL routines do. – celtschk Jan 18 '12 at 14:10
  • 1
    @celtschk Whenever any user-defined function is supplied (even when it happens to be a built-in), Union switches to a quadratic-time algorithm based on pairwise comparisons. This is because Union accepts a sameness function, not a comparison function (the existence of the latter is a stronger requirement). When no test is specified, the default SameQ happens to have an accompanying comparison function based on canonical sort, so Union uses n*log n sorting then. I actually explained this in considerable detail in my post in that thread. – Leonid Shifrin Jan 18 '12 at 14:29
  • Ah, thanks, I didn't yet have the time to read that post in detail. – celtschk Jan 18 '12 at 15:01
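The slowdown discussed in these comments can be checked with a rough timing experiment. This is only a sketch: the data size, value range, and seed are arbitrary assumptions, and the actual timings depend on the machine and version.

```mathematica
SeedRandom[1];                          (* arbitrary seed, for repeatability *)
data = RandomInteger[50, {2000, 2}];    (* 2000 random pairs with many repeats *)

(* Default test: Union can rely on canonical sorting *)
AbsoluteTiming[Length @ Union[data]]

(* Explicit SameTest, even the built-in SameQ: pairwise comparisons *)
AbsoluteTiming[Length @ Union[data, SameTest -> SameQ]]
```

Both calls should report the same number of distinct pairs; only the elapsed time is expected to differ.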
12

You can provide a custom SameTest to Union, which lets you take advantage of your knowledge of what should qualify as equal, for example:

In[1] := Union[{{a,d},{d,a}}, SameTest -> (Complement[##] === {}&)]
Out[1] = {{a, d}}
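One caveat worth checking before reusing this test: Complement is not symmetric in its arguments, so the test treats a sub-list as "the same" as any list that contains all of its elements. The sketch below illustrates this with made-up sample lists and shows one possible symmetric variant (an illustration, not part of the original answer):

```mathematica
ClearAll[a, b, d];

Complement[{a}, {a, b}] === {}   (* True: nothing in {a} is missing from {a, b} *)
Complement[{a, b}, {a}] === {}   (* False: b is left over *)

(* A symmetric variant: compare the sub-lists as sets *)
Union[{{a, d}, {d, a}}, SameTest -> (Union[#1] === Union[#2] &)]   (* {{a, d}} *)
```

Like any custom SameTest, the symmetric variant still forces pairwise comparisons.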
Thies Heidecke
11

It might be that you're slightly misunderstanding what Union does. It finds the union of the elements of the list that is passed to it, but it doesn't dig into lists within that list. So when you write Union[{{a,d},{a,d}}], the function sees a list with two elements, {a,d} (that's element 1) and {a,d} (that's element 2). They are the same, so it removes the duplicate and returns just {a,d}. But when you write Union[{{a,d},{d,a}}], it sees a list with two different elements: {a,d} (that's element 1) and {d,a} (that's element 2). The fact that those two lists contain the same items is irrelevant; they're not equal, according to an ordered element-by-element comparison, so Union has no duplicates to remove.

Now, it seems like what you're trying to do is get all lists which are distinct in terms of their content, irrespective of order - in other words, you're treating the lists as mathematical sets. I think Union[Sort/@lsts] should be a fine way to go, because that's the standard method of comparing sets for equality when you don't have an actual unordered set type. (If Mathematica does, I don't know about it.)
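The element-by-element comparison described above can be seen directly with SameQ, which is the default test Union uses:

```mathematica
ClearAll[a, d];

{a, d} === {d, a}               (* False: same items, different order *)
Sort[{a, d}] === Sort[{d, a}]   (* True: both become {a, d} *)
```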

David Z
  • 2
    If he wants to treat the lists as mathematical sets (i.e. also consider duplicates as irrelevant), then using Union also for sorting would be a better idea (because it removes duplicates). That is, for the list lsts={{a, b}, {a, b, b}}, Union[Sort/@lsts] will give lsts back, while Union[Union/@lsts] will give just {{a,b}}. – celtschk Jan 18 '12 at 07:23
7

You could do the following:

lists = {{c, b, a}, {c, a, b}};
Union[lists, SameTest -> (Sort[#1] == Sort[#2] &)]

Note that the result is {{c, a, b}}, whose sub-list is left unsorted. With a custom SameTest, however, the underlying algorithm can no longer rely on canonical sorting to detect duplicates and has to fall back on pairwise comparisons. As a result the time complexity is quadratic, which will slow down your code considerably for very long lists. Thus, I'd advise against this approach. Sorting the sub-lists first, as you've done, is preferable.

Mark McClure
4

Your problem is reminiscent of the thing one has to do when dealing with noncommutative monomials.

To add to previous answers:

Depending on the typical content of your lists (especially if you have a lot of strictly identical elements), it might be beneficial to apply Union twice, in this way

Union[ Sort/@(Union[ lst ]) ]

or

Union[ Union[ lst], SameTest -> (Equal[Sort[#1],Sort[#2]] &) ]

if you want to retain some of the diversity of the original instead of having everything mapped to a canonically sorted form.

The problem is more complex when you consider more deeply structured lists of course. You might end up with a very costly comparator function.

An object oriented approach would be to define for each object symbol a comparator function that would be called automatically as SameTest when a modified Union is called with arguments of a given head.
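A rough sketch of that idea follows. It swaps the per-head comparator for a per-head canonical key, which avoids a pairwise SameTest. The names canonicalKey, unionByKey, and bag are hypothetical, and DeleteDuplicatesBy assumes Mathematica 10 or later.

```mathematica
ClearAll[canonicalKey, unionByKey, bag, a, d];

(* Hypothetical head "bag": argument order is declared irrelevant *)
canonicalKey[bag[elems___]] := bag @@ Sort[{elems}];

(* Every other expression is its own key *)
canonicalKey[other_] := other;

(* Drop items whose keys coincide, keeping the first occurrence *)
unionByKey[items_List] := DeleteDuplicatesBy[items, canonicalKey];

unionByKey[{bag[d, a], {d, a}, bag[a, d], {a, d}}]
(* {bag[d, a], {d, a}, {a, d}} *)
```

Here only the bag expressions collapse, while the plain lists stay distinct because no order-insensitive key was defined for List.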

ogerard
  • 1
    In testing (which I admit was not comprehensive, but it did cover very small through very large lists), Union[Sort /@ lst] was consistently the fastest approach. I tried various combinations of DeleteDuplicates, Tally, Orderless functions, sequences of tests (e.g., test on Max and then sort), and even Hash. – whuber Jan 18 '12 at 00:53
3

To get rid of items that are duplicates under Sort you may use this:

GatherBy[lsts, Sort][[All, 1]]

Afterward you may sort or manipulate that list as you see fit.
Be warned that there is apparently a bug in Mathematica 7 with this specific code.

New in Mathematica 10:

DeleteDuplicatesBy[lsts, Sort]
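For illustration, with made-up sample data, both forms keep the first representative of each group in its original, unsorted form (DeleteDuplicatesBy requires version 10 or later):

```mathematica
ClearAll[a, b, c, d];
lsts = {{d, a}, {a, d}, {b, c}};

(* Group sub-lists that agree after sorting, then take the first of each group *)
GatherBy[lsts, Sort][[All, 1]]    (* {{d, a}, {b, c}} *)

(* Same result, stated directly *)
DeleteDuplicatesBy[lsts, Sort]    (* {{d, a}, {b, c}} *)
```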
Mr.Wizard
1

I am not sure you are doing it right. Union expects one set per argument. As you only give it one argument, you are basically taking the union over a single set A, which is just A itself. What you want to do is Union @@ lsts, which is Apply[Union, lsts].
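For comparison, here is what the two forms return on the question's data. Note that Union @@ lsts merges the elements of all sub-lists into one flat set, whereas the question asks for the list of distinct sub-lists:

```mathematica
ClearAll[a, d];
lsts = {{a, d}, {d, a}};

Union @@ lsts         (* {a, d}: one flat set of elements *)
Union[Sort /@ lsts]   (* {{a, d}}: distinct sub-lists, order ignored *)
```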

rm -rf
niklasfi