Is there any way to instruct Complement to skip the sorting part? If the answer is no (likely), the next question would be: how can I remove from a sorted list another sorted list (in an efficient way)?

I would like the output of the following code to be {1,3} as in the usual Complement function:

universe = {1, 2, 3};
subst = {2} ;
myComplement[universe, subst]

But the output of the following code should be {3, 2, 1}, because it stopped trying to remove terms after it realized that subst[[1]] > universe[[1]].

universe = {3, 2, 1};
subst = {2};
myComplement[universe, subst]

This question is different from a simple use of Complement or DeleteCases because, for efficiency, we must exploit the fact that the lists are (assumed) sorted. In the example below, the list playing the role of universe (casesToCheck) has $2^{25}$ elements.
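
One way to exploit the sortedness directly is a single merge-style pass over both lists. The following is only a minimal sketch, not a built-in facility: sortedComplementSketch is a hypothetical name, and it assumes both lists are sorted in ascending order and hold numbers that can be compared with <. When universe is not sorted, as in the second example above, it stops matching early and returns the list unchanged.

```mathematica
(* Hypothetical helper: complement of two ascending-sorted numeric lists. *)
(* A pointer j walks through subst once while Select walks through universe. *)
sortedComplementSketch[universe_List, subst_List] :=
 Module[{j = 1, m = Length[subst]},
  Select[universe,
   Function[u,
    (* skip entries of subst that are smaller than the current element *)
    While[j <= m && subst[[j]] < u, j++];
    (* keep u unless it equals the current entry of subst *)
    ! (j <= m && subst[[j]] == u)]]]

sortedComplementSketch[{1, 2, 3}, {2}]   (* {1, 3} *)
sortedComplementSketch[{3, 2, 1}, {2}]   (* {3, 2, 1} *)
```

Each element of either list is visited once, so the work is linear in the combined length. Because the loop runs in the top-level evaluator, it may well be slower in practice than the built-in Complement on large inputs.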


The motivation for this question.

While answering this post, I tried to go from listing the equivalence classes of 4x4 matrices (as requested by the original question) to the 5x5 case (just to push it). My answer solved the 4x4 case in 4 seconds, but I estimate that the same algorithm would take 3 hours for the 5x5 case.

toMatrix = Partition[IntegerDigits[#, 2, 25], 5] &;
toInteger = FromDigits[Flatten[#], 2] &;
allSyms = 
  Module[{s1, s3, s4, s6}, {#, s1 = Reverse[#, {2}], Reverse[#], 
      s3 = Transpose[#], s4 = Reverse[s3, {2}], 
     Reverse[Transpose[s1], {2}], s6 = Reverse[Transpose[s4], {2}], 
     Reverse[Transpose[s6], {2}]}] &;
casesToCheck = Range[0, 2^25 - 1];
Timing[answer = {MatrixForm@toMatrix@First[#], 
     Length[#]} & /@ (Reap[
      NestWhile[
       Complement[#, 
         Sow[Union[toInteger /@ allSyms[toMatrix[#[[1]]]]]]] &, 
       casesToCheck, (n = Length[#]) > 0 &]][[2, 1]]); Length[answer]]

Obviously, there is a major bottleneck that is triggered now. I guess it is because Complement sorts the input. Instead of sorting a list of 65536 integers, the 5x5 case deals with a list of 33554432 integers.

Hector
  • Can you simplify your question somewhat and supply some simple test cases and desired results? – Yves Klett Aug 14 '13 at 07:49
  • Related / possible duplicate (see answers for unsorted complement): http://mathematica.stackexchange.com/q/1290/131. In short, DeleteCases[universe, Alternatives @@ subst] should do what you want (no clue about performance). – Yves Klett Aug 14 '13 at 08:22
  • @Yves I am not voting to close. This is closely related, but different. A complement should not include duplicate elements, yet e.g. DeleteCases[{3, 5, 7, 1, 4, 5}, 3 | 2 | 7] does. On the other hand there is a faster method for a true complement which I have posted in an answer. – Mr.Wizard Aug 14 '13 at 08:32
  • Wait a minute; I just read "But the output of the following code should be {3,2,1} because it stopped trying to remove terms after it realized that subst[[1]] > universe[[1]]." again; this I don't understand. This does not sound like a complement at all. – Mr.Wizard Aug 14 '13 at 08:44
  • @Mr.Wizard not regarding the conflicting question title and content you are correct - there would be need for another DeleteDuplicates. – Yves Klett Aug 14 '13 at 08:48
  • @Yves at least in v7 my orderedComplement is much faster than oc2[all_List, i__List] := DeleteCases[DeleteDuplicates @ all, Alternatives @@ Union[i]] -- can you confirm in a later version? – Mr.Wizard Aug 14 '13 at 08:56
  • @Mr.Wizard depending on the number of elements your version is about two times faster (1000 elements) and drawing equal at 10^7 elements... – Yves Klett Aug 14 '13 at 09:21
  • @YvesKlett Boy, they sure did improve the speed of that in later versions. It's orders of magnitude different in v7. – Mr.Wizard Aug 14 '13 at 09:22
  • There were issues in 7 with DeleteDuplicates as well (e.g. vs. Tally) which obviously were fixed in the meantime. – Yves Klett Aug 14 '13 at 09:25
  • Hector, would you mind commenting on @Mr.Wizard's question, specifically: should the output for {3,2,1} and {2} be {3,1} or {3,2,1} (which does not really make sense)? Please edit the question accordingly. – Yves Klett Aug 15 '13 at 07:22
  • @Mr.Wizard "the output of ..." refers to the fact that you can assume that the input lists have already been sorted. I'll edit the question to make this more specific. – Hector Aug 15 '13 at 13:08
  • @Hector Regarding your update, since applying Sort to an already sorted list is quite fast (your 2^25 example takes under 1.5 seconds here) I doubt you're going to be able to beat Complement by much. If both lists are sorted and internally free of duplicates this code may be a little faster than Complement: DeleteDuplicates[Join@##] ~Drop~ Length[#] &[subst, universe]. Do you have reason to believe a faster method is even possible? – Mr.Wizard Aug 15 '13 at 13:27
  • @Mr.Wizard Let me do some timings. As for a reason to believe that there is a faster method, I think that if we remove the redundant part from an algorithm it will run faster. The problem is that Mathematica does not have the tools to remove the redundant part from Complement. – Hector Aug 15 '13 at 13:36
  • @Hector I disagree. First, I don't think it is redundant; I believe that the complement operation is done during the sort. Second, my proposed method (in the above comment) does not do any sorting and it's pretty much the same speed (on presorted lists) as Complement. I'm afraid you're seeking the impossible here, but I'm always happy to be proven wrong if I can learn from it. – Mr.Wizard Aug 15 '13 at 13:38
  • Thanks for the Accept. I'm sorry I don't have anything more satisfying for you. I included my simplified version in my answer and preserved the original for anyone who comes upon this question and has the other meaning in mind. – Mr.Wizard Aug 15 '13 at 14:40
  • Complement constructs a new list each time it is invoked. This makes your algorithm's running time quadratic in the size of casesToCheck, even if Complement runs in constant time. Since your goal is to pull out equivalence classes, it would be better to traverse casesToCheck once, sowing each element which is the first element in its (presorted) equivalence class. This is a linear- – Tobias Hagge Aug 15 '13 at 14:59
  • -(oops)- time algorithm since the equivalence classes have a constant bound on their size. – Tobias Hagge Aug 15 '13 at 15:38
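
The single-pass idea from the last two comments might be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the names k, rot and isRepresentative are hypothetical, the symmetries are generated from quarter turns and a transpose rather than with the question's allSyms, and the size is reduced to 3x3 so that the sketch runs quickly.

```mathematica
k = 3;  (* assumed small size for illustration; the question uses 5 *)
toMatrix = Partition[IntegerDigits[#, 2, k^2], k] &;
toInteger = FromDigits[Flatten[#], 2] &;

(* quarter turn: transpose, then reverse each row *)
rot = Reverse[Transpose[#], {2}] &;

(* the 8 symmetries of the square: 4 rotations and 4 reflections *)
allSyms = Join[NestList[rot, #, 3], NestList[rot, Transpose[#], 3]] &;

(* n represents its class iff no symmetric image encodes a smaller integer *)
isRepresentative[n_] := n == Min[toInteger /@ allSyms[toMatrix[n]]];

(* one pass over all cases, no repeated Complement *)
reps = Select[Range[0, 2^(k^2) - 1], isRepresentative];
Length[reps]
```

Every integer is examined once and each test touches at most 8 matrices, so the total work grows linearly with the number of cases instead of rebuilding the remaining list after every class.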

2 Answers

Revised answer

For the true meaning of the question as it has now been clarified:

I doubt you can improve greatly on Complement: I believe the sort is not superfluous but an integral part of its algorithm. I can only offer my orderedComplement without the pre-processing; in some cases it may be faster, as it uses a different algorithm that does not sort:

presortedComplement[all_List, i_List] :=
  DeleteDuplicates[Join[i, all]] ~Drop~ Length[i]

This assumes an input of (only) two sorted lists of unique elements as would be output by Union.
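
A short check, added here as an illustration with made-up inputs, shows what that assumption buys and what happens when it is violated. The definition is repeated so the snippet stands alone.

```mathematica
presortedComplement[all_List, i_List] :=
  DeleteDuplicates[Join[i, all]] ~Drop~ Length[i]

(* inputs as Union would return them: sorted, no duplicates *)
presortedComplement[{1, 2, 3, 4, 5}, {2, 4}]   (* {1, 3, 5} *)

(* elements of the second list that are absent from the first are harmless *)
presortedComplement[{1, 2, 3}, {2, 9}]         (* {1, 3} *)

(* a duplicate in the second list breaks the Drop count: 1 is lost *)
presortedComplement[{1, 2, 3}, {2, 2}]         (* {3} *)
```

The last call drops one element too many, because DeleteDuplicates has already shortened the prefix that Drop is supposed to remove.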


Original answer

I propose:

orderedComplement[all_List, i__List] := 
  DeleteDuplicates[Join @ ##] ~Drop~ Length[#] &[Union @ i, DeleteDuplicates @ all]

orderedComplement[{3, 5, 7, 1, 4, 5}, {3, 2}, {7}]
{5, 1, 4}

As Leonid described, in version 8 DeleteCases has been optimized in such a fashion that it may have acceptable speed, and it is conceptually simpler; in version 7 it is orders of magnitude slower:

ocV8[all_List, i__List] := DeleteCases[DeleteDuplicates @ all, Alternatives @@ Union[i]]
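
As a quick sanity check (a sketch using the example input from above), both definitions can be compared with a bare DeleteCases, which keeps the repeated 5:

```mathematica
orderedComplement[all_List, i__List] :=
  DeleteDuplicates[Join @ ##] ~Drop~ Length[#] &[Union @ i, DeleteDuplicates @ all]

ocV8[all_List, i__List] :=
  DeleteCases[DeleteDuplicates @ all, Alternatives @@ Union[i]]

data = {3, 5, 7, 1, 4, 5};

DeleteCases[data, 3 | 2 | 7]          (* {5, 1, 4, 5}: duplicate 5 survives *)
orderedComplement[data, {3, 2}, {7}]  (* {5, 1, 4} *)
ocV8[data, {3, 2}, {7}]               (* {5, 1, 4} *)
```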
Mr.Wizard

This may be faster than using Alternatives:

universe = {3, 2, 4, 2, 7, 1};
subst = {2, 7, 8};

myComplement[universe_, subst_] := Module[{test},
  Scan[(test@# = True) &, subst];
  DeleteCases[universe, _?test]]

myComplement[universe, subst]

{3, 4, 1}

This solution is based on answers from here: Quick multiple selections from a list

Timings

On version 7

universe = RandomInteger[1000000, 100000];
subst = RandomInteger[20000, 20000];

myComplement[universe_, subst_] := Module[{test},
   Scan[(test@# = True) &, Union@subst];
   DeleteCases[DeleteDuplicates@universe, _?test]];

orderedComplement[all_List, i__List] := 
  DeleteDuplicates[Join@##]~Drop~Length[#] &[Union@i, 
   DeleteDuplicates@all];

ocV7[all_List, i__List] := 
  DeleteCases[DeleteDuplicates@all, Alternatives @@ Union[i]];

Print[First@Timing@myComplement[universe, subst], " seconds"];
Print[First@Timing@orderedComplement[universe, subst], " seconds"];
Print[First@Timing@ocV7[universe, subst], " seconds"];

0.125 seconds

0.0 seconds

19.765 seconds

On version 9

universe = RandomInteger[1000000, 100000];
subst = RandomInteger[20000, 20000];

myComplement[universe_, subst_] := Module[{test},
   Scan[(test@# = True) &, Union@subst];
   DeleteCases[DeleteDuplicates@universe, _?test]];

ocV9[all_List, i__List] := 
  DeleteCases[DeleteDuplicates@all, Alternatives @@ Union[i]];

Print[First@Timing@myComplement[universe, subst], " seconds"];
Print[First@Timing@orderedComplement[universe, subst], " seconds"];
Print[First@Timing@ocV9[universe, subst], " seconds"];

0.140401 seconds

0.015600 seconds

0.031200 seconds

Chris Degnen
  • (1) The question has been altered to mean something rather different (from what I understood it to be). (2) Have you actually timed this? – Mr.Wizard Aug 15 '13 at 14:03
  • I've just done the timings... Interesting result for version 7 users. – Chris Degnen Aug 15 '13 at 14:12
  • Chris, I'm not trying to be an ass (but I'm probably succeeding...) but I'm not seeing the point of this answer. Alternatives in v7 is horribly slow, yes. As shown in v9 it's a lot better. And in both versions I believe orderedComplement is fastest of all. More importantly this doesn't answer the question in its present state; if it were not for that I'd give my vote because I like alternatives but as it is it doesn't improve upon my answer for the old question or answer the new one. // Off[grumpy-pants] & – Mr.Wizard Aug 15 '13 at 14:15
  • I initially thought your ocV8 was a simpler case of your orderedComplement solution. I'll see if I can improve my answer. – Chris Degnen Aug 15 '13 at 14:20
  • @Mr.Wizard The question has not changed: how can I remove from a sorted list another sorted list (in an efficient way). I added to emphasize the "sorted" assumptions but those requirements were always there. By the way, your answer is 6% faster than Complement. Also, thanks for all the effort. I try to give back by answering some questions on my own. – Hector Aug 15 '13 at 14:21
  • @Hector as I said "from what I understood it to be." In the original wording I took sorted to mean ordered as that seemed the more likely meaning. – Mr.Wizard Aug 15 '13 at 14:25
  • Have added orderedComplement timings. Ambivalent about deleting. Funnily enough "> 0. seconds" rendered as "1. seconds"! (Try it in edit mode and see!) – Chris Degnen Aug 15 '13 at 14:34
  • @ChrisDegnen I appreciate the effort. I'll take Mr. Wizard's as the answer, but I am upvoting yours too. – Hector Aug 15 '13 at 14:37
  • Chris, I'm giving a +1 for the comparative timings that I was unable to provide myself. – Mr.Wizard Aug 15 '13 at 14:44