5

I have a dataset data1 with 4 columns and would like to remove duplicate rows by comparing only the first 3 columns, so I used the code below:

data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];
select1 = Timing@DeleteDuplicates[data1, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &]

The code works, but as the range of values and the number of rows grow, it slows down very quickly. There must be a quicker way: if I first drop the column that is not compared and only then apply DeleteDuplicates, it runs more than 10 times faster:

select2 = Timing@DeleteDuplicates@data1[[All, 1 ;; 3]]

The drawback of select2 is that I lose the fourth column. Of course, I can add the lost column back afterwards, but the code becomes very clumsy (one such attempt is sketched below), and I wonder whether there is some other way to get this done.
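Here is a rough sketch of the kind of re-joining I mean (keys and restored are names made up for this illustration); it rescans data1 for every surviving key, which is exactly the sort of clumsiness I would like to avoid:

keys = DeleteDuplicates@data1[[All, 1 ;; 3]];
(* look up the first full row whose first three entries match each key *)
restored = FirstCase[data1, {Sequence @@ #, _}] & /@ keys;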

Many thanks!

H42
  • DeleteDuplicatesBy[data1, #[[;; 3]] &] is certainly more readable, and I would expect faster as well. – MarcoB Jun 15 '18 at 20:12
  • You should wait to mark something as accepted. I put up an answer that is actually a little faster than Marco's. :) – rcollyer Jun 15 '18 at 20:30
  • 1
    @rcollyer At larger scale, e.g. with 2,000,000 points, the DeleteDuplicatesBy version still edges out a narrow win over the faster of your two GroupBy approaches, at least on my machine :-) – MarcoB Jun 15 '18 at 20:36
  • 2
    @MarcoB on my machine, the switchover is at approx. 425000 points. And, Henrik's is an order of magnitude faster than ours, with the caveat of not being order preserving. If that is not needed, use his, otherwise use mine for shorter lengths, and yours for longer. Win. – rcollyer Jun 15 '18 at 20:48

4 Answers

6

Use DeleteDuplicatesBy with an appropriate function that selects only the first three columns:

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];

(* Yours *)
AbsoluteTiming[
 DeleteDuplicates[moredata, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];
]
(* 5.38 s *)


(* Slightly improved, still using DeleteDuplicates *)
AbsoluteTiming[DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &];]
(* 2.44 s *)


(* DeleteDuplicatesBy *)
AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
(* 0.05 s *)

The latter takes 0.05 seconds whereas your original formulation takes roughly 5 seconds: about a 100× speedup.
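Since both forms keep the first occurrence of each key triple, they should also return identical results, which is quick to check:

DeleteDuplicatesBy[moredata, #[[;; 3]] &] ===
 DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &]
(* expected: True *)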

MarcoB
6

This is not meant completely seriously, but in some performance-critical situations, SparseArray can perform miracles.

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
a = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; // RepeatedTiming // First

b = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
     With[{p = moredata[[All, 1 ;; 3]] + 1},
      (* rows with equal key triples map to the same position; the first rule
         wins, so each stored value is the row index of a first occurrence *)
      SparseArray[p -> Range[Length[p]], Max /@ Transpose[p]]
     ]
    ]; // RepeatedTiming // First

Sort@a == Sort@b

0.041

0.0039

True

Note that you can save even more time if you know the maximal dimensions of the SparseArray in advance.

c = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
     With[{p = moredata[[All, 1 ;; 3]] + 1}, 
      SparseArray[p -> Range[Length[p]], {3,3,3}]
     ]
    ]; // RepeatedTiming // First

0.00335

I'd like to add that my approach exploits several assumptions, so it has some severe limitations:

  1. All entries in moredata[[All, 1 ;; 3]] are nonnegative integers, so that they can be used as sparsity pattern of a SparseArray (after adding 1).

  2. The ordering of the result does not really matter to you (though see the sketch below for a way around this).
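The second limitation can be lifted with a small variation on the same trick (a sketch under the same assumptions, not benchmarked above): the stored values are the row indices of the first occurrences, so sorting them and using them as part indices returns the deduplicated rows in input order.

d = moredata[[Sort[#["NonzeroValues"]]]] &[
     With[{p = moredata[[All, 1 ;; 3]] + 1},
      SparseArray[p -> Range[Length[p]], {3, 3, 3}]
     ]
    ];

d === DeleteDuplicatesBy[moredata, #[[;; 3]] &]
(* expected: True *)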

Henrik Schumacher
5

Here are a couple of alternatives using other functions.

In[64]:= moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];

In[86]:= AbsoluteTiming[DeleteDuplicates[moredata, 
  {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];]
Out[86]= {4.73677, Null}

(* Marco's DeleteDuplicates *)
In[88]:= AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
Out[88]= {0.044251, Null}

(* a Reap/Sow based solution *)
In[80]:= Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]]; // AbsoluteTiming
Out[80]= {0.161927, Null}

(* GroupBy using the third argument to reduce the results *)
In[82]:= Values@GroupBy[moredata, (#[[;; 3]] &), First]; // AbsoluteTiming
Out[82]= {0.041241, Null}

(* GroupBy using Part instead of First *)
In[81]:= Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
Out[81]= {0.038312, Null}
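
All of these keep the first row for each key triple in order of first appearance, so they should agree with one another and with DeleteDuplicatesBy; a quick consistency check (a sketch, not part of the timings above):

SameQ[
 DeleteDuplicatesBy[moredata, #[[;; 3]] &],
 Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]],
 Values@GroupBy[moredata, (#[[;; 3]] &), First],
 Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]]
(* expected: True *)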
rcollyer
2

Here is a refinement of @rcollyer's answer using my GatherByList function. The function is short:

GatherByList[list_, representatives_] := Module[{func},
    (* GatherBy[list, func] computes Map[func, list] internally; this UpValue
       intercepts that call and returns the precomputed representatives instead *)
    func /: Map[func, _] := representatives;
    GatherBy[list, func]
]
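A tiny illustration of what it does (with made-up symbols): the elements of the first argument are gathered according to equality of the corresponding representatives.

GatherByList[{a, b, c, d}, {1, 2, 1, 2}]
(* {{a, c}, {b, d}} *)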

And using GatherByList:

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];

(* MarcoB *)
r1 = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; //AbsoluteTiming
(* rcollyer *)
r2 = Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
(* GatherByList *)
r3 = GatherByList[moredata, moredata[[All,;;3]]][[All, 1]]; //AbsoluteTiming

r1 === r2 === r3

{0.047484, Null}

{0.043889, Null}

{0.029294, Null}

True

Not as fast as @Henrik's answer, of course.

Carl Woll