5

I have a dataset data1 with 4 columns and would like to remove duplicate rows by comparing only the first 3 columns, so I used the code below:

data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];
select1 = Timing@DeleteDuplicates[data1, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &]

The code works, but as the range of values and the number of rows grow, it slows down very quickly. There must be a quicker way: if I first drop the column that is not compared and only then apply DeleteDuplicates, it runs more than 10 times faster:

select2 = Timing@DeleteDuplicates@data1[[All, 1 ;; 3]]

The drawback of select2 is that I lose the fourth column. Of course, I can add the lost column back afterwards, but the code becomes very clumsy (one such attempt is sketched below), and I wonder whether there is some other way to get this done.
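Here is a rough sketch of the kind of re-joining I mean (keys and restored are names made up for this illustration); it rescans data1 for every surviving key, which is exactly the sort of clumsiness I would like to avoid:

keys = DeleteDuplicates@data1[[All, 1 ;; 3]];
(* look up the first full row whose first three entries match each key *)
restored = FirstCase[data1, {Sequence @@ #, _}] & /@ keys;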

Many thanks!

H42
  • DeleteDuplicatesBy[data1, #[[;; 3]] &] is certainly more readable, and I would expect faster as well. – MarcoB Jun 15 '18 at 20:12
  • You should wait to mark something as accepted. I put up an answer that is actually a little faster than Marco's. :) – rcollyer Jun 15 '18 at 20:30
  • 1
    @rcollyer At larger scale, e.g. with 2,000,000 points, the DeleteDuplicatesBy version still edges out a narrow win over the faster of your two GroupBy approaches, at least on my machine :-) – MarcoB Jun 15 '18 at 20:36
  • 2
    @MarcoB on my machine, the switchover is at approx. 425000 points. And, Henrik's is an order of magnitude faster than ours, with the caveat of not being order preserving. If that is not needed, use his, otherwise use mine for shorter lengths, and yours for longer. Win. – rcollyer Jun 15 '18 at 20:48

4 Answers

6

Use DeleteDuplicatesBy with an appropriate function that selects only the first three columns:

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];

(* Yours *)
AbsoluteTiming[
 DeleteDuplicates[moredata, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];
]
(* 5.38 s *)


(* Slightly improved, still using DeleteDuplicates *)
AbsoluteTiming[DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &];]
(* 2.44 s *)


(* DeleteDuplicatesBy *)
AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
(* 0.05 s *)

The latter takes 0.05 seconds whereas your original formulation takes roughly 5 seconds: about a 100× speedup.
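Since both forms keep the first occurrence of each key triple, they should also return identical results, which is quick to check:

DeleteDuplicatesBy[moredata, #[[;; 3]] &] ===
 DeleteDuplicates[moredata, #1[[;; 3]] == #2[[;; 3]] &]
(* expected: True *)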

MarcoB
6

This is not meant completely seriously, but in some performance-critical situations, SparseArray can perform miracles.

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];
a = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; // RepeatedTiming // First

b = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
     With[{p = moredata[[All, 1 ;; 3]] + 1},
      (* rows with equal key triples map to the same position; the first rule
         wins, so each stored value is the row index of a first occurrence *)
      SparseArray[p -> Range[Length[p]], Max /@ Transpose[p]]
     ]
    ]; // RepeatedTiming // First

Sort@a == Sort@b

0.041

0.0039

True

Note that you can save even more time if you know the maximal dimensions of the SparseArray in advance.

c = Join[#["NonzeroPositions"] - 1, moredata[[#["NonzeroValues"], 4 ;;]], 2] &[
     With[{p = moredata[[All, 1 ;; 3]] + 1}, 
      SparseArray[p -> Range[Length[p]], {3,3,3}]
     ]
    ]; // RepeatedTiming // First

0.00335

I'd like to add that my approach exploits several assumptions, so it has some severe limitations:

  1. All entries in moredata[[All, 1 ;; 3]] are nonnegative integers, so that they can be used as sparsity pattern of a SparseArray (after adding 1).

  2. The ordering of the result does not really matter to you (though see the sketch below for a way around this).
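The second limitation can be lifted with a small variation on the same trick (a sketch under the same assumptions, not benchmarked above): the stored values are the row indices of the first occurrences, so sorting them and using them as part indices returns the deduplicated rows in input order.

d = moredata[[Sort[#["NonzeroValues"]]]] &[
     With[{p = moredata[[All, 1 ;; 3]] + 1},
      SparseArray[p -> Range[Length[p]], {3, 3, 3}]
     ]
    ];

d === DeleteDuplicatesBy[moredata, #[[;; 3]] &]
(* expected: True *)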

Henrik Schumacher
5

Here are a couple of alternatives using other functions.

In[64]:= moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];

In[86]:= AbsoluteTiming[DeleteDuplicates[moredata, 
  {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &];]
Out[86]= {4.73677, Null}

(* Marco's DeleteDuplicates *)
In[88]:= AbsoluteTiming[DeleteDuplicatesBy[moredata, #[[;; 3]] &];]
Out[88]= {0.044251, Null}

(* a Reap/Sow based solution *)
In[80]:= Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]]; // AbsoluteTiming
Out[80]= {0.161927, Null}

(* GroupBy using the third argument to reduce the results *)
In[82]:= Values@GroupBy[moredata, (#[[;; 3]] &), First]; // AbsoluteTiming
Out[82]= {0.041241, Null}

(* GroupBy using Part instead of First *)
In[81]:= Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
Out[81]= {0.038312, Null}
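
All of these keep the first row for each key triple in order of first appearance, so they should agree with one another and with DeleteDuplicatesBy; a quick consistency check (a sketch, not part of the timings above):

SameQ[
 DeleteDuplicatesBy[moredata, #[[;; 3]] &],
 Reap[Sow[#, {#[[;; 3]]}] & /@ moredata][[2, All, 1]],
 Values@GroupBy[moredata, (#[[;; 3]] &), First],
 Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]]
(* expected: True *)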
rcollyer
2

Here is a refinement of @rcollyer's answer using my GatherByList function. The function is short:

GatherByList[list_, representatives_] := Module[{func},
    (* GatherBy[list, func] computes Map[func, list] internally; this UpValue
       intercepts that call and returns the precomputed representatives instead *)
    func /: Map[func, _] := representatives;
    GatherBy[list, func]
]
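A tiny illustration of what it does (with made-up symbols): the elements of the first argument are gathered according to equality of the corresponding representatives.

GatherByList[{a, b, c, d}, {1, 2, 1, 2}]
(* {{a, c}, {b, d}} *)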

And using GatherByList:

moredata = BlockRandom[SeedRandom[12]; RandomInteger[2, {60000, 4}]];

(* MarcoB *)
r1 = DeleteDuplicatesBy[moredata, #[[;; 3]] &]; //AbsoluteTiming
(* rcollyer *)
r2 = Values@GroupBy[moredata, (#[[;; 3]] &)][[All, 1]]; // AbsoluteTiming
(* GatherByList *)
r3 = GatherByList[moredata, moredata[[All,;;3]]][[All, 1]]; //AbsoluteTiming

r1 === r2 === r3

{0.047484, Null}

{0.043889, Null}

{0.029294, Null}

True

Not as fast as @Henrik's answer, of course.

Carl Woll