I have a dataset data1 with 4 columns and would like to remove duplicates by comparing only the first 3 columns, so I used the code below:
data1 = BlockRandom[SeedRandom[12]; RandomInteger[2, {600, 4}]];
select1 = Timing@DeleteDuplicates[data1, {#1[[1]], #1[[2]], #1[[3]]} == {#2[[1]], #2[[2]], #2[[3]]} &]
The code works, but as the range of values and the number of rows grow, it slows down very quickly. There must be a faster way, since if I drop the non-compared columns before calling DeleteDuplicates, it runs more than 10 times faster:
select2 = Timing@DeleteDuplicates@data1[[All, 1 ;; 3]]
The drawback of select2 is that I lose the dropped column. Of course, I could add it back afterwards, but the code would become very clumsy, and I wonder if there is another way to get this done.
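For illustration, this is roughly what adding the column back would look like: dedupe the key columns, then look up the first full row matching each key (a rough sketch, not tuned for speed):

(* reattach the fourth column by looking up the first full row for each
   deduplicated key -- a sketch of the "clumsy" route *)
keys = DeleteDuplicates@data1[[All, 1 ;; 3]];
select2full = FirstCase[data1, {Sequence @@ #, _}] & /@ keys;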
Many thanks!
DeleteDuplicatesBy[data1, #[[;; 3]] &] is certainly more readable, and I would expect faster as well. – MarcoB Jun 15 '18 at 20:12

The DeleteDuplicatesBy version still edges out a narrow win over the faster of your two GroupBy approaches, at least on my machine :-) – MarcoB Jun 15 '18 at 20:36
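A minimal sketch of the DeleteDuplicatesBy approach MarcoB suggests, alongside a GroupBy-based variant; the GroupBy form here is an assumption, since the approaches referred to in the comments are not shown:

(* keeps full rows while comparing only the first three columns *)
select3 = Timing@DeleteDuplicatesBy[data1, #[[1 ;; 3]] &]
(* assumed GroupBy variant: keep the first row in each key group *)
select4 = Timing@Values@GroupBy[data1, #[[1 ;; 3]] &, First]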