
Given a list containing millions of numbers, I want to find the K most common values. Mathematica's built-in Commonest is very slow, so I wrote my own versions:

a=RandomInteger[{1,1000000},50000000];

MyCommonest1[a_,n_]:=Take[SortBy[Tally[a],Last],-n]

MyCommonest2[a_,n_]:=Module[{b=Tally[a]},Take[b[[Reverse[Ordering[b[[All,2]]]]]],n]]

MyCommonest3[a_,n_]:=Module[{b=Tally[a]},b[[Take[Ordering[b[[All,2]]],-n]]]]

MyCommonest4[a_,n_]:=Take[SortBy[Tally[a],-#[[2]]&],n]

Timings for the four versions, in order:

1.68
1.12
1.15
1.70

Can it be any faster?

Edit

The C# counterpart of this question on SO is here.

Mohsen Afshin

3 Answers


The only improvement I can think of:

comm[a_, n_] := #[[ Ordering[#[[All, 2]], -n] ]] & @ Tally[a]

Or if you don't want the counts:

comm2[a_, n_] := #[[ Ordering[#2, -n] ]] & @@ (Tally[a]\[Transpose])

Test:

MyCommonest2[a, 15] // Timing // First
comm[a, 15]         // Timing // First
comm2[a, 15]        // Timing // First
0.3276

0.3056

0.2932
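
To see what comm does step by step, here is a small sketch on a made-up list (the name small and its values are only for illustration). Ordering[list, -n] gives the positions of the n largest elements, so the result comes back in ascending order of count; wrap it in Reverse if you want the most frequent value first.

```mathematica
(* comm as defined above, repeated so the example is self-contained *)
comm[a_, n_] := #[[ Ordering[#[[All, 2]], -n] ]] & @ Tally[a]

(* made-up sample data: 3 occurs 3 times, 1 twice, 2 four times *)
small = {3, 1, 3, 2, 3, 1, 2, 2, 2};

tally = Tally[small]
(* {{3, 3}, {1, 2}, {2, 4}} *)

(* positions of the 2 largest counts, smallest of them first *)
Ordering[tally[[All, 2]], -2]
(* {1, 3} *)

comm[small, 2]
(* {{3, 3}, {2, 4}} *)

(* most frequent first *)
Reverse[comm[small, 2]]
(* {{2, 4}, {3, 3}} *)
```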

Mr.Wizard
  • Try sorting the list, then tally... might be surprised, ~40% boost in some short tests. – ciao May 01 '14 at 21:17
  • @rasher Sort is much slower than Tally on the OP's input list, at least in v7. What data are you using? – Mr.Wizard May 01 '14 at 21:43
  • e.g., RandomInteger[2000000,2000000]. On the loungebook, might be a cache locality effect, but it's consistent, and even larger with unpacked target since sort packs it... interesting. – ciao May 01 '14 at 21:45
  • Fiddled with the above some more: Same on WS, but in both cases, only when duplication is low does this happen, otherwise sort slows things dramatically. Probably useless effect, but interesting nonetheless. – ciao May 02 '14 at 03:47

Another one, which seems roughly on par with Mr.Wizard's comm2 on my machine:

comm3[a_, n_] := Pick[#1, UnitStep[#2 - RankedMax[#2, n]], 1]& @@ Transpose[Tally @ a]
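
A small sketch of how the pieces of comm3 fit together, on a made-up list (small and its values are only for illustration). RankedMax finds the n-th largest count, UnitStep marks every count at or above it with 1, and Pick keeps the matching values. Because every count equal to the threshold is marked, ties can give more than n results.

```mathematica
(* comm3 as defined above, repeated so the example is self-contained *)
comm3[a_, n_] := Pick[#1, UnitStep[#2 - RankedMax[#2, n]], 1] & @@ Transpose[Tally @ a]

(* made-up sample data: 3 and 5 both occur 3 times, 1 twice, 2 four times *)
small = {3, 1, 3, 2, 3, 1, 2, 2, 2, 5, 5, 5};

{vals, counts} = Transpose[Tally[small]]
(* {{3, 1, 2, 5}, {3, 2, 4, 3}} *)

(* second-largest count is the threshold *)
RankedMax[counts, 2]
(* 3 *)

(* 1 where the count reaches the threshold, 0 otherwise *)
UnitStep[counts - RankedMax[counts, 2]]
(* {1, 0, 1, 1} *)

(* three values for n = 2, because 3 and 5 are tied *)
comm3[small, 2]
(* {3, 2, 5} *)
```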
Simon Woods
jsat[a_, n_] := #1[[Join @@ SparseArray[
  Threshold[#2, {"LargestValues", n}]]["NonzeroPositions"]]] & @@ (Tally[a]\[Transpose])

This can return more than the requested number of elements when counts are tied.

SeedRandom[1];
a = RandomInteger[{1, 1000000}, 50000000];
jsat[a, 15] // Timing // First
comm[a, 15] // Timing // First
comm2[a, 15] // Timing // First
comm3[a, 15] // Timing // First
MyCommonest2[a, 15] // Timing // First

0.64062
0.75000
0.75000
0.71875
0.78125
kglr