15

What is the fastest way to make multiple selections from a list? Compiled methods included.

For example, here are two methods for selecting a subset, compared:-

biglist = {{5, "e", 500}, {4, "d", 400},
   {3, "c", 300}, {2, "b", 200}, {1, "a", 100}};
subset = {2, 5, 4};

Cases[biglist, {#, __}] & /@ subset

{{{2, "b", 200}}, {{5, "e", 500}}, {{4, "d", 400}}}

Cases[biglist, {Apply[Alternatives, subset], __}]

{{5, "e", 500}, {4, "d", 400}, {2, "b", 200}}

Only first method returns items in the order matched in the subset list, but it is much slower:-

n = 10000;
biggerlist = Map[{#, FromCharacterCode[Mod[# - 1, 26] + 97], #*100} &,
   Range[n]];
unsortedbiglist = RandomSample[biggerlist, n];
unsortedsubset = RandomSample[Range[n], Round[n/10]];

Row[{First[Timing[selection1 = Map[Cases[unsortedbiglist, {#, __}] &,
       unsortedsubset];]], " seconds"}]

1.123 seconds

Row[{First[Timing[selection2 = Cases[unsortedbiglist,
       {Apply[Alternatives, unsortedsubset], __}];]], " seconds"}]

0.171 seconds

The selections are the same, but differently ordered:-

SameQ[Flatten[selection1, 1],
 Extract[selection2,
  Flatten[Map[Position[First /@ selection2, #] &,
    unsortedsubset], 1]]]

True

Including the sorting routine in the selection process still gives a better timing:-

Row[{First[Timing[selection3 = Function[selection2,
        Extract[selection2,
         Flatten[Map[Position[First /@ selection2, #] &,
           unsortedsubset], 1]]][Cases[unsortedbiglist,
        {Apply[Alternatives, unsortedsubset], __}]];]], " seconds"}]

0.359 seconds

SameQ[Flatten[selection1, 1], selection3]

True

Nevertheless, only the slow Cases method returns {} when there are unmatched subset items, which is sometimes useful.

Ideas for speedy selections would be great.

m_goldberg
  • 107,779
  • 16
  • 103
  • 257
Chris Degnen
  • 30,927
  • 2
  • 54
  • 108

3 Answers3

15

This seems to give a rather decent performance (final version with improvements by jVincent):

Clear[getSubset];
getSubset[input_List,sub_List]:=
Module[{inSubQ,sowMatches},
    Scan[(inSubQ[#] := True)&,sub];
    sowMatches[x_/;inSubQ@First@x] := Sow[x,First@x];
    Apply[Sequence, Last@Reap[Scan[sowMatches, input], sub], {2}]
];

Benchmarks:

n = 10000;
biggerlist =  Map[{#, FromCharacterCode[Mod[# - 1, 26] + 97], #*100} &, Range[n]];
unsortedbiglist = RandomSample[biggerlist, n];
unsortedsubset = RandomSample[Range[n], Round[n/10]];

Row[{First[Timing[selection1=Map[Cases[unsortedbiglist,{#,__}]&,unsortedsubset];]]," seconds"}]

(* 1.170008 seconds  *)

(sel1 = getSubset[unsortedbiglist,unsortedsubset])//Short//Timing

(* {0.031,{{{8286,r,828600}},<<998>>,{{6420,x,642000}}}}  *)

selection1===sel1

(* True *)
Leonid Shifrin
  • 114,335
  • 15
  • 329
  • 420
7
 Timing[selection3 = Pick[unsortedbiglist, unsortedbiglist[[All, 1]], 
   Alternatives @@ unsortedsubset];]
 (* {0.218401, Null}  -- same as Cases[..., Alternatives@@ ..] *)
 selection3 == selection2
 (* True *)
kglr
  • 394,356
  • 18
  • 477
  • 896
6

Here is my take on Leonid's method. It's better because it's shorter and uses ~infix~. ;-)
(It's just a little bit faster, too: about 20% on his test.)

getSubset2[input_List, sub_List] := Module[{test},
  (test@# = True) & ~Scan~ sub;
  Apply[Sequence,
   Reap[Cases[input, x:{y_?test, ___} :> x ~Sow~ y], sub][[2]],
   {2}
  ]
]

getSubset2[Range@20 ~Partition~ 4, Prime ~Array~ 7]
{{}, {}, {{5, 6, 7, 8}}, {}, {}, {{13, 14, 15, 16}}, {{17, 18, 19, 20}}}

Although slower I cannot pass by the more direct implementation without comment:

getSubset3[input_List, sub_List] :=
  Last @ Reap[# ~Sow~ #[[1]] & ~Scan~ input, sub, Sequence @@ #2 &]

Also slower than getSubset2 but pleasingly clean, Association can be nicely applied to this problem in the form of GroupBy and Lookup.

getSubset4[set_, sub_] := Lookup[set ~GroupBy~ First, sub, {}]

getSubset4[Range@20 ~Partition~ 4, Prime ~Array~ 7]
{{}, {}, {{5, 6, 7, 8}}, {}, {}, {{13, 14, 15, 16}}, {{17, 18, 19, 20}}}
Mr.Wizard
  • 271,378
  • 34
  • 587
  • 1,371