13

I want to define

isGood[___] = False;

isGood[#] = True & /@ list

where list is a list of several million integers. What's the fastest way of doing this?

Ajasja
  • 13,634
  • 2
  • 46
  • 104
user13253
  • 8,666
  • 2
  • 42
  • 65

5 Answers5

17

Summary: undocumented HashTable is a bit faster (at least in version 9) both in storage and in retrieval than DownValues.

DownValues

list = RandomInteger[{-10^9, 10^9}, 10^6];
ret  = RandomInteger[{-10^9, 10^9}, 10^6];

isGood[___] = False;
Scan[(isGood[#] = True) &, list]; // AbsoluteTiming

(* ==> {3.240005, Null} *)

ClearAll[isGood];
isGood[___] = False;
Do[isGood@i = True, {i, list}]; // AbsoluteTiming

(* ==> {2.350003, Null} *)

On my computer, this takes less then 3 seconds for a million integers if Do is used instead of Scan. Isn't this fast enough?

Retrieval of the results is also quite quick, and is almost independent whether Table or Map is used:

isGood /@ ret; // AbsoluteTiming

(* ==> {1.410002, Null} *)

Table[isGood@i, {i, ret}]; // AbsoluteTiming

(* ==> {1.450002, Null} *)

HashTable

Out of curiosity, I compared this to the undocumented HashTable (mentioned here) and got even better results. Note, that the hash table must be checked for existing value (as list might contain duplicates) otherwise HashTableAdd returns with error. Or it is even better to prefilter list by removing duplicates, but that is omitted here not to bias the comparison.

hash = System`Utilities`HashTable[];
Do[If[
    Not@System`Utilities`HashTableContainsQ[hash, i], 
    System`Utilities`HashTableAdd[hash, i, True] (* last argument can be omitted *)
    ], {i, list}]; // AbsoluteTiming

(* ==> {2.010003, Null} *)

System`Utilities`HashTableContainsQ[hash, #] & /@ ret; // AbsoluteTiming

(* ==> {1.340002, Null} *)

Table[System`Utilities`HashTableContainsQ[hash, i], {i, ret}]; // AbsoluteTiming

(* ==> {1.050001, Null} *)

We see that both storage and retrieval are a bit faster.

Ajasja
  • 13,634
  • 2
  • 46
  • 104
  • I hope you don't mind my edit: I did justice toHashTable. I wonder why it was so slow (~27 sec) in your original example. Table is obviously faster than Scan & With, but I also assume that in version 8 HashTable did not throw an error when encountering a duplicate hash, but silently failed with a slowdown. – István Zachar Jan 11 '14 at 14:30
  • I also expect version 10 associations to be at the performance level of HashTable, but that will need another post. – István Zachar Jan 11 '14 at 14:32
12

Are you sure you want to use UpValues? You can use Dispatch which is pretty fast when generating the lookup table and is equally fast when accessing values:

n = 6;
list = RandomInteger[{0, 10^(n + 1)}, {10^n}];

AbsoluteTiming[disp = Dispatch@Thread[list -> True];]
{1.6220927, Null}
Remove[isGood];
AbsoluteTiming[isGood[___] = False; Scan[(isGood[#] = True) &, list]]
{3.5982058, Null}

Query values:

test = RandomInteger[{0, 10^(n + 1)}, {10^n}];

AbsoluteTiming[Count[test /. disp, True]]
{1.9151096, 94844}
AbsoluteTiming[Count[isGood /@ test, True]]
{1.7601007, 94844}
István Zachar
  • 47,032
  • 20
  • 143
  • 291
7

This seems to be faster to define downvalues

list = RandomInteger[{-100000000, 100000000}, 1000000];
DownValues[isGood] = 
   HoldPattern[isGood[#]] :> True & /@ list; // AbsoluteTiming
isGood[___] = False;
Rojo
  • 42,601
  • 7
  • 96
  • 188
6

My solution is ugly, but task-specific. It builds a bitmap out of machine-sized integers in imperative fashion and uses Compile. This works reasonably in memory usage for ranges that have at least couple percent of True values.

A million integers:

n = 6;
list = RandomInteger[{0, 10^(n + 1)}, {10^n}];

Function itself:

<< Developer`

isGood = With[
     {bits = Floor[Log[2, $MaxMachineInteger + 1]],
      min = Min@list,
      max = Max@list},
     With[
      {bv = Compile[{{list, _Integer, 1}},
          Module[
           {bitvec = Table[0, {(max - min + bits - 1)~Quotient~bits}]},
           Scan[(bitvec[[(# - min)~Quotient~bits + 1]] += 
               2^((# - min)~Mod~bits)) &, Union@list];
           bitvec
           ],
          CompilationOptions -> {"ExpressionOptimization" -> True, 
            "InlineExternalDefinitions" -> True}
          ]@list},
      Compile[{{x, _Integer}},
       min <= x <= max && 
          bv[[(x - min)~Quotient~bits + 1]]~BitAnd~(2^((x - min)~Mod~bits)) != 0,
       CompilationOptions -> {"ExpressionOptimization" -> True, 
         "InlineExternalDefinitions" -> True}
       ]
      ]
     ]; // AbsoluteTiming // First

(* 0.347843 *)

Usage:

isGood /@ list // AbsoluteTiming // First

(* 0.590347 *)

Originally I wanted to solve this problem using bitwise operations of arbitrarily large integers, but the issue with that is that functional programming with bigints has large return value overheads - even when just some individual bits are twiddled.

kirma
  • 19,056
  • 1
  • 51
  • 93
  • 2
    A few possible tweaks: use bits = BitLength[$MaxMachineInteger] to compute the maximum bit length, and you can use bitvec = ConstantArray[0, Quotient[max - min + bits - 1, bits]] for initialization. – J. M.'s missing motivation Oct 27 '12 at 01:13
3

What you aim at can be done quicker now with Association. I compare only to the Do method, which is also reasonably fast:

list = RandomInteger[{-10^9, 10^9}, 10^6];
ret = RandomInteger[{-10^9, 10^9}, 10^6];
ClearAll[isGood];
isGood[___] = False;
Do[isGood@i = True, {i, list}]; // AbsoluteTiming
isAlsoGood = With[{data = Union[list]},
    AssociationThread[data -> ConstantArray[True, Length[data]]]
]; // AbsoluteTiming

(* 1.43326 *)
(* 0.673688 *)

And looking up values is even four times faster:

aa = Map[isGood, ret]; // AbsoluteTiming // First
bb = Lookup[isAlsoGood, ret, False]; // AbsoluteTiming // First
aa == bb

(* 0.959888 *)
(* 0.2606 *)
(* True *)

Even deallocation is considerably faster:

ClearAll[isGood]; // AbsoluteTiming // First
ClearAll[isAlsoGood]; // AbsoluteTiming // First

(* 0.656773 *)
(* 0.279496 *)
Henrik Schumacher
  • 106,770
  • 7
  • 179
  • 309