I need to generate a huge number of k-combinations of a fairly large set of numbers (more than 100 elements), so I would like to avoid Subsets: it generates all subsets at once and requires too much memory. I would rather create the combinations one after another and keep only those that are valid for my purpose. It may be a naive question, but the only alternative I can find is a nested sequence of Do loops, which I do not consider very efficient. Is there a better way? Here is an example of the nested Do:

n=5;
Do[Do[Do[Print@{i,j,k}, {k, j+1, n}], {j, i+1, n-1}], {i, n-2}] 

(* {1,2,3}
   {1,2,4}
   {1,2,5}
   {1,3,4}
   {1,3,5}
   {1,4,5}
   {2,3,4}
   {2,3,5}
   {2,4,5}
   {3,4,5} *)

jkuczm
bobknight

1 Answer

The Subsets function takes an optional third argument, a standard sequence specification. Using this argument you can take subsets "in chunks".

For example, the following code gives the three 5-combinations at positions 90000 to 90002, out of all roughly 8.25 trillion (Binomial[1000, 5]) 5-combinations of a 1000-element set:

Subsets[Range[1000], {5}, {90000, 90002}]
(* {{1, 2, 3, 98, 845}, {1, 2, 3, 98, 846}, {1, 2, 3, 98, 847}} *)
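
To connect this with the question's goal of keeping only valid combinations, here is a minimal sketch of a chunked filter built on the third argument. The names validQ, chunkLength, total and valid, and the "sum equals 10" validity test, are illustrative assumptions and not part of the original question; Binomial gives the total count, so the last chunk can be clipped to the valid range.

```mathematica
n = 6; k = 3; chunkLength = 8;
total = Binomial[n, k];   (* 20 combinations in all *)

(* Hypothetical validity test: keep combinations whose elements sum to 10. *)
validQ[c_List] := Total[c] == 10

valid = {};
Do[
  (* Min clips the last chunk to the valid range, so Subsets::take is never issued. *)
  valid = Join[
    valid,
    Select[
      Subsets[Range[n], {k}, {start, Min[start + chunkLength - 1, total]}],
      validQ
    ]
  ],
  {start, 1, total, chunkLength}
];

valid
(* {{1, 3, 6}, {1, 4, 5}, {2, 3, 5}} *)
```

Only one chunk of chunkLength combinations (plus the accepted ones) is held in memory at a time, so the chunk length is the knob that trades memory for the number of Subsets calls.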

Lazy subsets

Using the undocumented Streaming` module, introduced in v10.1, you can implement a lazy list of subsets, i.e. an object that generally behaves like an ordinary list but never stores the whole list in memory. Instead, it generates subsets in "chunks" of the desired length as they are needed.

Here is a very simple version, based on LazyTuples.

Needs["Streaming`"]

ClearAll[lazySubsets]
lazySubsets[list_, nspec_:All, chunkSize:_Integer?Positive:100000] :=
    Module[{ctr = 0, active = False},
        (* Test whether given arguments are valid for Subsets. *)
        Check[
            Quiet[Subsets[list, nspec, {1}], Subsets::take],
            Return[$Failed, Module]
        ];
        LazyListCreate[
            IteratorCreate[
                ListIterator,
                (active = True) &,
                With[
                    {taken =
                         Quiet[
                             Subsets[list, nspec, {ctr + 1, ctr + chunkSize}],
                             Subsets::take
                         ]
                    }
                    ,
                    ctr += Length[taken];
                    taken
                ] &,
                TrueQ[active] &,
                Remove[active, ctr] &
            ]
            ,
            chunkSize
        ]
    ]

Example of usage:

subs = lazySubsets[Range[5], {3}]
(* « LazyList[{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, ...] » *)

You can iterate over subs as if it were an ordinary list:

Scan[Print, subs]
(* {1,2,3}
   {1,2,4}
   {1,2,5}
   {1,3,4}
   {1,3,5}
   {1,4,5}
   {2,3,4}
   {2,3,5}
   {2,4,5}
   {3,4,5} *)

You can Map a function and get another lazy list:

f /@ subs
(* « LazyList[f[{1, 2, 3}], f[{1, 2, 4}], f[{1, 2, 5}], f[{1, 3, 4}], f[{1, 3, 5}], ...] » *)

Get its Length or a certain Part:

subs // Length
(* 10 *)

subs[[5]]
(* {1, 3, 5} *)

The memory required to use this lazy list depends on the given chunkSize. Applying a function to an ordinary list of all 8 388 608 subsets of a 23-element set requires over a gigabyte of memory to store the whole list:

Scan[Identity, Subsets[Range[23]]] // MaxMemoryUsed
(* 1 194 005 632 *)

Applying a function to a lazy list that takes 10^5 subsets per chunk takes much more time, but uses only about 55 megabytes:

Scan[Identity, lazySubsets[Range[23], All, 10^5]] // MaxMemoryUsed
(* 55 351 288 *)

Taking 10^4 subsets per chunk uses only seven megabytes of memory:

Scan[Identity, lazySubsets[Range[23], All, 10^4]] // MaxMemoryUsed
(* 7 154 944 *)

Clean up cache after playing with Streaming`:

Scan[LazyListDestroy, LazyLists[]]

Scanning subsets chunks using only documented functions

If you want to apply some function to all k-combinations, but taken in chunks, something like the following function can be useful (this version takes some inspiration from belisarius's comment and is a bit more robust than my previous version):

ClearAll[scanSubsetsChunks]
scanSubsetsChunks[f_, data_, nspec_:All, chunkLength_Integer?Positive] :=
    Module[
        {
            i = chunkLength + 1, 
            getChunk = 
                Quiet[
                    Subsets[data, nspec, {#, # + chunkLength - 1}],
                    Subsets::take
                ]&,
            chunk
        },
        chunk = Check[getChunk[1], Return[$Failed, Module]];
        While[chunk =!= {},
            f[chunk];
            chunk = getChunk[i];
            i += chunkLength;
        ]
    ]

scanSubsetsChunks[Print, Range[5], {3}, 3]
(* {{1,2,3},{1,2,4},{1,2,5}}
   {{1,3,4},{1,3,5},{1,4,5}}
   {{2,3,4},{2,3,5},{2,4,5}}
   {{3,4,5}} *)
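
As a possible usage pattern, the chunk-wise scan can be combined with Sow and Reap to collect only the combinations that pass a test. This is a sketch that assumes scanSubsetsChunks is defined as above; keepValid, valid and the "sum equals 10" criterion are hypothetical names and choices made only for illustration.

```mathematica
(* Hypothetical per-chunk handler: Sow every combination in the chunk
   whose elements sum to 10 and discard the rest. *)
keepValid[chunk_List] := Scan[Sow, Select[chunk, Total[#] == 10 &]]

(* Reap gathers everything sown during the scan; the default {} in First
   covers the case where no combination passes the test. *)
valid = First[
  Last[Reap[scanSubsetsChunks[keepValid, Range[6], {3}, 8]]],
  {}
]
(* {{1, 3, 6}, {1, 4, 5}, {2, 3, 5}} *)
```

Here each chunk of 8 combinations is filtered and then dropped, so only the accepted combinations accumulate.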
jkuczm