12

I have a list of around 300 elements. I want to sample subsets of length 25 such that my samples are all distinct. My first inclination was to use something like RandomSample[Subsets[list, {25}], 1000], but the number of subsets of length 25 out of a 300-element set is way too big for the computer to deal with. Does anyone have a nice way to do this?

Mr.Wizard
Wintermute

6 Answers

14

This question may be a duplicate but for the time being:

list = Range[300];

The number of subsets length 25:

n = Binomial[300, 25]
1953265141442868389822364184842211512

Five samples:

samp = RandomInteger[{1, n}, 5]
{1097179597483122074395819626389736050,
 1278400886908268917844987164926797363,
 1855898035549513136165016617586671669,
 1005956584417012779260052361741534263,
 1845054078551378518016127833496347335}

Your subsets:

 Subsets[list, {25}, {#}][[1]] & /@ samp
{{10,15,57,64,65,73,82,115,120,130,133,160,161,164,178,192,196,218,223,235,238,240,267,271,290},
{12,54,58,81,90,91,115,130,146,181,189,204,205,218,222,230,233,234,235,254,256,268,281,283,284},
{33,42,45,65,78,81,85,118,151,167,172,174,202,203,207,208,211,212,223,239,246,251,254,262,267},
{9,12,35,69,72,77,79,109,113,116,141,144,158,163,195,202,221,228,230,231,254,259,267,280,292},
{32,39,49,53,62,102,104,132,135,159,164,167,169,172,191,211,244,245,253,263,265,271,282,283,286}}

Be aware that RandomInteger could produce duplicate samples; for the example given, however, that is extremely unlikely. You can produce more than you need and use DeleteDuplicates and Take as needed.
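To put a number on "extremely unlikely" and to top up after removing duplicates, something like the following sketch could be used. The names k, m, collisionEstimate, and ranks are introduced here for illustration, and the estimate is the usual birthday-problem approximation m(m-1)/(2n), not an exact probability.

```mathematica
list = Range[300];
k = 25;      (* subset length *)
m = 1000;    (* number of distinct subsets wanted; illustrative value *)
n = Binomial[Length[list], k];

(* Birthday-problem approximation to the chance that any rank repeats *)
collisionEstimate = N[m (m - 1)/(2 n)];

(* Draw ranks, drop repeats, and top up until m distinct ranks remain *)
ranks = {};
While[Length[ranks] < m,
  ranks = DeleteDuplicates@
    Join[ranks, RandomInteger[{1, n}, m - Length[ranks]]]];

(* Unrank each integer into its subset, as in the code above *)
subsets = Subsets[list, {k}, {#}][[1]] & /@ ranks;
```

Because the ranks are deduplicated before unranking, the subsets are guaranteed distinct, and each one comes out in the order of the input list.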


I think kguler's answer is the better method, and I wish I had had the insight to realize it myself. However, there is still some value in the method above: referring to subsets by a single number can make them easier to handle.

  • They take less space.
  • Comparison (e.g. for removing duplicates) requires a single numeric comparison rather than a list comparison.
  • A given subset is independent of the input list; only length of input and subset matter.

One can "unrank" them at any time using the third parameter of Subsets as shown above.
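As a small illustration of the last bullet, the same rank picks out the same positions in any input list of the same length. The lists letters and numbers and the rank 3 are made up for this example.

```mathematica
letters = {a, b, c, d, e};
numbers = {10, 20, 30, 40, 50};
rank = 3;  (* position within the ordered list of all 2-element subsets *)

(* Subsets lists 2-subsets as {a,b}, {a,c}, {a,d}, ..., so rank 3 is positions 1 and 4 *)
fromLetters = Subsets[letters, {2}, {rank}][[1]]   (* {a, d} *)
fromNumbers = Subsets[numbers, {2}, {rank}][[1]]   (* {10, 40} *)
```

So a stored rank only needs the list length and subset length to be meaningful; the list contents can be supplied later.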

Mr.Wizard
14

I think RandomSample already does exactly what you need:

RandomSample[Range[300], 25]
(* {292, 257, 36, 83, 259, 245, 280, 270, 24, 236, 186, 100, 300, 240, 
    176, 295, 42, 105, 97, 106, 60, 114, 63, 25, 253} *)

Table[RandomSample[Range[300], 25], {5}] 
{{221, 54, 124, 64, 168, 91, 149, 25, 142, 87, 184, 288, 93, 105, 95, 195, 264, 180, 300, 241, 185, 126, 237, 52, 160}, 
 {16, 59, 145, 181,  109, 163, 99, 81, 24, 257, 2, 293, 298, 78, 207, 162, 19, 277, 133, 28, 23, 254, 237, 93, 242}, 
 {210, 62, 76, 122, 196, 90, 84, 117, 256, 216, 95, 197, 107, 260, 78, 241, 64, 173, 169, 224, 160, 265, 163, 39, 261},
 {48, 146, 103, 10, 259, 142, 197, 77, 39, 260, 75, 128, 137, 181, 99, 256, 82, 127, 294, 117, 250, 76, 276, 196, 11}, 
 {193, 275, 48, 299, 42, 159, 75, 251, 241, 179, 165, 169, 87, 224, 288, 133, 94, 72, 274, 109, 296, 264, 82, 51, 124}}

Dealing with possible duplicates:

rsF = Module[{t = Table[RandomSample[#, #2], {#3}]},
             While[Length[DeleteDuplicates@t] < #3, 
                   t = Join[DeleteDuplicates@t, {RandomSample[#, #2]}]]; t] &;

rsF[Range[300], 10, 5] // Grid
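One caveat when comparing raw RandomSample outputs: two draws that contain the same elements in a different order are different lists, so DeleteDuplicates keeps both. A possible sketch that avoids this samples positions, sorts them into one canonical form before comparing, and then maps them back to elements in input order. The helper name distinctSamples is hypothetical, and the sketch assumes the elements of the input list are themselves distinct.

```mathematica
(* distinctSamples is a hypothetical helper, not part of the answer above *)
distinctSamples[list_List, k_Integer, m_Integer] /;
   m <= Binomial[Length[list], k] :=
 Module[{pos = {}},
  While[Length[pos] < m,
   (* sample positions and sort them so each subset has one canonical form *)
   pos = DeleteDuplicates@
     Join[pos,
      Table[Sort@RandomSample[Range[Length[list]], k], {m - Length[pos]}]]];
  (* map positions back to elements, preserving the input order *)
  list[[#]] & /@ pos]

distinctSamples[Range[300], 10, 5] // Grid
```

The condition on m guards against asking for more distinct subsets than exist, which would otherwise loop forever.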

kglr
  • Why did you delete this? – Mr.Wizard Feb 19 '15 at 20:21
  • @Mr.Wizard, wasn't sure if i got the OP's requirements right. – kglr Feb 19 '15 at 20:36
  • To match the output of Subsets, if desired, the elements should be given in the order of the input list. And you must make sure there are no duplicate samples, which here is slightly more complex than with my method. However otherwise I believe this is correct, and very clean. – Mr.Wizard Feb 19 '15 at 20:39
  • Thank you @Mr.Wizard. It does get more complicated to account for duplicates. – kglr Feb 19 '15 at 20:44
5

Combinatorica has a function for doing exactly this:

<< "Combinatorica`"
RandomKSubset[Range@300, 25] & /@ Range@5 // Grid

You may get dups, though. Use DeleteDuplicates[] if you consider it necessary.

Dr. belisarius
3

Another approach: Generate a bunch of random samples, cull any duplicates, then take as many as you need.

 (DeleteDuplicates@Table[ Sort@RandomSample[list, 25] , {2000}] )[[;; 1000]]

A variant:

 RandomSample[ Union@Table[ Sort@RandomSample[list, 25] , {2000}] , 1000 ]

This is basically the same, but there may be some performance difference.
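One way to check the performance difference on a given machine is to time both variants with AbsoluteTiming. This is only a measuring sketch; the variable names t1, r1, t2, and r2 are introduced here, and list is assumed to be Range[300] as in the question.

```mathematica
list = Range[300];

(* Variant 1: keep first-seen order, then take the first 1000 *)
{t1, r1} = AbsoluteTiming[
   (DeleteDuplicates@Table[Sort@RandomSample[list, 25], {2000}])[[;; 1000]]];

(* Variant 2: Union sorts the candidates, so RandomSample re-randomizes them *)
{t2, r2} = AbsoluteTiming[
   RandomSample[Union@Table[Sort@RandomSample[list, 25], {2000}], 1000]];

{t1, t2}  (* elapsed seconds for each variant *)
```

Single timings are noisy, so repeating the measurement a few times gives a fairer comparison.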

george2079
1

Here's one way:

randomSubset[list_] := 
 DeleteCases[Map[If[RandomReal[] <= 1/2, #, Null] &, list], Null]

The random subset has the property that each member of list has equal probability of appearing or not appearing in the returned subset.
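The same coin-flip selection can also be written without the Null placeholder, offered here only as a sketch. The name randomSubsetPick is made up for this example; it draws one 0/1 flag per element and keeps the flagged elements. Like the original, it returns subsets of varying length rather than a fixed size.

```mathematica
(* randomSubsetPick is a hypothetical name for this variant *)
randomSubsetPick[list_List] :=
 Pick[list, RandomInteger[1, Length[list]], 1]  (* keep elements whose flag is 1 *)

sub = randomSubsetPick[Range[20]]
```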

Mr.Wizard
  • The question asked for a subset of a specific length but I appreciate the creativity. You can avoid the DeleteCases filtering by using what I call the "vanishing function" like this: randomSubset[list_] := Map[If[RandomReal[] <= 1/2, #, ## &[]] &, list]. In recent versions there is also Nothing that works only with lists, but I prefer the generality of ## &[]. – Mr.Wizard Apr 09 '20 at 09:28
  • Thanks, I can use this info! – Kerry M. Soileau Apr 10 '20 at 11:10
0

I interpret the original question to mean that no element of any subset can appear in any other subset (much like RandomKSubsets). Here's an inelegant but workable approach. Basically, select numElements from myList and remove them from myList, and repeat numSets times. That way, you will never get any duplicates, i.e., elements that appear in two or more subsets.

myRandomKSubsets[myList_, numElements_, numSets_] := 
 Module[{temp, mynewList, results},
  (* keep results local so that calls do not touch a global symbol *)
  results = {};
  mynewList = myList;
  Do[
   results = 
    Union[results, temp = {RandomSample[mynewList, numElements]}];
   mynewList = Complement[Flatten@mynewList, Flatten@temp],
   {numSets}];
  results
  ];

myRandomKSubsets[Range[300], 10, 5] // MatrixForm

$ \left( \begin{array}{cccccccccc} 17 & 99 & 298 & 111 & 40 & 227 & 151 & 68 & 29 & 169 \\ 72 & 241 & 247 & 152 & 58 & 16 & 79 & 92 & 243 & 206 \\ 114 & 129 & 95 & 54 & 264 & 11 & 119 & 174 & 172 & 167 \\ 128 & 237 & 299 & 38 & 210 & 252 & 81 & 135 & 245 & 90 \\ 300 & 213 & 157 & 276 & 80 & 185 & 112 & 221 & 156 & 204 \\ \end{array} \right) $

One benefit of this approach is that you never need to create "too many" subsets (and thus use too much memory) then cull them afterwards to avoid duplicates.

Someone with more time can streamline this code, likely using NestList.
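Under this disjoint-subsets interpretation, a shorter route is to draw all the needed elements in one RandomSample call and split them into rows, along the lines of the Partition suggestion in the comments below. The name disjointSubsets is hypothetical.

```mathematica
(* disjointSubsets is a hypothetical name; requires k*m elements to be available *)
disjointSubsets[list_List, k_Integer, m_Integer] /; k m <= Length[list] :=
 Partition[RandomSample[list, k m], k]

res = disjointSubsets[Range[300], 10, 5];

(* no element appears in more than one subset *)
DuplicateFreeQ[Flatten[res]]
```

Since RandomSample draws without replacement, no culling step is needed afterwards.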

David G. Stork
  • With this interpretation why not RandomSample[Range@300, 10*5] ~Partition~ 10 -- a derivative of bobthechemist's deleted answer? – Mr.Wizard Feb 19 '15 at 23:05
  • @Mr.Wizard Sure. That works. – David G. Stork Feb 19 '15 at 23:07
  • Subsets can contain similar elements, I just don't want the same subset more than once. Functions like RandomChoice[] cause this problem. – Wintermute Feb 20 '15 at 00:21
  • Surely you want more than that. Otherwise you could choose the first $n-1$ elements as a "base" and then create $k$ "distinct" subsets by individually appending each of the next $k$ elements in your initial list to this base. – David G. Stork Feb 20 '15 at 01:19