A simple solution uses Partition and Count:
$data = {1,0,2,0,3,0,Null,0,2,4,5,6,7,0};
$runLength = 4;
$tolerance = 2;
Cases[Partition[$data, $runLength, 1], l_ /; Count[l, 0|Null] <= $tolerance]
{{1,0,2,0},{0,2,0,3},{2,0,3,0},{Null,0,2,4},{0,2,4,5},{2,4,5,6},{4,5,6,7},{5,6,7,0}}
Unfortunately, this solution does not scale well. It uses a lot of memory by creating and scanning a list containing every run. For large input lists, this is not feasible. Furthermore, it performs a number of operations proportional to the product of the input and run lengths. The following function performs better, using far less memory and with performance linearly proportional to the input size:
(* positions at which a run of runLength elements contains at most tolerance zero/Null values *)
goodRunPositions[data_, runLength_, tolerance_] :=
  data /. Null -> 0 //
  Sign //
  Abs //
  1 - # & //
  Accumulate //
  Prepend[#, 0] & //
  #[[runLength + 1 ;;]] - #[[;; -runLength - 1]] & //
  Position[#, n_ /; n <= tolerance] & //
  Flatten
goodRunPositions[$data, $runLength, $tolerance]
{1,2,3,7,8,9,10,11}
A helper function can be used to extract a run at any given position:
extractRun[data_, runLength_, position_] :=
data[[position ;; position+runLength-1]]
extractRun[$data, $runLength, 7]
{Null,0,2,4}
... or to randomly select a run ...
randomGoodRun[data_, runLength_, tolerance_] :=
  extractRun[
    data
    , runLength
    , RandomChoice @ goodRunPositions[data, runLength, tolerance]
  ]
randomGoodRun[$data, $runLength, $tolerance]
{2,4,5,6}
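As a quick sanity check (a sketch on the same toy data, not shown above), mapping extractRun over the reported positions should reproduce the Cases/Partition result from the top of the post:
extractRun[$data, $runLength, #] & /@
    goodRunPositions[$data, $runLength, $tolerance] ===
  Cases[Partition[$data, $runLength, 1], l_ /; Count[l, 0 | Null] <= $tolerance]
(* evaluates to True for the toy data above *)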
Performance is adequate for larger lists:
randomData[w_, n_] :=
RandomChoice[{w, w, 100, 100, 100, 100} -> {Null, 0, 1, 2, 3, 4}, n]
SeedRandom[923874]
$10K = randomData[20, 10000];
$1M = randomData[25, 1000000];
$10M = randomData[30, 10000000];
goodRunPositions[$10K, 250, 10] // Timing // Column
0.016 {7425,7426,7427,7428,7429,7430,7431,7432,7433,7434,7435,7436,7437,7438,7439,7440,7441,7442,7443,7446,7447,7448,7449,7450,7451,7452,7453,7454,7455,7456,7457,7458,7459,7460,7461,7462,7463,7464,7465,7466,7467,7468,7469,7470,7471,7472,7473,7474,7475,7476,7477,7478,7479,7480,7481,7482,7483,7484,7485,7486,7487,7488,7489,7490,7491,7492,7493,7494,7495,7496,7497,7498}
goodRunPositions[$1M, 250, 10] // Timing // Column
0.811 {5879,5880,5881,5882,5883,5884,5885,5886,5887,5888,5889,5890,5891,5892,5893,5894,5895,5896,5897,5898,5899,5900,5901,5902,5903,5904,5905,5906,5907,5908,5909,5910,5911,5912,896682,896683,896684}
goodRunPositions[$10M, 250, 10] // Timing // Column
8.221 {3399940,3399941,3399942,3399943,3399944,3399945,3399946,3399947,3399948,3399949,3399950,3399951,3399952,3399953,3399954,3399955,3399956,3399957,3399958,3399959,3399960,3399961,3399962,3399963,3399964,3399965,3399966,3399967,3399968,7027173}
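The timings also grow roughly in proportion to the input size, as claimed. As a rough sketch of the memory argument, one could compare the byte count of the input against that of the intermediate list of runs the Partition-based approach has to materialize:
(* the Partition approach builds every overlapping run before filtering, so its
   intermediate list alone occupies roughly runLength times the input's memory *)
ByteCount[$10K]
ByteCount[Partition[$10K, 250, 1]]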
How Does It Work?
Here is our toy data set:
$data
{1,0,2,0,3,0,Null,0,2,4,5,6,7,0}
The first step is to identify nulls and zeros. So that we can use the fast built-in tensor operations, we replace the nulls with zeros first:
$data /. Null -> 0 // Sign // Abs // 1 - # &
{0,1,0,1,0,1,1,1,0,0,0,0,0,1}
Every null or zero value has been assigned a score of one and other values are scored zero. We can now generate a running total of the number of zero values encountered:
% // Accumulate
{0,1,1,2,2,3,4,5,5,5,5,5,5,6}
To determine the number of zeros in any given run, we simply take the difference between the running totals at the end of the run and just before its start. We'll need to prepend a zero to the list so that the first run has a starting value:
Prepend[%, 0] //
#[[$runLength+1;;]] - #[[;;-$runLength-1]] &
{2,2,2,3,3,3,2,1,0,0,1}
Now it is just a simple matter of finding the positions of the elements where the zero count does not exceed the tolerance:
Position[%, n_ /; n <= $tolerance] // Flatten
{1,2,3,7,8,9,10,11}
We can use extractRun to inspect the runs themselves:
extractRun[$data, $runLength, #] & /@ % // Column
{1,0,2,0}
{0,2,0,3}
{2,0,3,0}
{Null,0,2,4}
{0,2,4,5}
{2,4,5,6}
{4,5,6,7}
{5,6,7,0}
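As a final spot check (a sketch, not part of the walk-through above), counting the zero-like entries directly in each of these runs should show that none exceeds the tolerance:
(* count Null/0 entries in each qualifying run; every count is <= $tolerance *)
Count[extractRun[$data, $runLength, #], 0 | Null] & /@
  goodRunPositions[$data, $runLength, $tolerance]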