A simple solution uses Partition and Count:
$data = {1,0,2,0,3,0,Null,0,2,4,5,6,7,0};
$runLength = 4;
$tolerance = 2;
Cases[Partition[$data, $runLength, 1], l_ /; Count[l, 0|Null] <= $tolerance]
{{1,0,2,0},{0,2,0,3},{2,0,3,0},{Null,0,2,4},{0,2,4,5},{2,4,5,6},{4,5,6,7},{5,6,7,0}}
Unfortunately, this solution does not scale well. It uses a lot of memory by creating and scanning a list containing every run. For large input lists, this is not feasible. Furthermore, it performs a number of operations proportional to the product of the input and run lengths. The following function performs better, using far less memory and with performance linearly proportional to the input size:
(* positions at which a run of runLength elements contains at most tolerance zero/Null values *)
goodRunPositions[data_, runLength_, tolerance_] :=
  data /. Null -> 0 //
  Sign //
  Abs //
  1 - # & //
  Accumulate //
  Prepend[#, 0] & //
  #[[runLength + 1 ;;]] - #[[;; -runLength - 1]] & //
  Position[#, n_ /; n <= tolerance] & //
  Flatten
goodRunPositions[$data, $runLength, $tolerance]
{1,2,3,7,8,9,10,11}
A helper function can be used to extract a run at any given position:
extractRun[data_, runLength_, position_] :=
data[[position ;; position+runLength-1]]
extractRun[$data, $runLength, 7]
{Null,0,2,4}
... or to randomly select a run ...
randomGoodRun[data_, runLength_, tolerance_] :=
  extractRun[
    data
    , runLength
    , RandomChoice @ goodRunPositions[data, runLength, tolerance]
  ]
randomGoodRun[$data, $runLength, $tolerance]
{2,4,5,6}
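As a quick sanity check (a sketch on the same toy data, not shown above), mapping extractRun over the reported positions should reproduce the Cases/Partition result from the top of the post:
extractRun[$data, $runLength, #] & /@
    goodRunPositions[$data, $runLength, $tolerance] ===
  Cases[Partition[$data, $runLength, 1], l_ /; Count[l, 0 | Null] <= $tolerance]
(* evaluates to True for the toy data above *)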
Performance is adequate for larger lists:
randomData[w_, n_] :=
RandomChoice[{w, w, 100, 100, 100, 100} -> {Null, 0, 1, 2, 3, 4}, n]
SeedRandom[923874]
$10K = randomData[20, 10000];
$1M = randomData[25, 1000000];
$10M = randomData[30, 10000000];
goodRunPositions[$10K, 250, 10] // Timing // Column
0.016 {7425,7426,7427,7428,7429,7430,7431,7432,7433,7434,7435,7436,7437,7438,7439,7440,7441,7442,7443,7446,7447,7448,7449,7450,7451,7452,7453,7454,7455,7456,7457,7458,7459,7460,7461,7462,7463,7464,7465,7466,7467,7468,7469,7470,7471,7472,7473,7474,7475,7476,7477,7478,7479,7480,7481,7482,7483,7484,7485,7486,7487,7488,7489,7490,7491,7492,7493,7494,7495,7496,7497,7498}
goodRunPositions[$1M, 250, 10] // Timing // Column
0.811 {5879,5880,5881,5882,5883,5884,5885,5886,5887,5888,5889,5890,5891,5892,5893,5894,5895,5896,5897,5898,5899,5900,5901,5902,5903,5904,5905,5906,5907,5908,5909,5910,5911,5912,896682,896683,896684}
goodRunPositions[$10M, 250, 10] // Timing // Column
8.221 {3399940,3399941,3399942,3399943,3399944,3399945,3399946,3399947,3399948,3399949,3399950,3399951,3399952,3399953,3399954,3399955,3399956,3399957,3399958,3399959,3399960,3399961,3399962,3399963,3399964,3399965,3399966,3399967,3399968,7027173}
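The timings also grow roughly in proportion to the input size, as claimed. As a rough sketch of the memory argument, one could compare the byte count of the input against that of the intermediate list of runs the Partition-based approach has to materialize:
(* the Partition approach builds every overlapping run before filtering, so its
   intermediate list alone occupies roughly runLength times the input's memory *)
ByteCount[$10K]
ByteCount[Partition[$10K, 250, 1]]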
How Does It Work?
Here is our toy data set:
$data
{1,0,2,0,3,0,Null,0,2,4,5,6,7,0}
The first step is to identify nulls and zeros. So that we can use the fast built-in tensor operations, we replace the nulls with zeros first:
$data /. Null -> 0 // Sign // Abs // 1 - # &
{0,1,0,1,0,1,1,1,0,0,0,0,0,1}
Every null or zero value has been assigned a score of one and other values are scored zero. We can now generate a running total of the number of zero values encountered:
% // Accumulate
{0,1,1,2,2,3,4,5,5,5,5,5,5,6}
To determine the number of zeros in any given run, we simply take the difference between the running totals at the end of the run and just before its start. We'll need to prepend a zero to the list so that the first run has a starting value:
Prepend[%, 0] //
#[[$runLength+1;;]] - #[[;;-$runLength-1]] &
{2,2,2,3,3,3,2,1,0,0,1}
Now it is just a simple matter of finding the positions of the elements where the zero count does not exceed the tolerance:
Position[%, n_ /; n <= $tolerance] // Flatten
{1,2,3,7,8,9,10,11}
We can use extractRun to inspect the runs themselves:
extractRun[$data, $runLength, #] & /@ % // Column
{1,0,2,0}
{0,2,0,3}
{2,0,3,0}
{Null,0,2,4}
{0,2,4,5}
{2,4,5,6}
{4,5,6,7}
{5,6,7,0}
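As a final spot check (a sketch, not part of the walk-through above), counting the zero-like entries directly in each of these runs should show that none exceeds the tolerance:
(* count Null/0 entries in each qualifying run; every count is <= $tolerance *)
Count[extractRun[$data, $runLength, #], 0 | Null] & /@
  goodRunPositions[$data, $runLength, $tolerance]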