
I am trying to tally the lengths of consecutive runs of values above a certain threshold in a list.

For instance:

data = {55.5, 65.3, 100.3, 100.1, 100.5, 98.7, 75.2, 101.3, 102.1}

The totals for this data set would be {3, 2} if the tally criterion were all values greater than 100, since the qualifying runs are {100.3, 100.1, 100.5} and {101.3, 102.1}.

While working on this problem, I found the following code:

a = RandomInteger[1, {100}];
Cases[Split[a], {1, ___}] // Tally // Sort

This is very close to what I am trying to do: the code tallies the runs of consecutive 1s and sorts the result.
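
For example, on a short hand-written list in place of the random one (a small illustration of that snippet):

a = {1, 0, 1, 1, 0, 1, 1, 1};
Cases[Split[a], {1, ___}] // Tally // Sort
(* {{{1}, 1}, {{1, 1}, 1}, {{1, 1, 1}, 1}} *)

Each pair is {run, count}: runs of 1s of lengths 1, 2, and 3, each occurring once here.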

However, I can't seem to modify the decision criterion so that it tallies only values above a certain threshold.

How might I go about this?

dixontw

7 Answers


Edit: there is an ambiguity in the question (and my answer) as to whether or not you want to include single-element "runs" in your tally.


If you merely want to count the elements in each run and not also return the elements, I propose:

Cases[Split @ UnitStep[100 - data], x : {0, __} :> Length@x]
{3, 2}
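
To see why this works, here are the intermediate steps on the question's data (a walk-through of the expression above, nothing new):

data = {55.5, 65.3, 100.3, 100.1, 100.5, 98.7, 75.2, 101.3, 102.1};
UnitStep[100 - data]  (* {1, 1, 0, 0, 0, 1, 1, 0, 0} *)
Split[%]              (* {{1, 1}, {0, 0, 0}, {1, 1}, {0, 0}} *)

UnitStep marks every value above 100 with 0 and everything else with 1, so the runs of 0 are exactly the runs above the threshold, and the pattern {0, __} keeps only those of length two or more.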

If you (also) want to return the elements, you could use this:

Module[{us = UnitStep[100 - data], tally},
 tally = Cases[Split[us], x : {0, ___} :> Length@x];
 dynamicPartition[Pick[data, us, 0], tally]
]
{{100.3, 100.1, 100.5}, {101.3, 102.1}}

You'll need my dynamicPartition function described here.
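
dynamicPartition itself isn't reproduced in this post. As a stand-in, a minimal sketch that splits a list into consecutive sublists of the given lengths might look like this (an assumption to keep the example self-contained, not Mr.Wizard's actual optimized implementation):

dynamicPartition[list_, lengths_] :=
 Module[{i = 0},
  (* take the next run of each requested length in turn *)
  Take[list, {i + 1, i += #}] & /@ lengths]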


Timings

Comparing my proposal to rm-rf's:

data = RandomReal[200, 750000];

Cases[Split@UnitStep[100 - data], x : {0, __} :> Length@x] // Length // Timing
{0.109, 93479}

Length /@ Split[data, # > 100 && #2 > 100 &] /. 1 -> Sequence[] // Length // Timing
{0.671, 93479}

Select[Total[Split@Clip[data, {100, 100}, {0, 1}], {2}], # > 1 &] // Length // Timing
{0.655, 93479}

Mr.Wizard

One direct way is to use the second argument of Split and keep only sublists of length > 1:

Length /@ Split[data, # > 100 && #2 > 100 &] /. 1 -> Sequence[]

or by clipping the values before splitting:

Select[Total[Split@Clip[data, {100, 100}, {0, 1}], {2}], # > 1 &]
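
Again on the question's data, the intermediate values look like this (a walk-through of the second form):

Clip[data, {100, 100}, {0, 1}]  (* {0, 0, 1, 1, 1, 0, 0, 1, 1} *)
Total[Split@%, {2}]             (* {0, 3, 0, 2} *)

Clip sends values below 100 to 0 and values above 100 to 1, Total sums each run, and the final Select drops both the zero runs and the singletons.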
rm -rf

Not sure about the efficiency, but how about defining a test function:

testFunction = (# > 100 &)

and then

Length /@ Select[Split[testFunction /@ data], First]

or if you don't want singletons

Length /@ Cases[Split[testFunction /@ data], {True, __}]
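
As a quick check, on the question's sample data (which happens to contain no singleton runs, so the two forms agree):

data = {55.5, 65.3, 100.3, 100.1, 100.5, 98.7, 75.2, 101.3, 102.1};
Length /@ Select[Split[testFunction /@ data], First]      (* {3, 2} *)
Length /@ Cases[Split[testFunction /@ data], {True, __}]  (* {3, 2} *)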
MikeLimaOscar
  • This does not give output that matches mine or rm-rf's; please check it. – Mr.Wizard Aug 13 '13 at 14:30
  • Okay, I'm not sure about the meaning of "consecutive occurrences" -- yours takes one interpretation, mine takes another. Clarification is needed. – Mr.Wizard Aug 13 '13 at 14:36
  • No, you're correct, mine reports "runs" of length one which probably aren't wanted. – MikeLimaOscar Aug 13 '13 at 14:39
  • I'm not so sure; the code in the question includes ___ (BlankNullSequence) which would match your interpretation. It's not clear IMHO. By the way, Thread should be at least slightly faster: Length /@ Select[Split[Thread[data > 100]], First] – Mr.Wizard Aug 13 '13 at 14:41
  • I've added an alternative that excludes singletons, using Cases, but still no Thread -- using the test function means it could be written as a generic RunLengths[data,test] function. – MikeLimaOscar Aug 13 '13 at 14:51

This answer is general and fast: it allows any predicate (not just ones based on a numerical threshold) and uses the fact that the split list alternates between runs of elements that satisfy the predicate and runs that don't:

runLengths[list_, pred_] :=
 With[{bits = Boole[pred@#] & /@ list},
  Length /@ Part[
    Split[bits],
    If[bits[[1]] == 0,
     Span[2, All, 2],
     Span[1, All, 2]]]]
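
For instance, on the question's sample data:

data = {55.5, 65.3, 100.3, 100.1, 100.5, 98.7, 75.2, 101.3, 102.1};
runLengths[data, # > 100 &]  (* {3, 2} *)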

EDIT to add: Some timings:

bigData = RandomReal[{0, 200}, {10000}];

Do[runLengths[bigData, # >= 100 &], {100}] // Timing

{0.218401, Null}

For comparison, here's Mr. Wizard's approach on my machine:

Do[
  Cases[Split@UnitStep[100 - bigData], x : {0, __} :> Length@x],
  {100}] // Timing

{0.249602, Null}

So it's a little faster, and a little more flexible.

EDIT again to add:

I'm assuming we want to keep runs of length 1; those can always be deleted later in any event.

EDIT the third: getting rid of the 1s at the end adds very little overhead:

Do[DeleteCases[runLengths[bigData, # >= 100 &], 1], {100}] // Timing

{0.234002, Null}

Pillsy
  • Strange, using your own code my method is much faster. I am getting {0.78, Null} and {0.156, Null} for the two outputs. Version differences I guess? Also, your method includes "singletons" which may or may not be desired; can your solution be adapted to eliminate these if required? – Mr.Wizard Aug 13 '13 at 14:59
  • Could be version or even architecture variations. FWIW, I'm using Mathematica 9.0.1. – Pillsy Aug 13 '13 at 15:04
  • I have no reason to doubt your timings and you addressed the one concern. +1 – Mr.Wizard Aug 13 '13 at 15:08

As Mr.Wizard pointed out in the comments, my original code tested for elements greater than or equal to 100; a corrected version is in the edit below.

I interpreted the question to include singletons and came up with the following, which I think performs decently:

Length /@ DeleteCases[Split[UnitStep[data - 100]], {0 ..}]

Edit

The following tests for elements greater than 100

Length /@ DeleteCases[Split[UnitStep[100 - data]], {1 ..}]
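
On the question's sample data the surviving runs and the final tally are (a quick check):

DeleteCases[Split[UnitStep[100 - data]], {1 ..}]  (* {{0, 0, 0}, {0, 0}} *)
Length /@ %                                       (* {3, 2} *)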

Thanks to Mr.Wizard for pointing out mistakes and performance issues in this post!

sebhofer
  • I've got more critique, but please know I wouldn't bother except that I hope it helps. You don't need to use 1 - UnitStep[. . . as you can simply delete the 1s instead. Also, your second method is inefficient as you pattern-match twice; it would be better to either use Cases as I did, or Length /@ DeleteCases . . . as you did in your first method. If I were you I'd delete the second form of each one. +1 for the corrected versions. – Mr.Wizard Aug 14 '13 at 09:32
  • @Mr.Wizard True again. Do you just want to take over this answer? :) Editing... – sebhofer Aug 14 '13 at 09:49
  • No, it looks perfect now. :^) – Mr.Wizard Aug 14 '13 at 10:01

Using SequenceSplit, which was introduced in Version 11.3:

Length /@ SequenceSplit[data, {x_ /; x <= 100}]

{3, 2}

{Length /@ #, #} &[SequenceSplit[data, {x_ /; x <= 100}]]

{{3, 2}, {{100.3, 100.1, 100.5}, {101.3, 102.1}}}

eldo

Another way is to use ConsecutiveQ to pick out runs of consecutive positions, as follows:

Length /@ SequenceCases[Position[data, x_ /; x > 100], {__}?ConsecutiveQ]

{3, 2}

{Length@#, Extract[data, #]} & /@ SequenceCases[Position[data, x_ /; x > 100], {__}?ConsecutiveQ]

{{3, {100.3, 100.1, 100.5}}, {2, {101.3, 102.1}}}

E. Chan-López