
Often when I'm working with numerical data I want to select those points within a given range.

To do this I generally do something like:

Select[samp, min < # < max &]

or if I'm feeling lazy:

Select[samp, Between[{min, max}]]

But today I was thinking about how inefficient this is and decided to try to do better. My first attempt was the following:

selectBetween[
   data : {__?NumericQ},
   {min_, max_}
   ] :=
  Pick[data,
   UnitStep[data - max] + UnitStep[min - data], (* the selector is 0 exactly when min < x < max *)
   0
   ];

We can test this for correctness:

samp = RandomReal[1, 100000];
selectBetween[samp, {.001, .01}] == Select[samp, .001 < # < .01 &]

True

And then for performance:

selectBetween[samp, {.001, .01}] // RepeatedTiming // First

0.000544

Select[samp, .001 < # < .01 &] // RepeatedTiming // First

0.076

Select[samp, Between[{.001, .01}]] // RepeatedTiming // First

0.27

Already we're doing pretty well.

And if we want to extend it to other data, we can give it an operator form and a By form:

selectBetween[{min_, max_}][data_] :=
  selectBetween[data, {min, max}]; (* operator form *)
selectBetweenBy[
   data_,
   test_,
   {min_, max_}
   ] :=
  With[{
    testData =
     Replace[test,
      {
       n_Integer :> data[[All, n]], (* an integer test selects column n *)
       _ :> Map[test, data] (* any other test is mapped over the data *)
       }
      ]},
   Pick[data,
    UnitStep[testData - max] + UnitStep[min - testData],
    0
    ]
   ];
selectBetweenBy[test_, {min_, max_}][data_] :=
  selectBetweenBy[data, test, {min, max}] (* operator form *)

But can we do better? I'm sure what I did there can be tweaked and optimized, and given how often I have to do this and the scale of the data I often work with, it would be nice to have something more efficient.


As kglr notes, the question Finding all elements within a certain range in a sorted list has some very closely related answers (many of which operate on arbitrary unsorted data despite the title), and many of them are probably more efficient than what I've posted.

b3m2a1
  • Have you tried to search around? Will try tomorrow but it looks like a problem that should have already been posted. – Kuba Jan 25 '18 at 23:24
  • @Kuba I did look around, but not exhaustively. I'm guessing this is a duplicate, but couldn't find the exact phrasing I'd need to find it. – b3m2a1 Jan 25 '18 at 23:26
  • Are you interested in selecting repeatedly from the same sample? Or will the sample change every time? If selecting from the same sample, then there are optimizations which might improve timing by an order of magnitude. – Carl Woll Jan 25 '18 at 23:40
  • @CarlWoll I was imagining an arbitrary sample. So not necessarily selecting from the same sample. On the other hand I would be interested in hearing about any optimizations for selecting against the same sample. – b3m2a1 Jan 25 '18 at 23:43
  • @kglr Good link. But as it stands my data is not necessarily sorted (and I don't necessarily want to sort it for efficiency / reuse reasons). – b3m2a1 Jan 25 '18 at 23:44
  • @b3m2a1 Just because the question has the word "sorted" in it doesn't mean that all answers require sorted lists. It's a very good link and very related based on the answers that it has. – C. E. Jan 26 '18 at 07:09
  • @C.E. I actually looked through and saw that many of the solutions there handle this case nicely. As is, I'm not sure if it's worth closing this one as a dupe (because Leonid's first binary-search solution there obviously requires sorted data), but certainly many of the answers there will do for this. – b3m2a1 Jan 26 '18 at 07:12

2 Answers


Varying samples

(Updated to include internal function)

For varying samples, I think the best non-compiled method is to use Pick as you did. You can speed up the selector list slightly by using Subtract, and you can reduce the number of arithmetical operations by 1 as follows:

sb2[samp_, {min_, max_}] := Pick[
    samp,
    UnitStep[Subtract[samp, min] Subtract[samp, max]],
    0
]

Compared to your selectBetween, sb2 performs 2 subtractions, 1 multiplication, and 1 UnitStep call, while selectBetween performs 2 subtractions, 1 addition, and 2 UnitStep calls.

Comparison:

samp = RandomReal[10, 10^6];

r1 = selectBetween[samp, {.001, .01}]; //RepeatedTiming
r2 = sb2[samp, {.001, .01}]; //RepeatedTiming

r1===r2

{0.00690, Null}

{0.0059, Null}

True

Slightly faster.

Varying sample addendum 1

Here is a different selector that is slightly faster:

sb3[samp_, {min_, max_}] := Pick[
    samp,
    Unitize @ Clip[samp, {min, max}, {0,0}],
    1
]

Comparison:

samp = RandomReal[10, 10^6];

r1 = selectBetween[samp, {.001, .01}]; //RepeatedTiming
r2 = sb3[samp, {.001, .01}]; //RepeatedTiming

r1===r2

{0.0069, Null}

{0.0055, Null}

True

Note that the selector would need to be modified if the interval could contain 0.

Varying sample addendum 2

If you don't mind using an undocumented internal function, you could try:

samp = RandomReal[10, 10^6];

r1 = selectBetween[samp, {.001, .01}]; //RepeatedTiming
r2 = Random`Utilities`SelectWithinRange[samp, {.001, .01}]; //RepeatedTiming

r1 === r2

{0.00688, Null}

{0.0019, Null}

True

Fixed sample

If the sample stays the same, and you are selecting different ranges of data, then the following will be much faster:

nf = Nearest[samp]; //AbsoluteTiming

r3 = nf[(.01+.001)/2, {All, (.01-.001)/2}];//RepeatedTiming

Sort @ r1 === Sort @ r3

{0.153673, Null}

{0.000071, Null}

True

Creating the NearestFunction is slow, but only needs to be done once, and then using the NearestFunction will be extremely fast.

Carl Woll
  • How much of the operations of the NearestFunction are you at liberty to share? I'm just interested in how it can be so efficient on arbitrary data like that. – b3m2a1 Jan 26 '18 at 02:33
  • @b3m2a1 I don't know much about it, other than that it uses a k-d tree under the hood. I think of it as a tool that can do vectorized "binary" searches. – Carl Woll Jan 26 '18 at 03:09
  • It seems that Pick[..., ..., True] is a little faster than Pick[..., ..., 1]. – Αλέξανδρος Ζεγγ Jan 26 '18 at 05:36
  • The generation of NearestFunction is relatively time consuming, however. – Αλέξανδρος Ζεγγ Jan 26 '18 at 05:43
  • @AlexanderZeng if it's building out a k-d tree that's to be expected. That would also nicely explain how it can be so fast. I wonder what its time-complexity looks like. – b3m2a1 Jan 26 '18 at 07:15
  • @CarlWoll I am also very curious: Is it possible to access and manipulate the internals of a NearestFunction? The operations that I have in mind are primarily appending and deleting points in the point cloud. In particular, appending should be very cheap (to some extent; I know that the stored tree may become unbalanced) and might come in handy when dealing with growing data sets... – Henrik Schumacher Jan 26 '18 at 07:39
  • @HenrikSchumacher See this Q&A – Sascha Jan 26 '18 at 10:28
  • @Sascha Thank you for the link! It does not precisely answer my question, but it seems to supply an acceptable workaround. – Henrik Schumacher Jan 26 '18 at 11:01

BoolEval was made exactly for these kinds of problems. Suppose you want to select the elements of array that lie between lo and hi. Then use:

Needs["BoolEval`"]
BoolPick[array, lo < array < hi]

Simple, concise, and as fast as it gets for an unsorted list (using documented functions). Using <= instead of < is also possible and is handled correctly.

Szabolcs