Building on the solution proposed here: Simplifying nested If statements

The data set can be found here: allGazes.dat

allGazesX = 
 Uncompress@
   Import[FileNameJoin[{NotebookDirectory[], "allGazes.dat.gz"}], 
    "String"];

I need to filter a large data set and believe I lack an efficient method to do so. The purpose here is to filter the gazes by the EuclideanDistance[] between them. This is what I am using currently:

GZ[delta_] := ParallelTable[
   Table[
     (* start from the first gaze, then keep every gaze that lies
        farther than delta from the most recently kept one *)
     Reap[
       z = allGazesX[[subNO, dispNo, 1, ;; 2]]; Sow[z];
       Scan[
         If[EuclideanDistance[#, z] > delta, z = #; Sow[z]] &,
         allGazesX[[subNO, dispNo, All, ;; 2]]]
     ][[2, 1]],
     {dispNo, Range[Length[allGazesX[[subNO]]]]}],
   {subNO, Range[5]}];
500
  • Might I ask why you aren't using ParallelTable for both dispNo and subNO? – rcollyer Mar 02 '12 at 15:13
  • Not having actual data, I'm not exactly sure what this is doing. But it looks like it might possibly benefit from use of single-argument Nearest[]. – Daniel Lichtblau Mar 02 '12 at 16:32
  • @rcollyer, If you do, I believe you get an error message saying you can't have nested ParallelTable[] – 500 Mar 02 '12 at 16:39
  • Both Table and ParallelTable accept multiple iterator arguments, or should. So, you could write ParallelTable[..., {subNO, ...}, {dispNo, ...}] instead of nesting the second Table inside. Note, dispNo has to go after subNO which it depends on. Does that clarify what I was asking? Or, do you still get an error message? – rcollyer Mar 02 '12 at 16:53
  • @rcollyer I think, using Table inside ParallelTable may make sense, if you want to force certain (coarse) granularity of your computations. – Leonid Shifrin Mar 02 '12 at 17:57
  • @Daniel Can you tell us how Nearest works? I timed it the other day, and the NearestFunction seemed to run in linear time in the number of points used to build it, both for 1D and 2D data with Euclidean distance. However, occasionally I got some inconsistent timings where the first run of the NearestFunction was slow, but the subsequent ones were fast. I couldn't reproduce this for large data though. Surely it must have better than linear complexity, but a naive measurement shows linear. Note I am talking about the timings for the NearestFunction, not Nearest. – Szabolcs Mar 02 '12 at 18:03
  • @LeonidShifrin I agree. I was just curious as to why he made that choice. – rcollyer Mar 02 '12 at 18:09
  • @Szabolcs I really should know the answer but I'm not recalling all details. For Euclidean distance and machine precision it should be substantially faster for lookup, approaching log(n) if you want some smallish constant number of nearby neighbors to a given value. There is a degradation with dimension but of course in 1 or 2 D this is not relevant. – Daniel Lichtblau Mar 02 '12 at 18:43
  • @Szabolcs I should also have said something about internals. Nearest uses an octree behind the scenes. So lookup tends to be fast. Or should be. If you are seeing discrepant behavior you might consider sending a bug report (can send to me, if you like). – Daniel Lichtblau Mar 02 '12 at 18:47
  • @Daniel, Do you think Nearest would be faster in my case? I have never used that function! – 500 Mar 02 '12 at 21:41
  • @500 Possibly Nearest would be useful (as in fast). As I said though, I'm not exactly sure what it is you are doing. – Daniel Lichtblau Mar 02 '12 at 22:32

1 Answer

EDIT

Apparently, I misunderstood the problem. Here is a solution which, in smaller tests, produces results identical to the original code:

getDistantPoints = 
  Compile[{{pts, _Real, 2}, {delta, _Real}},
     Module[{res = Table[{0., 0.}, {Length[pts]}], ctr = 1},
        res[[1]] = pts[[1]];
        Do[
          If[Norm[pts[[i]] - res[[ctr]]] > delta, 
            res[[++ctr]] = pts[[i]]
          ], 
          {i, Length[pts]}];
        Take[res, ctr]],
     CompilationTarget -> "C", RuntimeOptions -> "Speed"]


Clear[GZFastAlt];
GZFastAlt[delta_, data_] :=
  Module[{ldata = data},
     ParallelTable[
       Table[
          getDistantPoints[ldata [[subNO, dispNo, All, ;; 2]], delta],
          {dispNo, Range[Length[ldata [[subNO]]]]}
       ], {subNO, Range[5]}]];

and runs in about 2 seconds on my 6 cores:

(res = GZFastAlt[0.1,allGazesX]);//AbsoluteTiming
{2.2451172,Null}
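
The "identical results" claim can be spot-checked on a small input. The following is only a sketch: refDistantPoints is a hypothetical helper name (not part of the original post) that repeats the question's Reap/Sow loop for a single list of points, and pts is made-up random data rather than the real gaze set. It assumes getDistantPoints has been defined as above.

```mathematica
(* Hypothetical uncompiled reference: same moving-reference logic
   as the Reap/Sow loop in the question, for one list of points *)
refDistantPoints[pts_, delta_] :=
  Module[{z = First[pts]},
    Reap[
      Sow[z];
      Scan[If[EuclideanDistance[#, z] > delta, z = #; Sow[z]] &, pts]
    ][[2, 1]]];

(* Made-up toy data: 200 random 2D points in the unit square *)
pts = RandomReal[1, {200, 2}];

(* Should give True if the compiled version follows the same logic *)
refDistantPoints[pts, 0.1] == getDistantPoints[pts, 0.1]
```

A point whose distance lands within rounding error of delta could in principle be classified differently by Norm and EuclideanDistance, so an occasional mismatch on random data would not necessarily indicate a logic error.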

END EDIT

As a bonus, this keeps things packed, which is a big deal for your data - even in packed form, the computation consumes quite a bit of memory.
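
To see what "packed" means in practice, one can inspect a toy array; this is a sketch on made-up data (pts and kept are illustrative names), assuming getDistantPoints from above is defined:

```mathematica
(* Made-up toy data; RandomReal returns a packed array of machine reals *)
pts = RandomReal[1, {1000, 2}];
kept = getDistantPoints[pts, 0.1];

(* Both the input and the compiled function's output are expected to be packed *)
Developer`PackedArrayQ /@ {pts, kept}

(* Memory footprint of the packed array versus an unpacked copy *)
ByteCount[pts]
ByteCount[Developer`FromPackedArray[pts]]
```

The unpacked copy should report a noticeably larger ByteCount, which is why staying packed matters for a data set of this size.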

Leonid Shifrin
  • Are these identical? In the original, the value of z is a moving target, so to speak. – Daniel Lichtblau Mar 02 '12 at 16:32
  • @Daniel Good point! I thought z was only used as a recording device, and did not think that it could be important. Actually, z being a moving target makes more sense. Will think of a modification for this case, and hopefully update soon. – Leonid Shifrin Mar 02 '12 at 17:05
  • @Daniel Ok, fixed (hopefully). Thanks for spotting it! – Leonid Shifrin Mar 02 '12 at 17:36
  • You're welcome. As regards things that get spotted, I'm just glad it was a wandering z and not a hungry leopard. [Exit stage left, ducking rotting vegetables from audience.] – Daniel Lichtblau Mar 02 '12 at 18:46
  • @Daniel That assumes the audience was well-prepared, and actually expecting the failure :) – Leonid Shifrin Mar 02 '12 at 19:00