8

Suppose we have a point set, the points are labeled 1,2,3,....n. And p[i] is the coordinate of the point. Now I want the label of nearest and next nearest point (maybe several) corresponding every point in this point set.

For example:

p[1] = {0, 0};
p[2] = {0, 1};
p[3] = {1, 0};
p[4] = {0, 5};

and I want a list for nearest neighbour like this:

{{2, 3}, {1}, {1}, {2}}

it means that p[2] and p[3] is nearest to p[1], p[1] is nearest to p[2], etc.

And I need a similar next nearest neighbour list.

I write the following code:

num = 5000;
Do[p[i] = RandomInteger[{1, 100}, 2], {i, 1, num}];

coolist = Table[p[i], {i, 1, num}];

Clear[position];
position[expr_] := 
  With[{positionData = 
     SortBy[#[[1, 1]] -> #[[All, 2]] & /@ 
        GatherBy[
         Extract[expr, #, Verbatim] -> # & /@ 
          Position[expr, _, Depth[expr]], First], 
       Min[Length /@ #[[2]]] &] // Dispatch}, 
   Replace[#, positionData] &];

poscoolist = position[coolist];

Clear[nnsite];
nnsite[k_, coolist_] := Module[{nncoolist},
   nncoolist = Nearest[Complement[coolist, {p[k]}], p[k]];
   Flatten@
    Table[poscoolist[nncoolist[[i]]], {i, 1, Length[nncoolist]}]];

nearestlist = Table[nnsite[i, coolist], {i, 1, num}]; // AbsoluteTiming

Clear[nnlabel];
Thread[Evaluate@Array[nnlabel, num] = nearestlist];

Clear[nnnsite];
nnnsite[k_, coolist_] := Module[{nnncoolist},
   nnncoolist = 
    Nearest[Complement[
      Delete[coolist, Partition[nnlabel[k], 1]], {p[k]}], p[k]];
   Flatten@
    Table[poscoolist[nnncoolist[[i]]], {i, 1, Length[nnncoolist]}]];

nextnearestlist = 
   Table[nnnsite[i, coolist], {i, 1, num}]; // AbsoluteTiming

the function position in the above code is provide by Mr.Wizard (see here). nearestlist and nextnearestlist give the result.

Notice to use my code, supply every coordinate p[i] of points.

For 5000 points, it takes 1 minutes. But for 10000 points, it takes 6 minutes. Quite long!

I feel that my code is so simple, there must be better ways that are more efficient.

matheorem
  • 17,132
  • 8
  • 45
  • 115
  • 5
    It looks like you're calling Nearest several times. Why not just call it once to get a precomputed NearestFunction, and use that to find the three nearest sites to any given point? The first one will be the site itself, the next two will be the points you want. It will be much more efficient. (I'd post an answer but I don't have access to Mathematica right now.) –  Dec 10 '13 at 00:52
  • @RahulNarain Oh, thank you. NearestFunction! Let me see it. – matheorem Dec 10 '13 at 00:56
  • @RahulNarain I lookup in the doc. I found that NearestFunction may not be working well in this case. because the number of nearest point is not fixed, maybe one and maybe two, even three.... – matheorem Dec 10 '13 at 01:04
  • @matheorem -- then set it to always return 3 points (or however many you need), and throw away the extra ones. – bill s Dec 10 '13 at 01:08
  • Check the documentation again. NearestFunction[...][x, n] gives the $n$ nearest elements to $x$. So you can ask for as many nearest points as you want. –  Dec 10 '13 at 01:34
  • @RahulNarain but the environment of every point is different. I can't set a fixed n? – matheorem Dec 10 '13 at 01:50
  • 1
    I don't understand what you're asking. –  Dec 10 '13 at 01:55
  • OK, return to my 4 point toy example. Let nnn = Nearest[{p[1], p[2], p[3], p[4]}] . Then nnn[p[1], 1] gives{{0, 0}}, nnn[p[1], 2] gives {{0, 0}, {0, 1}}, and nnn[p[1], 3] gives {{0, 0}, {0, 1}, {1, 0}}. But actually {0,1} and {1,0} has the same distance to p[1], so in this case p[1] has two nearest point. But if you see p[4], it only has one nearest point. – matheorem Dec 10 '13 at 02:54
  • Hi matheorem. I don't get your criterion. Why not {{2, 3}, {1, 4}, {1, 2}, {1, 2}}? since p[2] is the "next nearest point" to p[3]? – Silvia Dec 10 '13 at 04:00
  • 1
    I see, you want all the points at the nearest distance to $x$ in a single list. And then you want another list containing all the points at the second nearest distance. I suppose you could write a function that calls NearestFunction with larger and larger $n$ until it encounters a point at the third nearest distance... –  Dec 10 '13 at 04:16
  • @Silvia I want two list, one list contain nearest point labels, the other list contains next nearest point labels. Since p[2] is the "next nearest point" to p[3], so it should not be contained in the nearestlist – matheorem Dec 10 '13 at 04:16
  • @RahulNarain I'm glad I made myself clear and thank you for your interest in this question. But I don't have a fixed third nearest distance, it varies from point to point. – matheorem Dec 10 '13 at 04:21
  • Then p[4] is the next nearest point to p[2], why is it there? – Silvia Dec 10 '13 at 04:26
  • @Silvia Oh, my god! I made a mistake, I am so sorry. I have corrected it. But the code I write is right, I think. – matheorem Dec 10 '13 at 04:33
  • I don't think your code is compatible with your description. e.g. for input {{1, 2}, {1, 0}, {2, 2}, {2, 1}}, your code gives {{4}, {1}, {2}, {2, 1}}, where the nearest set should be {{3}, {4}, {1, 4}, {3}}. – Silvia Dec 10 '13 at 05:48
  • @Silvia No, You didn't use my code properly:). You should set coolist={{1, 2}, {1, 0}, {2, 2}, {2, 1}} and Do[p[i] = coolist[[i]], {i, 1, 4}] – matheorem Dec 10 '13 at 06:05
  • 1
    Nearest also supports a Nearest[data, x, {n, r}] syntax which returns $n$ nearest points within radius $r$, see: http://mathematica.stackexchange.com/a/34895/131 – Yves Klett Dec 10 '13 at 07:11

4 Answers4

8

Here's a simple solution using a single precomputed NearestFunction; much faster than $O(n^2)$. I've written it assuming the sites are in a list ps, rather than embedded inside a function p, because I think this way is easier to generate and manipulate. You may want to modify the code as appropriate.

num = 5000;
ps = RandomInteger[{1, 100}, {num, 2}];

nf = Nearest[Table[ps[[i]] -> i, {i, Length[ps]}]] (* so it returns the index of the site *)

upToNthNearestSites[k_, 0] := {k} (* the "zeroth" nearest neighbour, i.e. itself *)
upToNthNearestSites[k_, n_] := Module[{pk, near, d},
  pk = ps[[k]];
  near = upToNthNearestSites[k, n - 1]; (* get the nearest neighbours up to order n-1 *)
  near = nf[pk, Length[near] + 1]; (* get one more; this is one of the nth nearest *)
  d = N@EuclideanDistance[pk, ps[[Last@near]]]; (* distance to the nth nearest *)
  nf[pk, {Infinity, d}] (* the solution is all the sites up to that distance *)
  ]
nthNearestSites[k_, n_] := Module[{pk, near0, near, d},
  pk = ps[[k]];
  near0 = upToNthNearestSites[k, n - 1];
  near = nf[pk, Length[near0] + 1];
  d = N@EuclideanDistance[pk, ps[[Last@near]]];
  near = nf[pk, {Infinity, d}];
  Complement[near, near0] (* same as above except remove neighbours closer than n *)
  ]

The nearest neighbours to the $k$th site are given by nthNearestSites[k, 1], the second nearest by nthNearestSites[k, 2], and so on. On my machine, even with a million points, the initial construction of nf takes a little over a second, and after that nthNearestSites[1, 2] takes negligible time.

Edit: I forgot that you want the neighbours of all the sites collected in a big list. Well, then you just do

nearestSites = nthNearestSites[#, 1] & /@ Range[Length[ps]];
nextNearestSites = nthNearestSites[#, 2] & /@ Range[Length[ps]];

On a hundred thousand sites, these take 2.9 and 6.8 seconds on my machine respectively. On a million, they will probably take a couple of minutes.

P.S.

  1. You could just define nthNearestSites[k_, n_] := Complement[upToNthNearestSites[k, n], upToNthNearestSites[k, n - 1]], but that would end up evaluating the $(n-1)$th neighbours twice (as well as the $(n-2)$th, the $(n-3)$th, and so on). In the implementation above, it makes exactly $2n$ calls to the NearestFunction.

  2. I'm not too happy about having to put the N around EuclideanDistance. Unfortunately, NearestFunction doesn't accept something like $\sqrt2$ as the search radius.

  • Thank you for your answer! I am testing now, and I also found the EuclideanDistance problem, so I use N and add small offset 0.0000001 to it, I found this small offset is necessary, see my answer. – matheorem Dec 10 '13 at 08:57
  • I see your answer but it doesn't tell me why the offset is necessary. –  Dec 10 '13 at 09:03
  • Well, you can try this ps={{0, 0}, {Sqrt[2], 0}, {0, Sqrt[2]}, {Sqrt[2], Sqrt[2]}}. Very odd,.but it's true, offset is necessary. – matheorem Dec 10 '13 at 09:49
  • Ah... Well, another workaround is that if you set ps = N@{{0, 0}, ...} instead then it works. –  Dec 10 '13 at 10:45
  • Ok, are you considering to improve your answer with N? Though my method is identical to yours, but I make the label a little complex, so I'll accept yours, if there is no better method – matheorem Dec 10 '13 at 11:04
  • I'm sorry, I don't understand what you mean by "improve your answer with N". I'm not planning to change anything in my answer at this point. Also, I don't quite understand your answer either as it does not have an explanation, but I'll take your word for it that yours and mine are identical. You could wait a while before accepting though; who knows, someone may come up with an even better solution. –  Dec 10 '13 at 11:08
  • I mean if the coordinates of points contain Sqrt, then to make the code most general and correct, either to add ps=N@ps, or add a little offset to d in your code. And in terms of the identity of your code and mine, I'm not claim any precedence here, because anyway you enlightened me to use NearestFunction first. But I read through your code and test it, so I am sure we had the same idea. And finally, if I had know that you would post an answer, I wouldn't spend a whole afternoon to design a code myself :). especially when yours is more general and efficient than mine. Thank you so much! – matheorem Dec 10 '13 at 11:20
  • Sorry, I found bugs. See this ps list here http://pastebin.com/GXtnVxYi This ps list contains Sqrt, and if you try ps=N@ps and add small offset to d respectively, you will find their result is different. And I can confirm, add offset gives right result, because the right result will compatible with my other code. But merely add small offset without ps=N@ps will effect the performance much. So I conclude that ps=N@ps and a small offset, None is dispensable for correct and efficient code. I hope you could check this to see if my statement is right. Thank you – matheorem Dec 10 '13 at 14:26
  • Imagine you would have many of such lists. In the real world each list could represent the coordinates of objects at a certain time. How could I trace the objects in time? To do so I would need to trace a single coordinate "from list to list" by finding the nearest coordinate. That should be done for all coordinates in the first data set. – mrz Oct 18 '16 at 21:13
  • Hi, @mrz, I don't get it. Could you explain what your comment mean a little further? – matheorem Oct 19 '16 at 14:00
6

I would utilize the fact that NearestFunction is Listable, and then capture information about a large number of nearest points, and then process that information. Here is some data:

num = 200;
SeedRandom[1];
ps = N @ DeleteDuplicates @ RandomInteger[{1, 100}, {num, 2}];

Note that I apply DeleteDupicates to the dataset, so that I can compare this to your answer. I believe your handling of datasets with duplicate points has a bug.

The following calls to Nearest will result in NearestFunction objects returning the indices of the nearest points, and the distances to the nearest points:

index = Nearest[ps -> "Index"];
distance = Nearest[ps -> "Distance"];

For this example, the "large number" I will use is 10. So, we use the distance NearestFunction to get the distances to the 10 nearest points, and then tally the distances:

tally = (Part[#, All, All, 2]&) @ Map[Tally] @ distance[ps, 10];

For each point, the corresponding entry of tally will give a list {c0, c1, c2, ..} where c0 will be 1 (there is one point at distance 0), c1 will give the number of nearest points (including ties), c2 will give the number of next-nearest points, etc. The length of the shortest entry is how many next-nearest entries are available:

max = Min[Length /@ tally]

8

The 8th entry could be missing data, so, grabbing the 10 nearest points allows one to tally the counts up to the next-next-next-next-next-nearest points with this dataset.

The indices of these points are obtained using:

indices = index[ps, 10];

Finally, to grab the right indices it will be convenient to Accumulate the tallies:

accum = Accumulate /@ tally;

Now, we are ready to create a function that returns the list of nearest points for each point in data:

res[n_ /; n<=7] := MapThread[
    Take[#, {#2[[1]]+1, #2[[2]]}]&,
    {indices, accum[[All, {n, n+1}]]}
]

Let's compare to your answer:

nearestsitemodule[pslabeled];
r1 = nthNearestSites[#,1]& /@ pslabeled;
r2 = nthNearestSites[#,2]& /@ pslabeled;

r1 === res[1]
r2 === res[2]

True

True

Let's package up the code:

Options[NearestNeighbor] = {"NeighborCount" -> 20};

NearestNeighbor[data_, OptionsPattern[]] := Module[
    {index, distance, tally, max, indices, accum},

    index = Nearest[data -> "Index"];
    distance = Nearest[data -> "Distance"];
    tally = (Tally /@ distance[data, OptionValue["NeighborCount"]])[[All, All, 2]];
    max = Min[Length /@ tally];
    indices = index[data, OptionValue["NeighborCount"]];
    accum = Accumulate[{0, ##}]& @@@ tally;
    NearestNeighborFunction[indices, accum[[All, ;; max]], max-2]
]

NearestNeighborFunction[indices_, accum_, max_][n_] /; 0<=n<=max := MapThread[
    Take[#, {#2[[n+1]]+1, #2[[n+2]]}]&,
    {indices, accum}
]

MakeBoxes[i:NearestNeighborFunction[indices_, accum_, max_], StandardForm] ^:= Module[
    {
    len = Length[indices],
    g = ToBoxes @ Graphics[
        {
        PointSize[Large],
        Point[{0,0}],
        Red, Point@CirclePoints[1,3],
        Green, Point@CirclePoints[{2,0},5]
        },
        ImageSize->50]
    },

    BoxForm`ArrangeSummaryBox[
        NearestNeighborFunction,
        i,
        RawBoxes@g,
        {
        BoxForm`MakeSummaryItem[{"Data points: ", len}, StandardForm],
        BoxForm`MakeSummaryItem[{"Range: ", max}, StandardForm]
        },
        {},
        StandardForm,
        "Interpretable"->True
    ]
]

Apply NearestNeighbor to the data set, and check against previous results:

nn = NearestNeighbor[ps];

nn[1] === r1
nn[2] === r2

True

True

Apply NearestNeighbor to a larger data set:

SeedRandom[2]
num = 10^5;
data = N @ RandomInteger[{1, 10^3}, {num, 2}];

nn = NearestNeighbor[data]; //AbsoluteTiming
nn

{0.937728, Null}

enter image description here

So, with the default neighbor count of 20, the range is 5, i.e., we can find up to next-next-next-next-nearest neighbors. Find some nearest neighbor sets:

nn[0]; //AbsoluteTiming (* duplicate points *)
nn[1]; //AbsoluteTiming (* nearest neighbors *)
nn[2]; //AbsoluteTiming (* next-nearest neighbors *)
nn[5]; //AbsoluteTiming (* next-next-next-next-nearest neighbors *)

{0.236765, Null}

{0.242525, Null}

{0.239175, Null}

{0.246353, Null}

Carl Woll
  • 130,679
  • 6
  • 243
  • 355
3

Here is a way to do it. At my first attempt I did not compile the functions and was a bit slower than your reported time. With the compiled function it is faster (at least on a 2Ghz/i7 Macbook Pro) : 5000 takes 23 secs at 10000 is 95 secs.

X := RandomInteger[{0, 100}];
n = 10000; data = Table[{X, X}, {i, 1, n}] ;

tStart = AbsoluteTime[];

"First, compute matrix of square distances (faster than taking the \
square roots) -- compiling is way faster"
dm = Compile[
   {{D, _Real, 2}, n},
   Table[
    (D[[i, 1]] - D[[j, 1]])^2 + (D[[i, 2]] - D[[j, 2]])^2,
    {i, 1, n}, {j, 1, n}]
   ];
distMatrix = dm[data, n]; // AbsoluteTiming

"Now get the first and second closest distances :"
firstDistances = 
   Table[Min[Select[distMatrix[[i]], # != 0 &]], {i, 1, 
     n}]; // AbsoluteTiming
secondDistances = 
   Table[Min[
     Select[distMatrix[[i]], # != 0 && # != 
         firstDistances[[i]] &]], {i, 1, n}]; // AbsoluteTiming

"Finally find the neighbours :"
firstDistanceIndices = 
   Table[Position[distMatrix[[i]], firstDistances[[i]]], {i, 1, 
     n}]; // AbsoluteTiming
secondDistanceIndices = 
   Table[Position[distMatrix[[i]], secondDistances[[i]]], {i, 1, 
     n}]; // AbsoluteTiming

"Done. Time elapsed :"
AbsoluteTime[] - tStart

Clearly the code is $O(n^2)$ and the performance reflects that : doubling $n$ about quadruples the execution time. $n^2/10^6$ is a reasonable approximation to the actual time, at least for $n\leq10000$.

A.G.
  • 4,362
  • 13
  • 18
3

Update Based on Rahul's method. I clear some 'bug's (for example for situations when there are points have the same coordinates), and also did some improvements:

  1. using memorizing trick f[x_]:=f[x]=expr
  2. directly using Nearest[{e1->v1,e2->v2,...},x]

Testing data

num = 200;
ps = N@RandomInteger[{1, 100}, {num, 2}];
pslabeled = Thread[ps -> Range[num]];(*generate a rulelist with everypoint labeled*)

The nearestsitemodule

    Clear[nearestsitemodule];
    nearestsitemodule[pslabeled_] := 
      Module[{ps, labeldispatch, reverselabeldispatch, offset},
       labeldispatch = Dispatch[pslabeled];
       reverselabeldispatch = Dispatch[Reverse /@ pslabeled];
       offset = 0.0000001;
       nf = Nearest[pslabeled]; (*generate nearest function*)

       Clear[upToNthNearestSites];

       (*the "zeroth" nearest neighbour,i.e.itself*)
       upToNthNearestSites[coorulelabel_,0] := {Last@coorulelabel}; 

        (*upToNthNearestSites[coorulelabel,n] 
gives the label list of up to nth nearest neighbours relative to coorulelabel
           using memorizing trick f[x_]:=f[x]=expr*)
       upToNthNearestSites[coorulelabel_, n_] := 
        upToNthNearestSites[coorulelabel, n] = Module[{coo, near, d},
          coo = First@coorulelabel;
          near = upToNthNearestSites[coorulelabel, n - 1];(*get the nearest neighbours up to order n-1*)
          near = nf[coo, Length[near] + 1];(*get one more;
          this is one of the nth nearest*)
          d = N[EuclideanDistance[coo, Replace[Last[near], reverselabeldispatch]] + offset];
(*d is distance to the nth nearest, notice a small offset is necessary!!!!*)
          nf[coo, {All, d}](*the solution is all the sites up to that distance*)];

       (*nthNearestSites[coorulelabel,n] gives the label list of nth nearest neighbours to coorulelabel*)
       Clear[nthNearestSites];
       nthNearestSites[coorulelabel_, n_] := Module[{near0, near, d},
         near0 = upToNthNearestSites[coorulelabel, n - 1];
         near = upToNthNearestSites[coorulelabel, n];
         Complement[near, near0](*get label list of the nth nearest neighbour to  coorulelabel*)];
       ];

Run

nearestsitemodule[pslabeled];
nthNearestSites[#, 1] & /@ pslabeled; // AbsoluteTiming
nthNearestSites[#, 2] & /@ pslabeled; // AbsoluteTiming

The performance for 5000 points is 0.58s and 0.79s

The performance for 10000 points is 1.32s and 1.27s

Rahul claimed "On a hundred thousand sites, these take 2.9 and 6.8 seconds". I think his computer must be much faster than mine. Because my test shows, both code's running time is comparable, but as I said, I deal with points with the same coordinates.

matheorem
  • 17,132
  • 8
  • 45
  • 115