5

I have a set of points, representing location of events in a map, and I want to find the closest neighbour for each point. Variable xys is a 2D matrix containing the set of coordinates for each point (column 1 for x- and column 2 for y-coordinate).

xys = Table[RandomReal[{1, 9000}], {rr, 1, 300}, {cc, 1, 2}]

To calculate the distance I come up with two approaches:

Approach 1

Table[ ArcLength[Line[{xys[[jj]], xys[[ii]]}]], {ii, 1, Length[xys]}, {jj, 1, Length[xys]}] 

{21.8557, Null}

Approach 2

Table[Sqrt[(xys[[1, 1]] - xys[[ii, 1]])^2 + (xys[[1, 2]] - xys[[ii, 2]])^2], {ii, 1, 1000}]; 

{0.468003, Null}

Approach 1 quite inefficient: for calculating 300 points, it takes 21 sec. Approach 2 is almost two order of magnitude faster, at 0.4 sec.

What I would like to understand is why is there such a big difference. I thought ArcLength[Line[...]] would be the faster and more efficient since it uses Mathematica functions. Since my real data set ultimately will be 10000-30000 points, the whole process will end up to be quite long even with the second approach. Therefore I would be grateful if anyone could suggest an even more efficient and faster approach.

Thanks,

Sange
  • 325
  • 1
  • 9

3 Answers3

9

Reading the documentation of ArcLength, it seems to perform an integration to determine the length of an arbitrarily complex arc. It is therefore written to support much more complex shapes than simple point-to-point lines and is simply oversized for your purpose.

The built-in function you are probably looking for is EuclideanDistance, which is about two orders of magnitude faster than ArcLength:

ArcLength[Line[{{0, 0}, {1, 0}}]] // RepeatedTiming // ScientificForm
EuclideanDistance[{0, 0}, {1, 0}] // RepeatedTiming // ScientificForm

{2.0*10^(-4), 1}

{4.97*10^(-6), 1}

As for the faster solutions, they have already been stated in the comments and answers. Nearest is usually the fastest way to do such tasks. By the way, when constructing a DistanceMatrix as suggested by @kglr, it also uses EuclideanDistance as the default DistanceFunction for numerical data.

Theo Tiger
  • 1,273
  • 8
  • 10
8
xys = RandomReal[{1, 9000}, {10, 2}] 
DistanceMatrix[xys] //MatrixForm // TeXForm

$\tiny\left( \begin{array}{cccccccccc} 0. & 2811.96 & 1448.01 & 4793.19 & 4974.56 & 3210.02 & 3812.18 & 4829.63 & 3786.22 & 4540.3 \\ 2811.96 & 0. & 2223.34 & 7489.06 & 2545.11 & 5986.82 & 6588.37 & 4435.52 & 6566.55 & 1979.34 \\ 1448.01 & 2223.34 & 0. & 6082.27 & 3805.6 & 4428.17 & 5014.48 & 5624.91 & 4980.33 & 4194.48 \\ 4793.19 & 7489.06 & 6082.27 & 0. & 9767.75 & 1683.23 & 1176.8 & 7061.33 & 1226.24 & 8869.71 \\ 4974.56 & 2545.11 & 3805.6 & 9767.75 & 0. & 8164.34 & 8762.34 & 6654.91 & 8732.36 & 2983.62 \\ 3210.02 & 5986.82 & 4428.17 & 1683.23 & 8164.34 & 0. & 602.571 & 6362.66 & 579.835 & 7528.06 \\ 3812.18 & 6588.37 & 5014.48 & 1176.8 & 8762.34 & 602.571 & 0. & 6798.88 & 53.2738 & 8113.77 \\ 4829.63 & 4435.52 & 5624.91 & 7061.33 & 6654.91 & 6362.66 & 6798.88 & 0. & 6812.2 & 3916.62 \\ 3786.22 & 6566.55 & 4980.33 & 1226.24 & 8732.36 & 579.835 & 53.2738 & 6812.2 & 0. & 8100.6 \\ 4540.3 & 1979.34 & 4194.48 & 8869.71 & 2983.62 & 7528.06 & 8113.77 & 3916.62 & 8100.6 & 0. \\ \end{array} \right)$

Nearest[xys -> {"Index", "Element", "Distance"}, xys, 2][[All, 2]]

{{3, {4186.12, 3056.38}, 1448.01},
{10, {126.15, 4110.1}, 1979.34},
{1, {4655.43, 4426.22}, 1448.01},
{7, {7903.7, 6421.58}, 1176.8},
{2, {2004.91, 3487.15}, 2545.11},
{9, {7905.61, 6368.34}, 579.835},
{9, {7905.61, 6368.34}, 53.2738},
{10, {126.15, 4110.1}, 3916.62},
{7, {7903.7, 6421.58}, 53.2738},
{2, {2004.91, 3487.15}, 1979.34}}

kglr
  • 394,356
  • 18
  • 477
  • 896
4

These are the tests and timings of different approaches, and lastly what I think is the best implementation

xys = Table[RandomReal[{1, 9000}], {rr, 1, 1000}, {cc, 1, 2}];

dist = Table[EuclideanDistance[xys[[ii, 1]], xys[[jj, 1]]], {ii, 1, Length[xys]}, {jj, 1, Length[xys]}]; // Timing

dist = DistanceMatrix[xys]; // Timing

{1.38841, Null}

{0.0312002, Null}

The EuclideanDistance is indeed the right function, as Theo Tiger has nicely explained. However, the inbuilt function DistanceMatrix is 1-2 order of magnitude faster. Therefore, DistanceMatrix appears to be the better option for calculating all distance between points and generate a matrix.

Flatten[Table[TakeSmallest[dist[[ii, All]], 11][[2 ;; 11]], {ii, 1, Length[dist]}]]; // Timing

Flatten[Table[Sort[dist[[ii, All]]][[2 ;; 11]], {ii, 1, Length[dist]}]]; // Timing

{0.0780005, Null}

{0.156001, Null}

For finding the first nearest object(s), Sort is a good way, but my test I found that TakeSmallest is an even better approach. In conclusion DistanceMatrix and TakeSmallest are the one I used for my solution and I when I computed to find the 10 closest neighbors' distances for 10 sets of 10000 points each, it took mere seconds

Table[
xys = RandomReal[{1, 9000}, {10000, 2}];

dist[rr] = DistanceMatrix[xys];

nearneighbor[rr] = 
Flatten[Table[ 
  TakeSmallest[dist[rr][[ii, All]], 11][[2 ;; 10]], {ii, 1, 
   Length[dist[rr]]}]];

, {rr, 1, 10}]; // Timing

{3.47882, Null}

Thank you all for your help.

Sange
  • 325
  • 1
  • 9