Question:

What data structure (if any) is most efficient (fastest) for performing k-nearest neighbors (knn) search for finderful.com?

Experiment:

On a dataset of about 7500 WGS 84 Web Mercator coordinates, we attempt to find the k-nearest neighbors (where k=32) of each coordinate. Each coordinate represents an available home in Australia. Although we do not search for the neighbors of only these coordinates, it is a reasonably accurate estimate of our use case. We time how long it takes to search for the neighbors of all coordinates and record our findings below.

We test the implementations found in weka 3.9.0: LinearNNSearch, CoverTree, BallTree, and KDTree. LinearNNSearch appears to use a heap. We also test our own implementation, which uses Java 1.8’s TreeMap (red-black trees).

We use Euclidean distance in degrees as our measure, and do not attempt to approximate meters. For the weka NearestNeighbourSearches, we use weka’s EuclideanDistance class, and for our implementation we use our own Comparator.

Also for the weka NearestNeighbourSearches, we include the time of converting to and from Instances. We repeated the weka implementations a few times. We did not repeat our implementation.

Results:

Analysis:

Unsurprisingly, the Ball Tree and the KD-Tree did quite well, as they trim the search space significantly before they perform the search.

The Linear kNN Search iterates over the 7500 coordinates, calculates the distance, then stores the distance and coordinate in a heap. weka keeps the heap around size k by keeping track of the furthest coordinate in the heap, so that the heap always contains the k-nearest neighbors seen so far. The resulting algorithm complexity is *O(n log(k))*, and the distance is calculated once for each of the *n* coordinates.

Our own custom implementation is very simple. We simply dump the coordinates into a TreeMap and pull out the first 32. The Tree is sorted by distance using a custom comparator. As you can see from the results, this simple method does quite poorly. One reason is that the algorithm complexity is *O(n log(n))* and another is that we calculate the distance twice for each comparison.