How to efficiently find k-nearest neighbours in high-dimensional data?

algorithm data-structures computational-geometry nearest-neighbor dimensionality-reduction

Benno · Oct 18, 2010 · Viewed 7.9k times · Source

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using euclidean distance, currently k=2 if this makes it easiser)

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimension grows. In my sample implementation, its only slightly faster than exhaustive search.

My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?

Answer

The Wikipedia article for kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithm/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

I'd try it first directly and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely your going to end up with few enough dimensions for a kd-tree to handle).

How to efficiently find k-nearest neighbours in high-dimensional data?

Answer

Related questions