I asked a question a few days ago about how to find the nearest neighbors of a given vector. My vector is now 21-dimensional, and before I proceed further, since my background is in neither machine learning nor math, I am beginning to ask myself some fundamental questions:
Can someone please clarify some (or all) of the above questions?
I currently study such problems -- classification, nearest neighbor searching -- for music information retrieval.
You may be interested in Approximate Nearest Neighbor (ANN) algorithms. The idea is that you allow the algorithm to return sufficiently near neighbors (perhaps not the exact nearest neighbor); in doing so, you reduce complexity. You mentioned the kd-tree; that is one example. But as you noted, the kd-tree performs poorly in high dimensions. In fact, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions [1][2][3].
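For concreteness, the linear search that those indexes degrade to is just a brute-force scan of the whole database. A minimal numpy version (illustrative, not tied to any particular library):

```python
import numpy as np

def linear_search(data, q):
    # Exact nearest neighbor by brute force: O(n * d) per query.
    # This is the baseline that partition-based indexes fall back
    # to in high dimensions.
    dists = np.linalg.norm(data - q, axis=1)  # distance to every point
    return int(np.argmin(dists))              # index of the closest one
```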
Among recently proposed ANN algorithms, perhaps the most popular is Locality-Sensitive Hashing (LSH), which maps a set of points in a high-dimensional space into a set of bins, i.e., a hash table [1][3]. But unlike traditional hashes, a locality-sensitive hash places nearby points into the same bin with high probability.
LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin from the hash table.
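If it helps to see the moving parts, here is a minimal single-table sketch (Python/numpy) of the p-stable scheme of [1]; the parameter values and the single-table setup are illustrative only:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

d = 21        # dimensionality of the vectors
k = 8         # number of hash functions concatenated into one key
w = 4.0       # bucket width; tune to the scale of your data

# Gaussian projections: the Gaussian distribution is 2-stable, so
# this hash family is locality-sensitive for Euclidean distance [1].
A = rng.normal(size=(k, d))
b = rng.uniform(0.0, w, size=k)

def lsh_key(x):
    # h_i(x) = floor((a_i . x + b_i) / w), concatenated into a tuple
    return tuple(np.floor((A @ x + b) / w).astype(int))

# Build: hash every database point into the table.
data = rng.normal(size=(10000, d))
table = defaultdict(list)
for i, x in enumerate(data):
    table[lsh_key(x)].append(i)

# Query: hash the query point, then scan only its bucket.
def query(q):
    candidates = table.get(lsh_key(q), [])
    if not candidates:
        return None  # a single table can miss; see note below
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    return candidates[int(np.argmin(dists))]

print(query(data[0] + 0.01 * rng.normal(size=d)))  # likely prints 0
```

In practice you would build several independent tables and union their candidate buckets, since any one table can miss the true neighbor.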
Second, there is a rigorous theory that supports its performance. It can be shown that the query time is sublinear in the size of the database, i.e., faster than linear search. How much faster depends upon how much approximation we can tolerate.
Finally, LSH is compatible with any Lp norm for 0 < p <= 2. Therefore, to answer your first question, you can use LSH with the Euclidean distance metric, or you can use it with the Manhattan (L1) distance metric. There are also variants for Hamming distance and cosine similarity.
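To illustrate the cosine-similarity variant (Charikar's sign random projections), here is a toy sketch; the point is only that near-identical vectors agree on most hash bits, while unrelated ones agree on about half:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 21, 16
H = rng.normal(size=(k, d))   # k random hyperplanes through the origin

def sign_hash(x):
    # One bit per hyperplane: which side of it x falls on. Vectors
    # separated by a small angle agree on most bits.
    return (H @ x > 0).astype(np.uint8)

x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)   # a near-duplicate of x
z = rng.normal(size=d)             # an unrelated vector
print(np.sum(sign_hash(x) == sign_hash(y)))  # high bit agreement
print(np.sum(sign_hash(x) == sign_hash(z)))  # roughly k/2 agreement
```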
A decent overview was written by Malcolm Slaney and Michael Casey for IEEE Signal Processing Magazine in 2008 [4].
LSH has been applied seemingly everywhere. You may want to give it a try.
[1] Datar, Immorlica, Indyk, Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions," 2004.
[2] Weber, Schek, Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," 1998.
[3] Gionis, Indyk, Motwani, "Similarity search in high dimensions via hashing," 1999.
[4] Slaney, Casey, "Locality-sensitive hashing for finding nearest neighbors," 2008.