How to find the nearest neighbors of 1 Billion records with Spark?

Osiris picture Osiris · May 3, 2016 · Viewed 14.1k times · Source

Given 1 Billion records containing following information:

    ID  x1  x2  x3  ... x100
    1   0.1  0.12  1.3  ... -2.00
    2   -1   1.2    2   ... 3
    ...

For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).

What's the best way to compute this?

Answer

xenocyon picture xenocyon · Jul 28, 2016

As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/

The gist of it is:

  • Use sklearn’s k-NN fit() method centrally
  • But then use sklearn’s k-NN kneighbors() method distributedly