How to find the nearest neighbors of 1 Billion records with Spark?

apache-spark pyspark spark-dataframe nearest-neighbor euclidean-distance

Osiris · May 3, 2016 · Viewed 14.1k times · Source

Given 1 Billion records containing following information:

    ID  x1  x2  x3  ... x100
    1   0.1  0.12  1.3  ... -2.00
    2   -1   1.2    2   ... 3
    ...

For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).

What's the best way to compute this?

Answer

The gist of it is: