Pyspark euclidean distance between entry and column

Yong Hyun Kwon picture Yong Hyun Kwon · Oct 13, 2017 · Viewed 7.2k times · Source

I am working with pyspark, and wondering if there is any smart way to get euclidean dstance between one row entry of array and the whole column. For instance, there is a dataset like this.

+--------------------+---+
|            features| id|
+--------------------+---+
|[0,1,2,3,4,5     ...|  0|
|[0,1,2,3,4,5     ...|  1|
|[1,2,3,6,7,8     ...|  2|

Choose one of the column i.e. id==1, and calculate the euclidean distance. In this case, the result should be [0,0,sqrt(1+1+1+9+9+9)]. Can anybody figure out how to do this efficiently? Thanks!

Answer

mayank agrawal picture mayank agrawal · Oct 15, 2017

If you want euclidean for a fixed entry with a column, simply do this.

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from scipy.spatial import distance

fixed_entry = [0,3,2,7...] #for example, the entry against which you want distances
distance_udf = F.udf(lambda x: float(distance.euclidean(x, fixed_entry)), FloatType())
df = df.withColumn('distances', distance_udf(F.col('features')))

Your df will have a column of distances.