I am working with pyspark, and wondering if there is any smart way to get euclidean dstance between one row entry of array and the whole column. For instance, there is a dataset like this.
+--------------------+---+
| features| id|
+--------------------+---+
|[0,1,2,3,4,5 ...| 0|
|[0,1,2,3,4,5 ...| 1|
|[1,2,3,6,7,8 ...| 2|
Choose one of the column i.e. id==1, and calculate the euclidean distance. In this case, the result should be [0,0,sqrt(1+1+1+9+9+9)]. Can anybody figure out how to do this efficiently? Thanks!
If you want euclidean for a fixed entry with a column, simply do this.
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from scipy.spatial import distance
fixed_entry = [0,3,2,7...] #for example, the entry against which you want distances
distance_udf = F.udf(lambda x: float(distance.euclidean(x, fixed_entry)), FloatType())
df = df.withColumn('distances', distance_udf(F.col('features')))
Your df will have a column of distances.