when i try to feed df2 to kmeans i get the following error
clusters = KMeans.train(df2, 10, maxIterations=30,
runs=10, initializationMode="random")
The error i get:
Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
df2 is a dataframe created as follow:
df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')
df2.show()
latitude| longitude|
60.1643075| 24.9460844|
60.4686748| 22.2774728|
how can i convert this two columns to Vector and feed it to KMeans?
The problem is that you missed the documentation's example, and it's pretty clear that the method train
requires a DataFrame
with a Vector
as features.
To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:
from pyspark.sql.functions import *
vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
outputCol="features")
# For your special case that has string instead of doubles you should cast them first.
expr = [col(c).cast("Double").alias(c)
for c in vectorAssembler.getInputCols()]
df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)
Besides, you should also normalize your features
using the class MinMaxScaler to obtain better results.
In order to achieve this using MLLib
you need to use a map function first, to convert all your string
values into Double
, and merge them together in a DenseVector.
rdd = df2.map(lambda data: Vectors.dense([float(c) for c in data]))
After this point you can train your MLlib's KMeans model using the rdd
variable.