Given my pyspark Row object:
>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>
However, row.features fails the isinstance(row.features, Vector) test:
>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False
This strange behavior is causing me a lot of trouble. Unless isinstance(row.features, Vector) passes, I cannot generate a LabeledPoint using the map function. I would be really grateful if anyone could solve this problem.
It is unlikely to be a bug. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and are comparing against the wrong class.
Let's illustrate that with an example. First, some simple data:
from pyspark.ml.feature import OneHotEncoder
row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()
Now let's import the different vector classes:
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
and run the tests:
isinstance(row.features, MLLibVector)
False
isinstance(row.features, MLVector)
True
As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:
LabeledPoint(0.0, row.features)
TypeError Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
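The check fails even though both packages have a class called SparseVector, because each package defines its own, unrelated class hierarchy. Here is a toy sketch of the same situation. It uses made-up stand-in modules (toy.ml.linalg and toy.mllib.linalg are hypothetical names, not pyspark itself), so it runs without Spark:

```python
import types


def make_linalg_module(name):
    """Build a fake module that defines its own Vector and SparseVector classes."""
    mod = types.ModuleType(name)

    class Vector(object):
        pass

    class SparseVector(Vector):
        def __init__(self, size, values):
            self.size = size
            self.values = dict(values)

    # Record which (fake) package each class belongs to.
    Vector.__module__ = name
    SparseVector.__module__ = name
    mod.Vector = Vector
    mod.SparseVector = SparseVector
    return mod


# Two independent packages that happen to use the same class names.
ml = make_linalg_module("toy.ml.linalg")
mllib = make_linalg_module("toy.mllib.linalg")

v = ml.SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})

same_name = type(v).__name__ == mllib.SparseVector.__name__
is_ml = isinstance(v, ml.Vector)
is_mllib = isinstance(v, mllib.Vector)
origin = type(v).__module__

print(same_name, is_ml, is_mllib, origin)
# prints: True True False toy.ml.linalg
```

With a real object, looking at type(row.features).__module__ in the same way tells you which of the two packages the vector came from.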
You could convert the ML object to an MLlib one:
from pyspark.ml import linalg as ml_linalg
def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))
LabeledPoint(0, as_mllib(row.features))
LabeledPoint(0.0, (1,[],[]))
or simply:
LabeledPoint(0, MLLibVectors.fromML(row.features))
LabeledPoint(0.0, (1,[],[]))
but generally speaking you should avoid situations where this conversion is necessary.