Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

Jack Lv picture Jack Lv · Dec 10, 2016 · Viewed 7k times · Source

Given my pyspark Row object:

>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features failed to pass isinstance(row.features,Vector) test.

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False

This strange error made me in huge trouble. Without passing "isinstance(row.features, Vector)," I am not able to generate LabeledPoint using map function. I will be really grateful if anyone can solve this problem.

Answer

zero323 picture zero323 · Dec 10, 2016

It is is unlikely an error. You didn't provide a code required to reproduce the issue but most likely you use Spark 2.0 with ML transformers and you compare wrong entities.

Let's illustrate that with an example. Simple data

from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()

Now lets import different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import  LabeledPoint

and make tests:

isinstance(row.features, MLLibVector)
False
isinstance(row.features, MLVector)
True

As you see what we have is pyspark.ml.linalg.Vector not pyspark.mllib.linalg.Vector which is not compatible with the old API:

LabeledPoint(0.0, row.features)
TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

You could convert ML object to MLLib one:

from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))
LabeledPoint(0.0, (1,[],[]))

or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))
LabeledPoint(0.0, (1,[],[]))

but generally speaking you should avoid situations when it is necessary.