I have several categorical features and would like to transform them all using OneHotEncoder
. However, when I tried to apply the StringIndexer
, there I get an error:
stringIndexer = StringIndexer(
inputCol = ['a', 'b','c','d'],
outputCol = ['a_index', 'b_index','c_index','d_index']
)
model = stringIndexer.fit(Data)
An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
Py4JJavaError: An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
I think the above code will not give the same results as required. In the encoders section, there is required a little modification. Because, again the StringIndexer is applied on Indexers.So, that will results in the same results.
#In the following section:
encoders = [
StringIndexer(
inputCol=indexer.getOutputCol(),
outputCol="{0}_encoded".format(indexer.getOutputCol()))
for indexer in indexers
]
#Replace the StringIndexer with OneHotEncoder as follows:
encoders = [OneHotEncoder(dropLast=False,inputCol=indexer.getOutputCol(),
outputCol="{0}_encoded".format(indexer.getOutputCol()))
for indexer in indexers
]
Now, the complete code look like the following:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
categorical_columns= ['Gender', 'Age', 'Occupation', 'City_Category','Marital_Status']
# The index of string vlaues multiple columns
indexers = [
StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
for c in categorical_columns
]
# The encode of indexed vlaues multiple columns
encoders = [OneHotEncoder(dropLast=False,inputCol=indexer.getOutputCol(),
outputCol="{0}_encoded".format(indexer.getOutputCol()))
for indexer in indexers
]
# Vectorizing encoded values
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders],outputCol="features")
pipeline = Pipeline(stages=indexers + encoders+[assembler])
model=pipeline.fit(data_df)
transformed = model.transform(data_df)
transformed.show(5)
For more details,please refer: visit:[1] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer visit:[2] https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder.