How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

SparkUser · May 7, 2015 · Viewed 8k times

I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. I discovered that Spark is not able to work with that kind of variable directly.

In R there is a simple way to deal with that kind of problem: I transform the variable into a factor (categories), and R creates a set of indicator columns coded as {0,1}.

How can I perform this with Spark?

Answer

xmar · Dec 12, 2017

Using VectorIndexer, you can tell the indexer the maximum number of distinct values (cardinality) a feature may have in order to be treated as categorical, via the setMaxCategories() method.

import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)
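
A minimal usage sketch of the fit/transform step (the DataFrame df is hypothetical; its features column is assumed to contain ml Vectors):

// Fit scans the data to decide which features are categorical
// (at most 10 distinct values here), building one index per such feature.
val model = indexer.fit(df)

// Transform rewrites the categorical features as category indices
// in the "indexed" output column, leaving continuous features untouched.
val indexedDf = model.transform(df)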

From the Scaladoc:

Class for indexing categorical feature columns in a dataset of Vector.

This has 2 usage modes:

- Automatically identify categorical features (default behavior)
  - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
  - Set maxCategories to the maximum number of categories any categorical feature should have.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical.
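
To make that example concrete, here is a self-contained sketch; the toy data and local SparkSession are my own illustration, not part of the original answer:

import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("vectorindexer-demo").getOrCreate()
import spark.implicits._

// Feature 0 takes two values {-1.0, 0.0}; feature 1 takes three {1.0, 3.0, 5.0}.
val df = Seq(
  Vectors.dense(-1.0, 1.0),
  Vectors.dense(0.0, 3.0),
  Vectors.dense(0.0, 5.0)
).map(Tuple1.apply).toDF("features")

val model = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(2)
  .fit(df)

// categoryMaps should contain only feature 0: with maxCategories = 2,
// feature 1 (three distinct values) is left continuous.
println(model.categoryMaps)
model.transform(df).show(false)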

I find this a convenient (though coarse-grained) way to pick out the categorical features, but beware of cases where a field you want treated as continuous has lower arity than your categorical fields (e.g. age in a dataset of college students versus origin country or US state).
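
Note that VectorIndexer yields category indices, not the {0,1} indicator columns the question describes. To get R-style dummy columns you would typically chain StringIndexer with OneHotEncoder instead; the sketch below is my addition, assuming a hypothetical string column country and the Spark 3.x API (in Spark 2.x the encoder is a plain Transformer with no fit step):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map each distinct string in "country" to a numeric index.
val indexed = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
  .fit(df)
  .transform(df)

// Expand the index into a sparse vector of {0,1} indicators.
// By default dropLast = true, i.e. one reference level is dropped,
// much like R's treatment contrasts.
val encoded = new OneHotEncoder()
  .setInputCols(Array("countryIndex"))
  .setOutputCols(Array("countryVec"))
  .fit(indexed)
  .transform(indexed)

encoded.show(false)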