What's the difference between sklearn Pipeline and DataFrameMapper?

nkhuyu · Nov 1, 2016

Sklearn Pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

DataFrameMapper: https://github.com/paulgb/sklearn-pandas

What's the difference between them?

It seems to me that the sklearn Pipeline has more features, but DataFrameMapper is cleaner for me to use.

Answer

NirIzr · Nov 1, 2016

Edit: See the comment discussion about the possibly better ColumnTransformer approach. A comparison table that may be of interest can be found here.


Basically, the DataFrameMapper (and the entire sklearn-pandas package) aims to combine the benefits of pandas DataFrame objects with the power of the sklearn machine learning package.

An sklearn.Pipeline describes an ordered list of transformations to perform on matrix-like data (the array and sparse matrix formats provided by the numpy and scipy packages). Those transformations are executed on the entire matrix one after the other, and together they can encode an entire training and prediction process from start to end.

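The first part of each pipeline transformation (aka step) tuple is its name. The name has no effect on the computation itself; it labels the step for readability and serves as the prefix for addressing that step's parameters (for example, clf__C for the C parameter of a step named clf).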
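As a rough illustration of the step tuples described above, here is a minimal sketch. The toy data, the step names "scale" and "clf", and the choice of StandardScaler and LogisticRegression are assumptions made for the example, not something taken from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: a feature matrix and a label vector.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [8.0, 8.0], [9.0, 9.0], [10.0, 10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) tuple; steps run in the order listed.
model = Pipeline([
    ("scale", StandardScaler()),    # applied to every column of X
    ("clf", LogisticRegression()),  # final step: the estimator
])

# The step name is the prefix used to address that step's parameters.
model.set_params(clf__C=0.5)

model.fit(X, y)           # fits the scaler, then the classifier
preds = model.predict(X)  # scales X with the fitted scaler, then predicts
```

Note that the whole matrix X flows through every step; there is no per-column selection at this level.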

A DataFrameMapper, on the other hand, operates on DataFrame objects created by the pandas package, and can apply transformations to selected parts of the dataframe (not necessarily to the entire data set). DataFrames are similar to numpy and scipy matrices, with one notable difference: they keep track of row and column labels and metadata. DataFrame operations make sure rows and columns can be accessed, manipulated and read by label instead of by positional index.

The first part of the DataFrameMapper transformation tuple describes which columns should be run through the transformation.

Three big differences are:

  1. A DataFrameMapper is the more flexible tool: it lets you apply different transformations to different columns and is geared toward transforming complex data structures, while sklearn.Pipeline is geared toward executing machine learning related transformations on a homogeneous data set.
  2. A DataFrameMapper will let you keep the annotations and labels you've assigned to your pandas DataFrame object, while sklearn.Pipeline will "reduce" any result to a numpy/scipy array/matrix.
  3. sklearn.Pipeline is part of a far more stable and well-known package, and may therefore be the "safer" choice if considerations such as stability and maintenance are important.
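For comparison, the ColumnTransformer mentioned in the edit note offers per-column transformations inside scikit-learn itself (it lives in sklearn.compose, available from scikit-learn 0.20). The sketch below uses made-up data and hypothetical step names; sparse_threshold=0 is set only to force a dense result for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame, made up for illustration.
df = pd.DataFrame({
    "pet": ["cat", "dog", "dog", "fish"],
    "children": [4.0, 6.0, 3.0, 3.0],
    "salary": [90.0, 24.0, 44.0, 27.0],
})

# Each entry is (name, transformer, columns).
ct = ColumnTransformer(
    [
        ("pet", OneHotEncoder(), ["pet"]),
        ("children", StandardScaler(), ["children"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Unlisted columns ("salary") are dropped by default (remainder="drop").
arr = ct.fit_transform(df)
```

With these settings the result is a plain numpy array with no column labels, which matches the second difference listed above.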