Sklearn Pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
DataFrameMapper: https://github.com/paulgb/sklearn-pandas
What's the difference between them?
It seems to me that sklearn Pipeline has more features, but DataFrameMapper is cleaner for me to use.
Edit: See comment discussion about the perhaps better ColumnTransformer
approach. A comparison table that may be of interest can be found Here.
Basically, the DataFrameMapper (and the entire sklearn-pandas package) aims to combine the benefits of pandas DataFrame objects with the power of the sklearn machine learning package.
An sklearn.Pipeline describes an ordered list of transformations to perform on matrix formats (provided by the numpy and scipy packages). Those transformations are executed on the entire matrix one after the other, and together they encode an entire training and prediction process from start to end.
The first part of the pipeline transformation (aka step) tuple is its name. The name does not change what the step computes; it serves readability and is the key used to refer to that step later.
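As a rough illustration of the (name, step) tuples described above, here is a minimal sketch. The toy data and the step names "scale" and "clf" are made up for this example; any estimator with the usual fit/transform or fit/predict interface could take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (invented for illustration): 6 samples, 2 numeric features.
X = np.array([[1.0, 20.0], [2.0, 30.0], [3.0, 25.0],
              [8.0, 90.0], [9.0, 80.0], [10.0, 95.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) tuple; the names are arbitrary labels.
pipe = Pipeline([
    ("scale", StandardScaler()),     # applied to the whole matrix
    ("clf", LogisticRegression()),   # final step: the model itself
])

# fit() runs every step in order; predict() replays the transforms, then predicts.
pipe.fit(X, y)
predictions = pipe.predict(X)

# The step name is the key for looking a step up afterwards.
scaler = pipe.named_steps["scale"]
```

Every step here sees all columns of X, which is the "entire matrix" behaviour described above.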
A DataFrameMapper, on the other hand, operates on DataFrame objects created by the pandas package, and can apply transformations to selected parts of the dataframe (not necessarily the entire data set). DataFrames are similar to numpy and scipy matrices, with one notable difference: they keep track of row and column labels and metadata. DataFrame operations make sure rows and columns can be accessed, manipulated and read using logical identifiers instead of positional indices.
The first part of the DataFrameMapper transformation tuple describes which columns should be run through the transformation.
Three big differences are:
1. DataFrameMapper is a more flexible tool: it'll let you execute different transformations on different columns and is more oriented to transforming complex data structures, while sklearn.Pipeline is more oriented to executing machine learning related transformations on a homogeneous data set.
2. DataFrameMapper will let you keep the annotations and labels you've assigned to your pandas DataFrame object, while sklearn.Pipeline will "reduce" any result to a numpy/scipy array/matrix.
3. sklearn.Pipeline is part of a far more stable and well known package, and therefore may be the "safer" choice if considerations such as stability and maintenance are of importance.
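For the ColumnTransformer approach mentioned in the edit note at the top, a minimal sketch of per-column transformations that stays inside sklearn might look like the following. The data and the step names "pet_onehot" and "age_scaled" are invented for illustration; sparse_threshold=0 is set so the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy DataFrame (invented for illustration).
df = pd.DataFrame({
    "pet": ["cat", "dog", "fish", "dog"],
    "age": [4.0, 6.0, 3.0, 7.0],
})

# Each entry is (name, transformer, columns): different columns, different transforms.
ct = ColumnTransformer([
    ("pet_onehot", OneHotEncoder(), ["pet"]),
    ("age_scaled", StandardScaler(), ["age"]),
], sparse_threshold=0)  # always return a dense array

out = ct.fit_transform(df)

# With default settings the column labels are gone: the result is a plain numpy array.
print(type(out), out.shape)
```

This shows both sides of the comparison: per-column control similar to DataFrameMapper, and the array output described in the second difference.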