use FeatureUnion in scikit-learn to combine two pandas columns for tfidf

BLodge · Jan 10, 2016 · Viewed 8.6k times

While using this as a model for spam classification, I'd like to add an additional feature: the Subject plus the body.

I have all of my features in a pandas DataFrame. For example, the subject is df['Subject'], the body is df['body_text'], and the spam/ham label is df['ham/spam'].

I receive the following error: TypeError: 'FeatureUnion' object is not iterable

How can I use both df['Subject'] and df['body_text'] as features while running them through the pipeline?

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score
import numpy

features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
('count_vectorizer',  CountVectorizer(ngram_range=(1, 2))),
('tfidf_transformer',  TfidfTransformer()),
('classifier',  MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[train_indices]['ham/spam'].values

    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)

Answer

David Maust · Jan 10, 2016

FeatureUnion was not meant to be used that way. It takes a list of feature extractors / vectorizers (transformers) and applies each of them to the input; it does not take data in its constructor the way it is shown.

CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the two columns into one string per row, which passes the text from both columns to the same CountVectorizer.

combined_2 = df['Subject'] + ' '  + df['body_text']
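As a minimal sketch (reusing the pipeline defined in the question), this concatenated column can then be passed straight to the pipeline, since each row is now a single string:

# combined_2 is a Series of strings, which is what CountVectorizer expects.
pipeline.fit(combined_2, df['ham/spam'])
predictions = pipeline.predict(combined_2)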

An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.

import scipy.sparse as sp

subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')
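As a rough sketch of the remaining steps (assuming the labels in df['ham/spam'] as in the question), the stacked count matrix can be tf-idf weighted and fed to the classifier directly, since it is no longer raw text:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Apply tf-idf weighting to the stacked counts, then train the classifier.
tfidf_transformer = TfidfTransformer()
combined_tfidf = tfidf_transformer.fit_transform(combined_2)

classifier = MultinomialNB()
classifier.fit(combined_tfidf, df['ham/spam'])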

A third option is to implement your own transformer that would extract a dataframe column.

from sklearn.base import TransformerMixin

class DataFrameColumnExtracter(TransformerMixin):

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]

In that case you could use FeatureUnion on two pipelines, each containing your custom transformer, then CountVectorizer.

from sklearn.pipeline import make_pipeline, make_union

subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)

This feature union of pipelines takes the dataframe, and each inner pipeline processes its own column. The result is the horizontal concatenation of the term-count matrices from the two columns.

 sparse_matrix_of_counts = feature_union.fit_transform(df)

This feature union can also be added as the first step in a larger pipeline.
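For example, a rough sketch of such a pipeline, mirroring the steps from the question (the step names here are illustrative), might look like this:

full_pipeline = Pipeline([
    ('features', feature_union),                 # per-column extraction and counting
    ('tfidf_transformer', TfidfTransformer()),   # tf-idf weighting on the stacked counts
    ('classifier', MultinomialNB())])

# The pipeline now accepts the raw dataframe and the label column.
full_pipeline.fit(df, df['ham/spam'])
predictions = full_pipeline.predict(df)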