Tokenizing using Pandas and spaCy

LMGagne · Oct 27, 2017 · Viewed 10.7k times

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word - not 75K rows in a pandas df.

I've tried things like: df['new_col'] = [token for token in (df['col'])]

but would definitely appreciate some help/resources.

full (albeit messy) code available here

Answer

Peter · Oct 27, 2017

I've never used spaCy (nltk has always gotten the job done for me), but from glancing at the documentation it looks like this should work:

import spacy

# 'en' was the model shortcut at the time this was written; newer spaCy
# versions use the full package name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')

# apply the pipeline to every row; each cell becomes a spaCy Doc object
df['new_col'] = df['text'].apply(lambda x: nlp(x))
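
Since the question asks for tokens specifically: if you want plain strings rather than Doc objects, a small follow-up sketch (the 'tokens' column name is just an example):

# each cell in 'new_col' is a Doc, which iterates over Token objects,
# so pulling out the raw strings is a one-liner per row
df['tokens'] = df['new_col'].apply(lambda doc: [token.text for token in doc])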

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model. In spaCy 1.x that looked like nlp = spacy.load('en', parser=False, entity=False); from spaCy 2.0 on, pass a list instead, e.g. nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']).
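
If you only need tokens, a sketch of both speed-ups (the model name and column names are just examples; nlp.pipe, which streams texts in batches, is usually the bigger win on tens of thousands of rows):

import spacy

# disable the components we don't need - we only want tokens
# (spaCy 2.0+ syntax; 'en_core_web_sm' is an example model name)
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])

# Option 1: run only the tokenizer, skipping the rest of the pipeline
df['tokens'] = df['text'].apply(lambda x: [t.text for t in nlp.tokenizer(x)])

# Option 2: nlp.pipe processes the texts in batches, which is much
# faster than calling nlp() once per row
df['docs'] = list(nlp.pipe(df['text']))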