I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas DataFrame.
I've tried things like:
df['new_col'] = [token for token in (df['col'])]
but that just iterates over the rows and copies each cell's string rather than tokenizing it, so I'd definitely appreciate some help/resources.
I've never used spaCy (NLTK has always gotten the job done for me), but from glancing at the documentation it looks like this should work:
import spacy
nlp = spacy.load('en')  # in spaCy 3.x, use the full model name instead: spacy.load('en_core_web_sm')
df['new_col'] = df['text'].apply(nlp)
Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing, and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en', parser=False, entity=False) (in spaCy 2.x and later, the equivalent is the disable keyword: spacy.load('en_core_web_sm', disable=['parser', 'ner'])).
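If all you need are token lists for your clustering/classification features, it's usually faster still to stream the whole column through nlp.pipe, which processes the texts in batches instead of calling nlp() once per row. A minimal sketch, assuming a recent spaCy (3.x) with the en_core_web_sm model installed; the tiny DataFrame is just a stand-in for your 75K-row frame:

import spacy
import pandas as pd

# load the model and drop the components we don't need for plain tokenization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# stand-in for the real dataset
df = pd.DataFrame({'text': ['First cell. Several sentences of text.',
                            'Second cell of text.']})

# nlp.pipe streams the texts in batches; collect each Doc's tokens as strings
df['tokens'] = [[token.text for token in doc] for doc in nlp.pipe(df['text'])]
print(df['tokens'])

You can also keep the Doc objects themselves (df['docs'] = list(nlp.pipe(df['text']))), which is handy if you later need lemmas or sentence boundaries for the clustering/classification step.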