How to filter tokens from a spaCy document

Kon Pal · Jul 28, 2017 · Viewed 8.2k times · Source

I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the filtered sequence of tokens, but I am interested in having an actual Doc structure.

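import spacy

# Assumes an English pipeline with a POS tagger is installed
# (spaCy 2.x also accepted the shortcut name 'en')
nlp = spacy.load('en_core_web_sm')

text = u"This document is only an example. " \
    "I would like to create a custom pipeline that will remove specific tokens from the final document."

doc = nlp(text)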

def keep_token(tok):
    # This is only an example rule
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}

final_tokens = list(filter(keep_token, doc))

# How to get a spacy.Doc from final_tokens?

I tried to reconstruct a new spaCy Doc from the token list, but the API does not make it clear how to do this.

Answer

gdaras · Oct 1, 2018

I am pretty sure that you have found a solution by now, but since it is not posted here I thought it might be useful to add it.

You can remove tokens by converting the doc to a numpy array, deleting the corresponding rows from the array, and then converting it back to a doc.

Code:

import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy

def remove_tokens_on_match(doc):
    # Collect the positions of the tokens to drop
    indexes = [index for index, token in enumerate(doc)
               if token.pos_ in ('PUNCT', 'NUM', 'SYM')]
    attrs = [LOWER, POS, ENT_TYPE, IS_ALPHA]
    # Export the annotations (one row per token), then delete the unwanted rows
    np_array = doc.to_array(attrs)
    np_array = numpy.delete(np_array, indexes, axis=0)
    # Build a new Doc from the remaining words and restore their annotations
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array(attrs, np_array)
    return doc2

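# Load an English model (spaCy 2.x also accepted the shortcut name 'en';
# spaCy 3 requires the full package name)
nlp = spacy.load('en_core_web_sm')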
doc = nlp(u'This document is only an example. \
I would like to create a custom pipeline that will remove specific tokens from \
the final document.')
print(remove_tokens_on_match(doc))
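
One thing to watch for: building the new Doc from words alone does not carry over the original spacing. If that matters, the `spaces` argument of `Doc` can be filled from each token's `whitespace_`. Below is a minimal sketch of that idea, not a definitive implementation. The helper name `filter_doc` and the example rule are made up for illustration, and it uses `spacy.blank("en")` with the lexical flags `is_punct` and `like_num` so that it runs without a downloaded model. It only rebuilds the text, so annotations such as POS would still need to be copied, for example with the array approach above.

```python
import spacy
from spacy.tokens import Doc

def filter_doc(doc, keep):
    """Build a new Doc from the tokens for which keep(token) is true.

    Hypothetical helper: only words and spacing are carried over,
    not annotations such as POS tags or entities.
    """
    words, spaces = [], []
    for tok in doc:
        if keep(tok):
            words.append(tok.text)
            spaces.append(bool(tok.whitespace_))
        elif spaces and tok.whitespace_:
            # A dropped token was followed by a space: give that space
            # to the previous kept token so words do not run together.
            spaces[-1] = True
    return Doc(doc.vocab, words=words, spaces=spaces)

# Tokenizer-only pipeline; no statistical model is required for this rule
nlp = spacy.blank("en")
doc = nlp("This document has 3 examples, more or less.")
filtered = filter_doc(doc, lambda tok: not (tok.is_punct or tok.like_num))
print(filtered.text)
```

The result is a regular Doc, so `filtered.text` and iteration over its tokens behave as usual.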

You can also look at a similar question that I answered here.