How to generate bi/tri-grams using spacy/nltk

samol · Aug 31, 2016 · Viewed 13k times

The input text is always a list of dish names, each made up of 1 to 3 adjectives and a noun.

Inputs:

thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Outputs:

thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

Basically, I am looking to parse the sentence tree and try to generate bi-grams by pairing an adjective with the noun.

I would like to achieve this with spaCy or NLTK.

Answer

Petr Matuska · Feb 16, 2018

I used spaCy 2.0 with the English model. The idea is to tag each token in the input as a noun or a "not-noun", then join every not-noun with the noun to create the desired output.

Your input:

s = ["thai iced tea",
"spicy fried chicken",
"sweet chili pork",
"thai chicken curry",]

spaCy solution:

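import spacy

# Load the English model. 'en' is a spaCy 2.x shortcut link;
# spaCy 3+ requires the full package name, e.g. 'en_core_web_sm'.
nlp = spacy.load('en')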

def noun_notnoun(phrase):
    doc = nlp(phrase)  # run the pipeline to get POS-tagged tokens
    token_not_noun = []
    notnoun_noun_list = []
    noun = None  # stays None if no token is tagged NOUN

    for item in doc:
        if item.pos_ == "NOUN":
            noun = item.text  # only the last noun seen is kept
        else:
            token_not_noun.append(item.text)

    if noun is None:  # nothing to pair with
        return notnoun_noun_list

    # pair every non-noun token with the noun
    for notnoun in token_not_noun:
        notnoun_noun_list.append(notnoun + " " + noun)

    return notnoun_noun_list

Call function:

for phrase in s:
    print(noun_notnoun(phrase))

Results:

['thai tea', 'iced tea']
['spicy chicken', 'fried chicken']
['sweet pork', 'chili pork']
['thai chicken', 'curry chicken']
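
The function above keeps only one noun per phrase, so a phrase with two nouns such as "thai chicken curry" cannot produce all three pairs asked for in the question. One possible variation, shown here as a rough sketch rather than a tested drop-in, is to take every ordered pair of words and keep those whose second word is a noun. The helper name pair_with_nouns is made up for this example, and it works on plain (text, pos) tuples so it can be tried without a model. The tags in the demo calls are written by hand and are only an assumption about what a tagger would return.

```python
from itertools import combinations

def pair_with_nouns(tagged):
    """Return ordered two-word pairs whose second word is tagged NOUN.

    `tagged` is a list of (text, pos) tuples, for example built from a
    spaCy doc with [(t.text, t.pos_) for t in doc].
    """
    pairs = []
    # combinations() keeps the original word order inside each pair
    for (left, _), (right, right_pos) in combinations(tagged, 2):
        if right_pos == "NOUN":
            pairs.append(left + " " + right)
    return pairs

# Hand-written tags, assumed for illustration; a real tagger may disagree.
print(pair_with_nouns([("thai", "ADJ"), ("iced", "ADJ"), ("tea", "NOUN")]))
# ['thai tea', 'iced tea']
print(pair_with_nouns([("thai", "ADJ"), ("chicken", "NOUN"), ("curry", "NOUN")]))
# ['thai chicken', 'thai curry', 'chicken curry']
```

The output still depends entirely on the tags: if the model labels "curry" as something other than a noun, as in the results above, the pairs ending in "curry" are dropped.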