Lemmatization of a list of words

minks · Dec 3, 2015 · Viewed 9.9k times

So I have a list of words in a text file. I want to lemmatize them so that different inflected forms of the same word, like try and tried, collapse to one base form. When I do this, I keep getting TypeError: unhashable type: 'list':

    from nltk.stem import WordNetLemmatizer

    results = []
    with open('/Users/xyz/Documents/something5.txt', 'r') as f:
        for line in f:
            results.append(line.strip().split())

    lemma = WordNetLemmatizer()

    lem = []

    for r in results:
        lem.append(lemma.lemmatize(r))

    with open("lem.txt", "w") as t:
        for item in lem:
            print >> t, item

How do I lemmatize words which are already tokens?

Answer

Mike Robins · Dec 3, 2015

The method WordNetLemmatizer.lemmatize expects a single string, but you are passing it a list of strings. That is what raises the TypeError exception.
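You can reproduce the error in isolation (a quick sketch, assuming NLTK and its WordNet data are installed):

>>> from nltk.stem import WordNetLemmatizer
>>> lemma = WordNetLemmatizer()
>>> lemma.lemmatize('words')
'word'
>>> lemma.lemmatize(['words'])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'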

The result of line.strip().split() is a list of strings, which you are appending as a whole to results, i.e. results becomes a list of lists.
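The difference between append and extend is easy to see in the interpreter:

>>> results = []
>>> results.append('try tried'.split())
>>> results
[['try', 'tried']]
>>> results = []
>>> results.extend('try tried'.split())
>>> results
['try', 'tried']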

You want to use results.extend(line.strip().split()) instead:

from nltk.stem import WordNetLemmatizer

results = []
with open('/Users/xyz/Documents/something5.txt', 'r') as f:
    for line in f:
        # extend flattens the words into results; append would nest them
        results.extend(line.strip().split())

lemma = WordNetLemmatizer()

# lemmatize each word individually
lem = map(lemma.lemmatize, results)

with open("lem.txt", "w") as t:
    for item in lem:
        print >> t, item
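Note that print >> t, item is Python 2 syntax. On Python 3, where map returns a lazy iterator (fine here, since it is consumed once), the last block would be:

with open("lem.txt", "w") as t:
    for item in lem:
        print(item, file=t)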

Or, refactored without the intermediate results list:

from nltk.stem import WordNetLemmatizer

def words(fname):
    # generate one word at a time, never holding the whole file in memory
    with open(fname, 'r') as document:
        for line in document:
            for word in line.strip().split():
                yield word

lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, words('/Users/xyz/Documents/something5.txt'))
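One caveat for the try/tried goal: lemmatize treats every word as a noun by default, so past-tense verbs such as tried can pass through unchanged. Pass pos='v' to reduce verb inflections:

>>> lemma.lemmatize('tried')         # noun by default, left unchanged
'tried'
>>> lemma.lemmatize('are', pos='v')  # with the verb POS, inflections reduce
'be'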