Converting plural to singular in a text file with Python

theintern · Jul 13, 2015 · Viewed 29.5k times

I have txt files that look like this:

word, 23
Words, 2
test, 1
tests, 4

And I want them to look like this:

word, 23
word, 2
test, 1
test, 4

I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:

import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f,'r') as a:
       a = a.read()
       a = a.lower()
       return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
    #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))

I have also tried these 2 definitions instead of the stem definition:

def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
            return word

Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:

word, 25
test, 5

I'm not sure how to do that. A solution would be nice but not necessary.

Answer

Albyorix · Dec 30, 2016

If you have complex words to singularize, I'd advise against stemming; use a proper Python package like pattern instead:

from pattern.text.en import singularize

plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
           'families', 'dogs', 'child', 'wolves']

singles = [singularize(plural) for plural in plurals]
print(singles)

This prints:

['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']

It's not perfect, but it's the best I've found. The docs report 96% accuracy: http://www.clips.ua.ac.be/pages/pattern-en#pluralization
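
For the second part of the question (merging duplicates and adding up their counts), here is a minimal sketch rather than a definitive solution. It assumes every non-empty line has the form "word, count". The names merge_counts, format_counts, rewrite_file and naive_singularize are made up for illustration. naive_singularize is only a stand-in that strips a trailing "s" so the example runs without extra packages; for real text, pass pattern's singularize as the second argument instead.

```python
def naive_singularize(word):
    """Hypothetical stand-in: drop one trailing 's' (but leave 'ss' endings alone)."""
    if len(word) > 1 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def merge_counts(lines, singularize=naive_singularize):
    """Sum the counts of all lines whose word has the same singular form."""
    totals = {}  # dicts keep insertion order on Python 3.7+
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma: "Words, 2" -> ("Words", ",", " 2")
        word, _, count = line.rpartition(",")
        key = singularize(word.strip().lower())
        totals[key] = totals.get(key, 0) + int(count)
    return totals


def format_counts(totals):
    """Turn {"word": 25} back into ["word, 25"]."""
    return ["%s, %d" % (word, count) for word, count in totals.items()]


def rewrite_file(path, singularize=naive_singularize):
    """Read a 'word, count' file, merge it, and write the result back in place."""
    with open(path) as src:
        totals = merge_counts(src, singularize)
    with open(path, "w") as dst:
        dst.write("\n".join(format_counts(totals)) + "\n")


sample = ["word, 23", "Words, 2", "test, 1", "tests, 4"]
merged = merge_counts(sample)
print("\n".join(format_counts(merged)))
# word, 25
# test, 5
```

To use pattern on a real file, something like rewrite_file("counts.txt", singularize) should work after the import shown above ("counts.txt" is a placeholder filename). Keeping the singularizer as a parameter means you can swap it out without touching the merging logic.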