How to iterate through the words of a string in Python?

ChamingaD · May 8, 2012 · Viewed 16.9k times

Assume I have a string text = "A compiler translates code from a source language". I want to do two things:

  1. I need to iterate through each word and stem it using the NLTK library. The stemming function is PorterStemmer().stem_word(word), which takes the word as its argument. How can I stem each word and get back the stemmed sentence?

  2. I need to remove certain stop words from the text string. The list of stop words is stored in a text file (space separated):

    stopwordsfile = open('c:/stopwordlist.txt','r+')
    stopwordslist=stopwordsfile.read()
    

    How can I remove those stop words from text and get a cleaned new string?

Answer

Gareth Latty · May 8, 2012

I posted this as a comment, but thought I might as well flesh it out into a full answer with some explanation:

You want to use str.split() to split the string into words, and then stem each word:

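stemmer = PorterStemmer()  # build the stemmer once and reuse it
for word in text.split(" "):
    stem = stemmer.stem_word(word)  # stem_word() returns the stem; use it here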
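
As a minimal, self-contained sketch of the loop above, here is a version that collects the stems in a list. Note that toy_stem is a hypothetical stand-in (it only strips a trailing "s") so the snippet runs without NLTK installed; in real code you would call the stemmer from the question instead. As far as I know, NLTK 3 names that method stem() rather than stem_word().

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word


text = "A compiler translates code from a source language"

# Split on whitespace and keep each stem instead of discarding it.
stems = []
for word in text.split():
    stems.append(toy_stem(word))

print(stems)
# ['A', 'compiler', 'translate', 'code', 'from', 'a', 'source', 'language']
```

Calling split() with no arguments splits on any run of whitespace, which is a little more forgiving than split(" ") if the text contains double spaces or tabs.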

As you want to get a string of all the stemmed words together, it's trivial to then join these stems back together. To do this easily and efficiently we use str.join() and a generator expression:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))

Edit:

For your other problem:

with open("/path/to/file.txt") as f:
    words = set(line.strip() for line in f)  # strip the trailing newline from each line

Here we open the file using the with statement (the best way to open files, as it closes them correctly even when an exception is raised, and is more readable) and read the contents into a set. We use a set because we don't care about the order of the words or about duplicates, and it makes the membership checks later more efficient. I am presuming one word per line. If that isn't the case and the words are comma separated or whitespace separated, then reading the file and using str.split() as we did before (with appropriate arguments) is probably a good plan.
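
If the stop-word file is whitespace separated, as in the question, one possible way to load it is sketched below. The helper name load_stop_words and the demo file contents are made up for illustration; the demo writes a throwaway temporary file so the snippet runs on its own.

```python
import os
import tempfile


def load_stop_words(path):
    """Read a stop-word file into a set.

    str.split() with no arguments splits on any run of whitespace, so this
    handles both one-word-per-line and space-separated files.
    """
    with open(path) as f:
        return set(f.read().split())


# Demo with a throwaway file; the contents are invented for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a from the\nof a\n")
    demo_path = tmp.name

stop_words = load_stop_words(demo_path)
os.remove(demo_path)

print(sorted(stop_words))  # ['a', 'from', 'of', 'the']
```

Because the result is a set, the duplicate "a" in the demo file is stored only once.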

stems = (PorterStemmer().stem_word(word) for word in text.split(" "))
" ".join(stem for stem in stems if stem not in words)

Here we use the if clause of a generator expression to ignore words that are in the set of words we loaded from a file. Membership checks on a set are O(1) on average, so this should be relatively efficient.

Edit 2:

To remove the words before they are stemmed, it's even simpler:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" ") if word not in words)

The removal of the given words is simply:

filtered_words = [word for word in unfiltered_words if word not in set_of_words_to_filter]
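
The order of filtering and stemming can change the result, because a stop list usually holds unstemmed words. The sketch below compares the two orderings. Both toy_stem (a hypothetical stand-in that only strips a trailing "s", used so the snippet runs without NLTK) and the stop-word set are assumptions made for illustration.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word


text = "A compiler translates code from a source language"
stop_words = {"a", "from", "translates"}  # made-up list for illustration

# Filter first, then stem: words are compared in their original form.
filter_then_stem = " ".join(
    toy_stem(word) for word in text.split() if word not in stop_words
)

# Stem first, then filter: the stems are compared against the stop list.
stems = (toy_stem(word) for word in text.split())
stem_then_filter = " ".join(stem for stem in stems if stem not in stop_words)

print(filter_then_stem)  # A compiler code source language
print(stem_then_filter)  # A compiler translate code source language
```

With stemming first, "translates" becomes "translate" and no longer matches the stop list, so it survives. The capitalised "A" survives in both cases because the membership check is case-sensitive; lowercasing each word before the check would be one way around that, if it suits your data.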