How to extend the stopword list from NLTK and remove stop words with the extended list?

jxn · Mar 26, 2015 · Viewed 10.6k times

I have tried two ways of removing stopwords, and I run into issues with both:

Method 1:

from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)

In this case, only the first call (remove) works; the second one (remove2) doesn't.

Method 2:

lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print words # list of words

The output looks like this: ["Hello", "Good", "day"]

I then try to remove the stopwords from words. This is my code:

for word in words:
    if word in cachedStopWords:
        continue
    else:
        new_words='\n'.join(word)

print new_words

The output looks like this:

H
e
l
l
o

I can't figure out what is wrong with the two methods above. Please advise.

Answer

Akash Kandpal · Jul 6, 2018

Use this to extend the stopword list:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stop_words))

Output:

179

184
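
To then remove stop words with the extended list, something along these lines should work. This is only a minimal sketch: remove_stopwords, base_stop_words and extra_stop_words are names made up for illustration, and the short base list is a stand-in for stopwords.words('english') so the snippet runs without the NLTK data. Note that in the loop from the question, '\n'.join(word) joins the characters of a single string and new_words is reassigned on every pass, which is why the letters come out one per line. Collecting the kept words in a list and joining once at the end avoids that.

```python
def remove_stopwords(words, stop_words, extra=()):
    """Return the words that are in neither stop_words nor extra (case-insensitive)."""
    # A set gives fast membership tests; lowercasing lets "The" match "the".
    blocked = {w.lower() for w in stop_words} | {w.lower() for w in extra}
    # Keep whole words in a list instead of joining each word on its own.
    return [w for w in words if w.lower() not in blocked]


# Stand-in for stopwords.words('english') (assumption, so no NLTK data is needed).
base_stop_words = ["the", "is", "a", "with", "from"]
# Extra words to filter out, in the spirit of stop_words.extend([...]) above.
extra_stop_words = ["subject", "re", "edu", "use"]

words = ["Hello", "from", "the", "Good", "day", "subject"]
new_words = remove_stopwords(words, base_stop_words, extra_stop_words)

# Join the list of kept words once, after filtering.
print("\n".join(new_words))
```

With NLTK installed and the stopwords corpus downloaded, you would pass stopwords.words('english') as stop_words instead of the stand-in list.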