I’m having difficulty tokenizing a .txt file and eliminating stop words using nltk. I keep getting the following error: AttributeError: 'list' object has no attribute 'lower'.
I just can’t figure out what I’m doing wrong, although it’s my first time doing something like this. Below are my lines of code. I’d appreciate any suggestions, thanks.
import nltk
from nltk.corpus import stopwords
s = open("C:\zircon\sinbo1.txt").read()
tokens = nltk.word_tokenize(s)
def cleanupDoc(s):
    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(s)
    cleanup = [token.lower() for token in tokens.lower() not in stopset and len(token)>2]
    return cleanup
cleanupDoc(s)
You can use the stopwords lists from NLTK; see How to remove stop words using nltk or python. Most probably you would also like to strip off punctuation, which you can do with string.punctuation; see http://docs.python.org/2/library/string.html:
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
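Applied to the code in the question, a minimal sketch of cleanupDoc along those lines might look like the following (the file path and the len(token) > 2 filter are taken from the question; the key fix is to lowercase each individual token rather than calling .lower() on the token list):

import string
import nltk
from nltk.corpus import stopwords

def cleanupDoc(s):
    # Stop set: English stopwords plus punctuation tokens.
    stopset = set(stopwords.words('english') + list(string.punctuation))
    # word_tokenize expects a string, so tokenize the raw text.
    tokens = nltk.word_tokenize(s)
    # Lowercase each token (a list has no .lower()), then filter.
    cleanup = [token.lower() for token in tokens
               if token.lower() not in stopset and len(token) > 2]
    return cleanup

s = open(r"C:\zircon\sinbo1.txt").read()
print(cleanupDoc(s))

Note the raw string (r"...") for the Windows path, so the backslashes are not treated as escape sequences.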