mycorpus.txt
Human where's machine interface for lab abc computer applications
A where's survey of user opinion of computer system response time
stopwords.txt
let's
ain't
there's
The following code
corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus
stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist
gives the following output
set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])
Why is the apostrophe turning into \x92 in the second set?
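Byte 92 (hex) in the windows-1252 encoding corresponds to Unicode code point 2019 (hex), which is 'RIGHT SINGLE QUOTATION MARK'. It looks very much like an apostrophe and is likely the actual character you have in stopwords.txt. Judging from the way Python has interpreted it, that file has been encoded in windows-1252, or in another encoding that shares its ASCII and ’ code point values. Python 2 reads the file as raw bytes without decoding them, so the character shows up as \x92 when the set is printed.
Compare ' (apostrophe) vs ’ (right single quotation mark).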
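To check this guess, one option is to decode the bytes explicitly and look up the character's Unicode name. The snippet below is a minimal sketch written for Python 3; the `raw` bytes are a stand-in for the contents of stopwords.txt (an assumption, since the real file is not available here), and replacing ’ with a plain apostrophe is just one possible way to make the stop words match the corpus.

```python
import unicodedata

# Stand-in for the bytes of stopwords.txt, assuming it was saved as windows-1252.
raw = b"let\x92s\nain\x92t\nthere\x92s\n"

# Decode explicitly instead of working with the raw bytes.
text = raw.decode("cp1252")

# The fourth character of "let\x92s" is the suspicious quote.
quote = text[3]
print(hex(ord(quote)), unicodedata.name(quote))

# Normalise the curly quote to a plain apostrophe while building the stop list.
stoplist = {
    line.strip().lower().replace("\u2019", "'")
    for line in text.splitlines()
}
print(sorted(stoplist))
```

When reading from the actual file, passing the encoding to the open call (for example `io.open(path, encoding="cp1252")`, which exists in both Python 2.6+ and Python 3) should give the same decoded text, provided the file really is windows-1252.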