I tried to read a big data file (file.txt) and strip out commas, periods, etc., so I read the file with this code in Python:
import re
from nltk.corpus import stopwords

file = open("file.txt", "r")
importantWords = []
for i in file.readlines():
    line = i[:-1].split(" ")
    for word in line:
        word = re.sub(r'[!@#$%^&*\-/,.;:]', '', word)
        word = word.lower()
        if word not in stopwords.words('spanish'):
            importantWords.append(word)
print importantWords
and it printed ['\xef\xbb\xbfdataText1', 'dataText2', ..., 'dataTextn']. How can I clean up that \xef\xbb\xbf? I'm using Python 2.7.
That's the UTF-8-encoded BOM (byte order mark):
>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
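A quick way to see the difference (the byte string here is just illustrative data): decoding with 'utf-8-sig' drops a leading BOM, while plain 'utf-8' keeps it as U+FEFF:

```python
import codecs

# A byte string that starts with the UTF-8 BOM (illustrative data).
raw = codecs.BOM_UTF8 + b'dataText1'

# 'utf-8-sig' consumes the BOM; plain 'utf-8' decodes it to U+FEFF.
assert raw.decode('utf-8-sig') == u'dataText1'
assert raw.decode('utf-8') == u'\ufeffdataText1'
```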
You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:
with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...
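Putting that together with the loop from the question — a sketch that writes its own sample file (the name sample.txt and its contents are hypothetical) and leaves out the NLTK stopword filtering for brevity:

```python
import codecs
import re

# Create a sample file that starts with a UTF-8 BOM (hypothetical data).
with open("sample.txt", "wb") as f:
    f.write(codecs.BOM_UTF8 + b"dataText1, dataText2 dataText3.\n")

importantWords = []
# 'utf-8-sig' consumes the BOM, so the first word comes out clean.
with codecs.open("sample.txt", "r", encoding="utf-8-sig") as f:
    for line in f:                         # iterate directly, no readlines()
        for word in line.strip().split(" "):
            word = re.sub(r'[!@#$%^&*\-/,.;:]', '', word).lower()
            if word:
                importantWords.append(word)
```

With 'utf-8-sig' the first element is u'datatext1', with no \xef\xbb\xbf prefix, and you could then re-add the stopwords.words('spanish') check from the question.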
SIDENOTE: Instead of calling file.readlines, just iterate over the file object directly. file.readlines builds an unnecessary temporary list when all you want is to loop over the lines.
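To make the sidenote concrete — a small sketch (the file name lines.txt and its contents are made up): both forms yield the same lines, but readlines materialises the whole file in memory at once, while direct iteration streams one line at a time, which matters for big files:

```python
import codecs

# Hypothetical three-line file.
with open("lines.txt", "wb") as f:
    f.write(b"a\nb\nc\n")

# readlines() builds the whole list up front...
with codecs.open("lines.txt", "r", encoding="utf-8-sig") as f:
    all_lines = f.readlines()      # entire file in memory

# ...while iterating yields one line at a time.
streamed = []
with codecs.open("lines.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        streamed.append(line)

assert streamed == all_lines
```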