Split \xef\xbb\xbf in a list read from a file

Bakke Medina · Dec 16, 2015 · Viewed 10.4k times

I'm trying to read a large data file, file.txt, and strip out commas, periods, and other punctuation, so I read the file with this code in Python:

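import re
from nltk.corpus import stopwords

file = open("file.txt", "r")
spanish_stopwords = set(stopwords.words('spanish'))  # build once, not per word
importantWords = []
for i in file.readlines():
    line = i.rstrip("\n").split(" ")
    for word in line:
        # strip punctuation; '-' goes last so it is a literal, not a range
        word = re.sub(r'[!@#$%^&*/,.;:-]', '', word)
        word = word.lower()  # str.lower() returns a new string
        if word not in spanish_stopwords:
            importantWords.append(word)
file.close()
print importantWords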

and it printed ['\xef\xbb\xbfdataText1', 'dataText2' .. 'dataTextn'].

How can I clean that \xef\xbb\xbf? I'm using Python 2.7.

Answer

falsetru · Dec 16, 2015

That's the UTF-8 encoded BOM (byte order mark).

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
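
As a rough sketch of what that means in practice, the constant can be used to detect and remove the prefix from a byte string that has already been read. The helper name strip_utf8_bom is hypothetical and is not part of the standard library:

```python
import codecs

def strip_utf8_bom(raw):
    """Return raw without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# The first word from the question, as a byte string
first = strip_utf8_bom(b"\xef\xbb\xbfdataText1")
# A word without a BOM is returned unchanged
second = strip_utf8_bom(b"dataText2")
```

This operates on byte strings, so it applies to a Python 2.7 str as well as to Python 3 bytes.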

You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...
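
A minimal, self-contained sketch of the same idea, assuming a small sample file written to a temporary path (the file contents and the read_words helper are made up for illustration). Note that codecs.open with an encoding returns decoded text, so in Python 2.7 the words come back as unicode objects rather than str:

```python
import codecs
import os
import tempfile

def read_words(path):
    """Return whitespace-separated words from a UTF-8 file, ignoring a leading BOM."""
    words = []
    # "utf-8-sig" decodes UTF-8 and drops the BOM if the file starts with one
    with codecs.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            words.extend(line.split())
    return words

# Build a sample file that starts with a BOM (hypothetical content)
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + b"dataText1 dataText2\n")

words = read_words(sample_path)
os.remove(sample_path)
print(words)  # the first word no longer carries the \xef\xbb\xbf prefix
```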

SIDENOTE: Instead of calling file.readlines, just iterate over the file. file.readlines creates an unnecessary temporary list if all you want is to iterate over the lines.
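
The difference can be sketched as follows, using an in-memory io.StringIO object as a stand-in for a real file (the sample text is made up):

```python
import io

# In-memory stand-in for an open text file, so the sketch runs anywhere
sample = io.StringIO(u"uno dos\ntres\n")

# readlines() reads everything and builds the full list before the loop starts
all_lines = sample.readlines()

# Rewind, then iterate directly: lines are produced one at a time,
# with no temporary list of the whole file
sample.seek(0)
word_count = 0
for line in sample:
    word_count += len(line.split())
```

Both approaches visit the same lines; the second simply avoids holding all of them in memory at once, which matters for a large file.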