Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

user168983 picture user168983 · Dec 26, 2014 · Viewed 18k times · Source

I am trying to do the following kaggle assignmnet. I am using gensim package to use word2vec. I am able to create the model and store it to disk. But when I am trying to load the file back I am getting the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I find similar question. But I was unable to solve the problem. My prog_w2v.py is as below.

import gensim
import time
start = time.time()    
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
end = time.time()   
print end-start,"   seconds"

I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.

Answer

batgirl picture batgirl · May 12, 2015

You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is used when you train a model using the C code, and save the model in a binary format. However you are not saving the model in a binary format, and are training it using python. So you can simply use the following code and it should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')