UnicodeDecodeError when performing os.walk

Scott · Feb 14, 2014

I am getting the error:

'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

when trying to do os.walk. The error occurs because some of the files in a directory have the 0x8b (non-utf8) character in them. The files come from a Windows system (hence the utf-16 filenames), but I have copied the files over to a Linux system and am using python 2.7 (running in Linux) to traverse the directories.

I have tried passing a unicode start path to os.walk, and all the files and dirs it generates have unicode names until it comes to a non-utf8 name. For some reason it doesn't convert those names to unicode, and then the code chokes on the utf-16 names. Is there any way to solve the problem short of manually finding and changing all the offending names?

If there is not a solution in python2.7, can a script be written in python3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (by removing the non-utf8 chars)? N.B. there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.

UPDATE: The fact that 0x8b is still only a single byte (just not valid ascii) makes it even more puzzling. I have verified that there is a problem converting such a string to unicode, but that a unicode version can be created directly. To wit:

>>> test = 'a string \x8b with non-ascii'
>>> test
'a string \x8b with non-ascii'
>>> unicode(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in range(128)
>>> 
>>> test2 = u'a string \x8b with non-ascii'
>>> test2
u'a string \x8b with non-ascii'

Here's a traceback of the error I am getting:

80.         for root, dirs, files in os.walk(unicode(startpath)):
File "/usr/lib/python2.7/os.py" in walk
294.             for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
294.             for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
284.         if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py" in join
71.             path += '/' + b

Exception Type: UnicodeDecodeError at /admin/casebuilder/company/883/
Exception Value: 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)

The root of the problem occurs in the list of files returned from listdir (on line 276 of os.walk):

names = listdir(top)

The names containing undecodable bytes such as 0x8b are returned as non-unicode strings (str).

Answer

Will Rouesnel · Nov 17, 2014

Right, I just spent some time sorting through this error, and the wordier answers here aren't getting at the underlying issue:

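The problem is that if you pass a unicode string into os.walk(), os.listdir() returns unicode names, but any name it cannot decode with the filesystem encoding comes back as a plain byte string (str). When os.walk then joins the unicode directory path to one of those byte strings, Python 2 implicitly decodes the byte string as ASCII (hence the 'ascii' decode error), and a byte such as 0x8b makes it throw the exception.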
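The decode failure itself is easy to isolate. Python 2 decodes a byte string with the ascii codec whenever it is combined with a unicode string; the sketch below performs that decode explicitly, so it should behave the same on Python 2 and 3. The name raw_name is a hypothetical stand-in for an entry returned by os.listdir.

```python
# Explicit form of the decode that Python 2 performs implicitly when a
# unicode path and a byte-string name are concatenated.
raw_name = b"\x8b\x8bThis"  # hypothetical undecodable name from os.listdir

try:
    raw_name.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    # Same exception type and wording as in the traceback from os.walk.
    error_message = str(exc)

print(error_message)
```

The message names the same codec and the same offending byte as the error in the question; only the reported position differs, because here the bad byte is at the very start of the string.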

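The solution is to force the starting path you pass to os.walk to be a regular byte string, i.e. os.walk(str(somepath)). os.listdir then returns plain byte strings for every name, nothing is decoded implicitly, and everything works the way it should. (If somepath itself contains non-ASCII characters, str() will fail on it; encode it explicitly instead, e.g. somepath.encode('utf-8').)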

You can reproduce this problem (and show that its solution works) trivially like this:

  1. Go into bash in some directory and run touch $(echo -e "\x8b\x8bThis is a bad filename") which will make some test files.

  2. Now run the following Python code (iPython Qt is handy for this) in the same directory:

    import os

    l = []
    # A unicode start path makes os.walk mix unicode and byte-string names
    for root, dirs, filenames in os.walk(unicode('.')):
        l.extend([os.path.join(root, f) for f in filenames])
    print l
    

And you'll get a UnicodeDecodeError.

  3. Now try running:

    l = []
    # A byte-string start path keeps every name a byte string
    for root, dirs, filenames in os.walk('.'):
        l.extend([os.path.join(root, f) for f in filenames])
    print l
    

No error and you get a print out!

Thus the safe way in Python 2.x is to make sure you only pass byte strings (str) to os.walk(). You should not pass unicode, or anything that might be unicode, because os.walk will then choke when an implicit ASCII decode fails on an undecodable filename.
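The question also asked whether a Python 3 script could clean up the names. Here is a minimal sketch, assuming Python 3 on Linux and that simply dropping the invalid bytes is acceptable. Passing a bytes start path makes os.walk yield bytes names, so nothing is decoded behind your back. The function name strip_invalid_utf8_names is hypothetical, and name collisions are skipped rather than resolved.

```python
import os


def strip_invalid_utf8_names(start_path):
    """Rename entries under start_path whose names are not valid UTF-8.

    Invalid bytes are dropped. Returns a list of (old, new) byte paths.
    """
    renamed = []
    # A bytes start path makes os.walk yield bytes names on Python 3.
    # topdown=False visits children before their parent directories,
    # so renaming a directory never invalidates a path still to be walked.
    for root, dirs, files in os.walk(os.fsencode(start_path), topdown=False):
        for name in dirs + files:
            cleaned = name.decode("utf-8", errors="ignore").encode("utf-8")
            if cleaned == name or not cleaned:
                continue  # already valid, or nothing would be left
            old = os.path.join(root, name)
            new = os.path.join(root, cleaned)
            if os.path.lexists(new):
                continue  # skip rather than overwrite an existing entry
            os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

Run it on a copy of the tree first and inspect the returned list of (old, new) pairs, since two different bad names can collapse to the same cleaned name and one of them will then be left untouched.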