TypeError: write() argument 1 must be unicode, not str

Ana Cecília Vieira · Oct 10, 2018

I'm trying to download a text file and save it on my desktop. The text is UTF-8 (the book says so), so when I save it without specifying an encoding, it contains many strange characters. When I try to save it with an explicit encoding, this error appears:

Traceback (most recent call last):
  File "C:/Users/Unidas/Semestre/ABC/8.1.py", line 14, in n_palabras
    libro.write(archivo.read())
TypeError: write() argument 1 must be unicode, not str

The code:

def n_palabras(x):
    import urllib2
    import io
    import string

    archivo = urllib2.urlopen(x)
    libro = io.open("alice.txt", "w", encoding="utf8")
    libro.write(archivo.read())
    libro.close()

How can I save this file with UTF-8 encoding? I'm using PyCharm with Python 2.7.

Answer

ShadowRanger · Oct 10, 2018

Your problem is that urlopen returns a bytes-oriented file-like object, while io.open expects true text inputs (where "text" means "unicode on Python 2, str on Python 3").
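The mismatch can be reproduced without any network access. The following is a minimal sketch rather than the asker's program: it assumes a throwaway temporary file (the name demo.txt is made up for illustration) and uses a bytes literal to stand in for what archivo.read() returns. It should behave the same way on Python 2.7 and Python 3, although the wording of the error message differs between them.

```python
import io
import os
import tempfile

# Stand-in for archivo.read(): raw bytes as they arrive from the network.
raw = b"Alice was beginning to get very tired"

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with io.open(path, "w", encoding="utf-8") as libro:
    try:
        libro.write(raw)               # bytes into a text-mode file
        rejected = False
    except TypeError:
        rejected = True                # same kind of failure as in the question
    libro.write(raw.decode("utf-8"))   # decoded text is accepted

with io.open(path, "r", encoding="utf-8") as libro:
    contents = libro.read()
```

The first write is rejected with a TypeError and leaves nothing in the file; only the decoded text ends up on disk.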

The only thing you need to change is to decode the result of calling read; it's bytes-like by default, and you need true text. You need to figure out the correct encoding (either hard-coding it, or explicitly inspecting the headers to figure it out) to decode it correctly (it's likely either UTF-8 or, much less likely, cp1252, but it could be something weird).

In any event, knowing that, the only change you'd need to make is to change:

libro.write(archivo.read())

to:

libro.write(archivo.read().decode(knownencoding))

If you're pretty sure the server is always providing UTF-8 output, then:

libro.write(archivo.read().decode('utf-8'))

is sufficient. Yes, it's mildly wasteful (you decode it only to write it to a stream that immediately reencodes it), but importantly, this gives you a guarantee that the bytes you received were interpretable as valid UTF-8, which dumping the raw bytes to disk won't guarantee.
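To illustrate that guarantee, here is a small sketch. The sample byte strings and the helper name is_valid_utf8 are invented for the example; they imitate a server that sends cp1252 bytes when UTF-8 was expected. Decoding as UTF-8 fails loudly on such input instead of letting it reach the file unnoticed.

```python
# Sample payloads (made up for illustration): the same word, "cafe" with
# an accented final e, encoded two different ways.
utf8_bytes = u"caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
cp1252_bytes = u"caf\u00e9".encode("cp1252")   # b'caf\xe9'


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ok = is_valid_utf8(utf8_bytes)          # well-formed UTF-8
bad = is_valid_utf8(cp1252_bytes)       # a lone 0xE9 byte is not valid UTF-8
recovered = cp1252_bytes.decode("cp1252")
```

Writing cp1252_bytes to a file opened in binary mode would succeed silently, which is exactly the case the decode step protects against.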

A more elaborate solution inspects the headers:

import urllib2
import io
import string

def n_palabras(x):
    archivo = urllib2.urlopen(x)

    # Find charset in headers, if it exists
    for p in archivo.headers.plist:
        key, sep, value = p.partition('=')
        if sep and key.strip().lower() == 'charset':
            # The value may be quoted (charset="utf-8"), so strip quotes too
            encoding = value.strip().strip('"')
            break
    else:
        encoding = 'utf-8'

    data = archivo.read()

    try:
        # Try to use parsed charset
        data = data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # If that fails (bad bytes, or a charset name Python does not
        # recognize), try UTF-8 as fallback; let exception bubble
        # if this fails too
        data = data.decode('utf-8')

    with io.open("alice.txt", "w", encoding="utf-8") as libro:
        libro.write(data)
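The charset lookup can also be tried in isolation. The sketch below uses a hypothetical helper, charset_from_content_type, that works on a raw Content-Type string instead of archivo.headers, so it can be exercised without a network connection. It is only an approximation: a plain split on ';' does not cover every quoting rule that HTTP allows.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default.

    Hypothetical helper for illustration only; it is not part of urllib2.
    """
    # Skip the media type itself (e.g. "text/plain"); the rest are parameters.
    for param in content_type.split(";")[1:]:
        key, sep, value = param.partition("=")
        if sep and key.strip().lower() == "charset":
            # Servers may quote the value: charset="utf-8"
            return value.strip().strip('"')
    return default
```

For example, charset_from_content_type("text/plain; charset=ISO-8859-1") returns "ISO-8859-1", while a header with no charset parameter falls back to the default.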