I'm trying to download a text file and save it on my desktop. The text is encoded in UTF-8 (the book says so), so when I save it without specifying an encoding, the text contains many strange characters. But when I try to save it with an explicit encoding, this error appears:
Traceback (most recent call last):
  File "C:/Users/Unidas/Semestre/ABC/8.1.py", line 14, in n_palabras
    libro.write(archivo.read())
TypeError: write() argument 1 must be unicode, not str
The code:
def n_palabras(x):
    import urllib2
    import io
    import string
    archivo = urllib2.urlopen(x)
    libro = io.open("alice.txt", "w", encoding="utf8")
    libro.write(archivo.read())
    libro.close()
How can I save this file with UTF-8 encoding? I'm using PyCharm with Python 2.7.
Your problem is that urlopen returns a bytes-oriented file-like object, while the text-mode file returned by io.open expects true text inputs (where "text" means unicode on Python 2, str on Python 3).
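The mismatch can be reproduced without any network access. The following is a minimal sketch, not the asker's code: the sample string and the temporary file path are assumptions made for illustration, and raw_bytes stands in for what archivo.read() returns.

```python
import io
import os
import tempfile

# Stand-in for archivo.read(): UTF-8 encoded bytes (illustrative sample data)
raw_bytes = u"caf\u00e9".encode("utf-8")

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with io.open(path, "w", encoding="utf-8") as f:
    try:
        f.write(raw_bytes)                # bytes: a text-mode file rejects them
        rejected = False
    except TypeError:
        rejected = True
    f.write(raw_bytes.decode("utf-8"))    # decoded text: accepted
```

Here rejected ends up True, and the file contains the same UTF-8 bytes that were decoded and written back.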
The only thing you need to change is to decode the result of calling read; it's bytes-like by default, and you need true text. You need to figure out the correct encoding (either hard-coding it, or explicitly inspecting the headers to figure it out) to decode it correctly (it's likely either UTF-8 or, much less likely, cp1252, but it could be something weird).
In any event, knowing that, the only change you'd need to make is to change:
libro.write(archivo.read())
to:
libro.write(archivo.read().decode(knownencoding))
If you're pretty sure the server is always providing UTF-8 output, then:
libro.write(archivo.read().decode('utf-8'))
is sufficient. Yes, it's mildly wasteful (you decode it only to write it to a stream that immediately reencodes it), but importantly, this gives you a guarantee that the bytes you received were interpretable as valid UTF-8, which dumping the raw bytes to disk won't guarantee.
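To illustrate that guarantee, here is a small sketch of what decoding catches. The helper name is_valid_utf8 and the sample byte strings are assumptions made up for this example; the second sample is "café" encoded as cp1252, which is not valid UTF-8.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

valid = u"caf\u00e9".encode("utf-8")   # proper UTF-8 bytes
invalid = b"caf\xe9"                   # cp1252 bytes; 0xE9 starts an incomplete UTF-8 sequence

valid_ok = is_valid_utf8(valid)
invalid_ok = is_valid_utf8(invalid)
```

Writing invalid straight to disk in binary mode would succeed silently, whereas the decode step fails fast with UnicodeDecodeError.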
A more elaborate solution inspects the headers:
import urllib2
import io
import string

def n_palabras(x):
    archivo = urllib2.urlopen(x)
    # Find charset in headers, if it exists
    for p in archivo.headers.plist:
        key, sep, value = p.partition('=')
        if sep and key.strip().lower() == 'charset':
            encoding = value.strip()
            break
    else:
        encoding = 'utf-8'
    data = archivo.read()
    try:
        # Try to use parsed charset
        data = data.decode(encoding)
    except UnicodeDecodeError:
        # If that fails, try UTF-8 as fallback; let exception bubble
        # if this fails too
        data = data.decode('utf-8')
    with io.open("alice.txt", "w", encoding="utf-8") as libro:
        libro.write(data)
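If you want to test the charset lookup without hitting a server, the same idea can be pulled out into a standalone function that works on a raw Content-Type header value. This is only a sketch: charset_from_content_type is a hypothetical helper name, not part of urllib2, and the quote stripping is an extra assumption about how some servers format the parameter.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    # Skip the media type itself; the remaining ';'-separated parts are parameters
    for param in content_type.split(";")[1:]:
        key, sep, value = param.partition("=")
        if sep and key.strip().lower() == "charset":
            # Some servers quote the value, so drop surrounding quotes
            return value.strip().strip("\"'")
    # No charset parameter found: fall back to the default
    return default

latin = charset_from_content_type("text/plain; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
```

The returned name can then be passed to decode in the same way as encoding in the function above.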