python requests.get() returns improperly decoded text instead of UTF-8?

arunk2 · May 26, 2017

When the server's response header is 'Content-Type: text/html', the text returned by requests.get() is improperly decoded.

However, if the header explicitly declares the charset, as in 'Content-Type: text/html; charset=utf-8', the text is decoded properly.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?

Answer

bubak · Oct 2, 2018

The "educated guesses" (mentioned above) are probably just a check of the Content-Type header sent by the server (a rather misleading use of "educated", imho).

For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4, and the default RFC 2616 gives for text/* types without a charset), regardless of any content analysis (i.e. even though the default for HTML5 is UTF-8).

For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
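To see why the ISO-8859-1 fallback produces garbage, here is a minimal sketch using only the standard library. The sample body is made up for illustration; it stands in for the raw bytes of a UTF-8 page served without a charset in its header.

```python
# Hypothetical response body: UTF-8 bytes, as a server would send them.
body = "<p>café</p>".encode("utf-8")

# Decoding with the ISO-8859-1 fallback (header had no charset).
wrong = body.decode("iso-8859-1")

# Decoding with the charset the page actually uses.
right = body.decode("utf-8")

print(wrong)  # <p>cafÃ©</p>
print(right)  # <p>café</p>
```

Each non-ASCII character occupies two or more bytes in UTF-8, and ISO-8859-1 maps every byte to its own character, which is where the typical "Ã©" sequences come from.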

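Luckily for us, requests ships with a character-detection library (chardet; newer releases may use charset_normalizer instead), and that usually works quite well. Its guess is exposed as the attribute requests.Response.apparent_encoding, so you usually want to do: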

import requests

r = requests.get("https://martin.slouf.name/")
# override the header-derived encoding with the detector's guess, which is based on the body bytes
r.encoding = r.apparent_encoding
# access the decoded text
r.text
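Detection can guess wrong, so you may prefer to override the encoding only when the server did not declare a charset. The following is a rough sketch of that idea, not part of requests itself. The helper names declared_charset and choose_encoding are hypothetical, and the header is parsed with the standard library's email.message.Message.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def choose_encoding(content_type, header_encoding, apparent_encoding):
    """Keep an explicitly declared charset; otherwise use the detected one."""
    if declared_charset(content_type) is not None:
        return header_encoding
    # No charset in the header: trust detection if it produced a result.
    return apparent_encoding or header_encoding
```

With a response r, this could be applied as r.encoding = choose_encoding(r.headers.get("Content-Type", ""), r.encoding, r.apparent_encoding) before reading r.text.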