http.client.IncompleteRead error in Python3

Evan Hsueh · Jul 7, 2018 · Viewed 7.3k times

I'm trying to scrape a really long web page with beautifulsoup4 and python3. Because of the page's size, http.client throws an error when I try to search for something in it:

File "/anaconda3/lib/python3.6/http/client.py", line 456, in read return self._readall_chunked() File "/anaconda3/lib/python3.6/http/client.py", line 570, in _readall_chunked raise IncompleteRead(b''.join(value)) http.client.IncompleteRead: IncompleteRead(16109 bytes read)

Is there any way to get around this error?

Answer

abarnert · Jul 7, 2018

As the docs for http.client tell you right at the top, this is a very low-level library, meant primarily to support urllib, and:

See also: The Requests package is recommended for a higher-level HTTP client interface.

If you can conda install requests or pip install requests, your problem becomes trivial:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.worldcubeassociation.org/results/events.php?eventId=222&regionId=&years=&show=All%2BPersons&average=Average')
soup = BeautifulSoup(req.text, 'lxml')

If you can't install a third-party library, working around this is possible, but not actually supported, and not easy. None of the chunk-handling code in http.client is public or documented, but the docs do link you to the source, where you can see the private methods. In particular, notice that read calls a method named _readall_chunked, which loops over _get_chunk_left, calling a _safe_read method on each chunk. That _safe_read method is the code you'll need to replace (e.g., by subclassing HTTPResponse, or monkeypatching it) to work around this problem. Which probably isn't going to be nearly as easy or fun as just using a higher-level library.
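If you do go that route, here's a minimal sketch of the monkeypatching variant (my example, not anything http.client supports): instead of reimplementing _safe_read, it wraps HTTPResponse.read so that an IncompleteRead hands back whatever partial body arrived rather than raising. The URL and the 'lxml' parser are just carried over from your code; whether a truncated body is actually usable for your scrape is a separate question.

import http.client
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Keep a reference to the original method, then swallow IncompleteRead
# and return the partial data it carries.
_original_read = http.client.HTTPResponse.read

def _patched_read(self, amt=None):
    try:
        return _original_read(self, amt)
    except http.client.IncompleteRead as e:
        return e.partial  # the bytes that did make it through

http.client.HTTPResponse.read = _patched_read

resp = urlopen('https://www.worldcubeassociation.org/results/events.php?eventId=222&regionId=&years=&show=All%2BPersons&average=Average')
soup = BeautifulSoup(resp.read(), 'lxml')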