I'm trying to scrape a really long web page with beautifulsoup4 and Python 3. Because of the size of the page, http.client throws an error when I try to read the response:
File "/anaconda3/lib/python3.6/http/client.py", line 456, in read return self._readall_chunked() File "/anaconda3/lib/python3.6/http/client.py", line 570, in _readall_chunked raise IncompleteRead(b''.join(value)) http.client.IncompleteRead: IncompleteRead(16109 bytes read)
Is there any way to get around this error?
As the docs for http.client tell you right at the top, this is a very low-level library, meant primarily to support urllib, and:

See also: The Requests package is recommended for a higher-level HTTP client interface.
If you can conda install requests or pip install requests, your problem becomes trivial:
import requests
from bs4 import BeautifulSoup

# requests decodes the chunked transfer encoding for you
req = requests.get('https://www.worldcubeassociation.org/results/events.php'
                   '?eventId=222&regionId=&years=&show=All%2BPersons&average=Average')
soup = BeautifulSoup(req.text, 'lxml')
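If the server really is cutting the transfer short, requests will typically surface that as requests.exceptions.ChunkedEncodingError rather than IncompleteRead, which you can catch and retry. A minimal sketch (the retry count and timeout here are arbitrary choices, not anything from the original):

import requests
from requests.exceptions import ChunkedEncodingError
from bs4 import BeautifulSoup

url = ('https://www.worldcubeassociation.org/results/events.php'
       '?eventId=222&regionId=&years=&show=All%2BPersons&average=Average')

for attempt in range(3):  # retry a few times before giving up
    try:
        req = requests.get(url, timeout=60)
        break
    except ChunkedEncodingError:
        continue  # truncated chunked body; try again
else:
    raise RuntimeError('page kept coming back truncated')

soup = BeautifulSoup(req.text, 'lxml')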
If you can't install a third-party library, working around this is possible, but not actually supported, and not easy. None of the chunk-handling code in http.client is public or documented, but the docs do link you to the source, where you can see the private methods. In particular, notice that read calls a method named _readall_chunked, which loops over _get_chunk_left and passes each chunk's size to a _safe_read method. That _safe_read method is the code you'll need to replace (e.g., by subclassing HTTPResponse, or by monkeypatching it) to work around this problem, as in the sketch below. Which probably isn't going to be nearly as easy or fun as just using a higher-level library.
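For illustration only, here is a minimal sketch of the monkeypatching route. It assumes a truncated body is acceptable to you, and it patches _readall_chunked, the caller that already accumulates the chunks _safe_read returns, rather than _safe_read itself; since both methods are private and undocumented, this can break on any Python release:

import http.client

_original_readall_chunked = http.client.HTTPResponse._readall_chunked

def _tolerant_readall_chunked(self):
    # Fall back to whatever bytes arrived before the stream was cut
    # short, instead of propagating IncompleteRead.
    try:
        return _original_readall_chunked(self)
    except http.client.IncompleteRead as e:
        return e.partial  # the bytes read before the connection broke

http.client.HTTPResponse._readall_chunked = _tolerant_readall_chunked

After this patch, something like urllib.request.urlopen(url).read() returns the truncated body instead of raising, and you can hand that to BeautifulSoup as before.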