Future exception was never retrieved

Charnel · Dec 14, 2016 · Viewed 8.9k times

I have a scraper (based on Python 3.4.2 and the asyncio/aiohttp libraries) and a bunch of links (> 10K) from which to retrieve small amounts of data. Part of the scraper code:

@asyncio.coroutine
def prepare(self, links):
    semaphore = asyncio.Semaphore(self.limit_concurrent)
    result = []

    tasks = [self.request_data(link, semaphore) for link in links]

    # Collect responses in completion order, keeping non-empty ones.
    for task in asyncio.as_completed(tasks):
        response = yield from task
        if response:
            result.append(response)
        task.close()
    return result

@asyncio.coroutine
def request_data(self, link, semaphore):

    ...

    # Cap the number of concurrent requests via the shared semaphore.
    with (yield from semaphore):
        while True:
            counter += 1
            if counter >= self.retry:
                break
            with aiohttp.Timeout(self.timeout):
                try:
                    response = yield from self.session.get(url, headers=self.headers)
                    body = yield from response.read()
                    break
                except asyncio.TimeoutError:
                    logging.warning('Timeout error getting {0}'.format(url))
                    return None
                except Exception:
                    return None
    ...

When it tries to make requests to malformed URLs, I get messages like this:

Future exception was never retrieved
future: <Future finished exception=gaierror(11004, 'getaddrinfo failed')>
Traceback (most recent call last):
  File "H:\Python_3_4_2\lib\concurrent\futures\thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "H:\Python_3_4_2\lib\socket.py", line 530, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11004] getaddrinfo failed

The error occurs when trying to yield the response from session.get. As I understand it, the exception was never consumed by asyncio, so it never "bubbled up".
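In other words, the warning means some future finished with an exception that nothing ever yielded from or inspected via .exception(). A standalone sketch along these lines (boom() is just a made-up stand-in for the failing getaddrinfo call) reproduces the same warning:

import asyncio

def boom():
    # Stand-in for the getaddrinfo call that fails inside the
    # default thread-pool executor.
    raise OSError('getaddrinfo failed')

loop = asyncio.get_event_loop()
# run_in_executor returns an asyncio.Future; abandon it without
# ever yielding from it or calling .exception() on it.
future = loop.run_in_executor(None, boom)
future = None
loop.run_until_complete(asyncio.sleep(0.5))
loop.close()
# When the abandoned future is garbage-collected, asyncio logs:
# "Future exception was never retrieved"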

First I tried simply wrapping the request in a try/except:

try:
    response = yield from self.session.get(url, headers=self.headers)
except Exception:
    return None

This didn't work.

Then I read here about chaining coroutines to catch exceptions, but that didn't work for me either. I still get those messages, and the script crashes after a certain amount of time.
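By chaining I mean wrapping the whole request in an outer coroutine, roughly like this (a sketch, using the names from my code above):

@asyncio.coroutine
def safe_request(self, link, semaphore):
    # Outer coroutine in the chain: anything request_data raises
    # should surface here instead of escaping to the event loop.
    try:
        return (yield from self.request_data(link, semaphore))
    except Exception:
        return None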

So my question is: how can I handle this exception properly?

Answer

user7296055 · Dec 14, 2016

Not an answer to your question, but perhaps a solution to your problem, depending on whether you just want to get the code working.

I would validate the URLs before requesting them. I've had a lot of headaches with this kind of thing trying to harvest data, so I decided to fix the URLs upfront and report the malformed ones to a log.
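For example, a minimal pre-filter using only the standard library could look like this (a sketch; which schemes you accept is up to you):

import logging
from urllib.parse import urlparse

def filter_valid_links(links):
    """Keep links that at least have an http(s) scheme and a host."""
    valid = []
    for link in links:
        parts = urlparse(link)
        if parts.scheme in ('http', 'https') and parts.netloc:
            valid.append(link)
        else:
            logging.warning('Malformed URL skipped: {0}'.format(link))
    return valid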

For a stricter check, you can use Django's regex or other code to do this, as it's publicly available.

In this question, someone gives the validation regex Django uses: Python - How to validate a url in python ? (Malformed or not)
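If I remember right, the regex posted there (taken from an older Django version, so check the linked answer for the current form) is roughly this:

import re

url_regex = re.compile(
    r'^(?:http|ftp)s?://'  # scheme
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # ...or localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or an IPv4 address
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def is_valid_url(url):
    return url_regex.match(url) is not None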