Retrieve multiple URLs at once / in parallel

DominiCane · Aug 20, 2010 · Viewed 12.2k times

Possible Duplicate:
How can I speed up fetching pages with urllib2 in python?

I have a Python script that downloads a web page, parses it, and returns some value from the page. I need to scrape a few such pages to get the final result. Each page retrieval takes a long time (5-10 s), and I'd prefer to make the requests in parallel to cut down the wait.
The question is: which mechanism will do this quickly, correctly, and with minimal CPU/memory waste? Twisted, asyncore, threading, something else? Could you provide some links with examples?
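For context, the sequential flow looks roughly like this (simplified; extract_value and list_of_urls stand in for my actual parsing code and input):

import urllib2

def fetch_value(url):
    # Download one page and pull the needed value out of it
    html = urllib2.urlopen(url).read()
    return extract_value(html)  # placeholder for the real parsing

# Each call blocks for 5-10 s, so total time grows with the number of pages
results = [fetch_value(url) for url in list_of_urls]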
Thanks

UPD: There are already a few solutions to this problem; I'm looking for the right compromise between speed and resource usage. If you could share some details from your experience (how fast it is under load in your view, etc.), that would be very helpful.

Answer

pygabriel · Aug 20, 2010

multiprocessing.Pool can be a good deal; there are some useful examples. For instance, if you have a list of URLs, you can map the content retrieval concurrently:

import multiprocessing
import urllib2

def process_url(url):
    # Download the page and extract whatever value you need from it
    content = urllib2.urlopen(url).read()
    return content  # replace with your parsed value

pool = multiprocessing.Pool(processes=4)  # how much parallelism?
results = pool.map(process_url, list_of_urls)
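
As a side note, fetching URLs is I/O-bound rather than CPU-bound, so a thread-backed pool can work just as well as separate processes; multiprocessing.dummy provides the same Pool/map interface backed by threads. A minimal sketch, reusing list_of_urls from above:

import urllib2
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, backed by threads

def process_url(url):
    # Fetch the page; real parsing/extraction would go here
    return urllib2.urlopen(url).read()

pool = Pool(4)  # number of concurrent downloads
results = pool.map(process_url, list_of_urls)
pool.close()
pool.join()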