Scrapy crawl from script always blocks script execution after scraping

Eugene Nagorny · Feb 8, 2013

I am following this guide, http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script, to run Scrapy from my script. Here is part of my script:

    crawler = Crawler(Settings(settings))
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
    print "It can't be printed out!"

It works as it should: it visits pages, scrapes the needed info, and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see this from the number in the output JSON), execution of my script does not resume. This probably isn't a Scrapy problem, and the answer should lie somewhere in Twisted's reactor. How can I release thread execution?

Answer

Steven Almeroth · Feb 10, 2013

You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:

    from twisted.internet import reactor

    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.xlib.pydispatch import dispatcher

    from testspiders.spiders.followall import FollowAllSpider

    def stop_reactor():
        # Called when the spider closes; this is what lets reactor.run() return
        reactor.stop()

    # Register the handler before the crawl starts
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')

And the command line log output might look something like:

    stav@maia:/srv/scrapy/testspiders$ ./api
    2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
    2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
    2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 23934,...}
    2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
    2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
    stav@maia:/srv/scrapy/testspiders$
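
For anyone who wants to see the mechanism in isolation, here is a minimal sketch of the same pattern: a blocking event loop that only returns once a callback stops it. It uses Python 3's standard-library asyncio as a stand-in for Twisted's reactor, purely so the example runs without Scrapy or Twisted installed. The names run_until_closed and on_spider_closed, and the timer that plays the role of the spider_closed signal, are hypothetical and not part of the Scrapy or Twisted API.

```python
import asyncio


def run_until_closed(delay=0.05):
    """Run an event loop that blocks until a callback stops it.

    Stand-in for the Twisted pattern: loop.run_forever() plays the role
    of reactor.run(), and loop.stop() plays the role of reactor.stop().
    """
    events = []
    loop = asyncio.new_event_loop()

    def on_spider_closed():
        # Hypothetical handler: in the Scrapy script this would be the
        # function connected to the spider_closed signal.
        events.append("spider closed")
        loop.stop()

    # Assumption: a timer simulates the moment the spider finishes.
    loop.call_later(delay, on_spider_closed)

    events.append("running loop")
    loop.run_forever()  # blocks here until loop.stop() is called
    events.append("loop stopped")

    loop.close()
    return events


if __name__ == "__main__":
    for line in run_until_closed():
        print(line)
```

If the loop.stop() call is removed from the callback, run_forever() never returns, which is the same behaviour the question describes for reactor.run().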