CrawlerProcess vs CrawlerRunner

Question 1

CrawlerProcess vs CrawlerRunner

python web-scraping scrapy

alecxe · Sep 26, 2016 · Viewed 10.1k times · Source

Answer

Answer

Scrapy's documentation does a pretty bad job at giving examples on real applications of both.

CrawlerProcess assumes that scrapy is the only thing that is going to use twisted's reactor. If you are using threads in python to run other code this isn't always true. Let's take this as an example.

from scrapy.crawler import CrawlerProcess
import scrapy
def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...
class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
notThreadSafe(3) # it will get executed when the crawlers stop

Now, as you can see, the function will only get executed when the crawlers stop, what if I want the function to be executed while the crawlers crawl in the same reactor?

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy

def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.callFromThread(notThreadSafe, 3)
reactor.run() #it will run both crawlers and code inside the function

The Runner class is not limited to this functionality, you may want some custom settings on your reactor (defer, threads, getPage, custom error reporting, etc)

Question 2

Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:

using CrawlerProcess
using CrawlerRunner

What is the difference between the two? When should I use "process" and when "runner"?

CrawlerProcess vs CrawlerRunner

Answer

Related questions