Using PhantomJS for dynamic content with Scrapy and Selenium: possible race condition

rocktheartsm4l · Jul 25, 2014 · Viewed 7.5k times

First off, this is a follow-up question to: Change number of running spiders scrapyd

I used PhantomJS and Selenium to create a downloader middleware for my Scrapy project. It works well and hasn't noticeably slowed things down when I run my spiders one at a time locally.

But just recently I put a scrapyd server up on AWS. I noticed a possible race condition that seems to be causing errors and performance issues when more than one spider is running at once. I feel like the problem stems from two separate issues.

1) Spiders trying to use the phantomjs executable at the same time.

2) Spiders trying to write to PhantomJS's ghostdriver log file at the same time.

I'm guessing here, but the performance issue may be spiders waiting for those resources to become available (it could also be related to a race condition I had on an SQLite database).

Here are the errors I get:

exceptions.IOError: [Errno 13] Permission denied: 'ghostdriver.log' (log file race condition?)

selenium.common.exceptions.WebDriverException: Message: 'Can not connect to GhostDriver' (executable race condition?)
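
One way to test the log-file theory is to give every process its own GhostDriver log. The following is a minimal sketch, not a confirmed fix: `ghostdriver_log_path` is a hypothetical helper (not part of Selenium or Scrapy), and it assumes the log directory is writable by the user running scrapyd. Note that Errno 13 is a permissions failure, so a unique filename alone will not help if the directory itself is not writable.

```python
import os
import tempfile


def ghostdriver_log_path(spider_name, log_dir=None):
    """Return a log path unique to this spider and OS process.

    Hypothetical helper: the filename pattern is an assumption,
    chosen only so that concurrent processes never share a file.
    """
    # Fall back to the system temp directory if none is given.
    log_dir = log_dir or tempfile.gettempdir()
    filename = "ghostdriver-{0}-{1}.log".format(spider_name, os.getpid())
    return os.path.join(log_dir, filename)
```

The result could then be passed as the `service_log_path` argument already used in the middleware, for example `webdriver.PhantomJS(service_log_path=ghostdriver_log_path(spider.name))`.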

My questions are:

Does my analysis of what the problem(s) are seem correct?

Are there any known solutions to this problem other than limiting the number of spiders that can be run at a time?

Is there some other way I should be handling JavaScript? (If you think I should create an entirely new question to discuss the best way to handle JavaScript with Scrapy, let me know and I will.)

Here is my downloader middleware:

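from sys import platform as _platform  # presumed origin of _platform

from scrapy.http import HtmlResponse
from selenium import webdriver

# `settings` and `check_spider_middleware` are defined elsewhere in the project.

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        # Render the page in PhantomJS and hand the resulting HTML back to Scrapy.
        if _platform == "linux" or _platform == "linux2":
            driver = webdriver.PhantomJS(service_log_path='/var/log/scrapyd/ghost.log')
        else:
            driver = webdriver.PhantomJS(executable_path=settings.PHANTOM_JS_PATH)
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))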

Note: the _platform check is a temporary workaround until I get this source code deployed into a static environment.
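
The middleware as written starts a new PhantomJS process on every request and never calls `quit()`, so leftover processes may pile up when several spiders run at once. Whether that explains the GhostDriver connection error is only a guess. Below is a hedged sketch of one way to guarantee cleanup. `managed_driver` and `render_page` are hypothetical helper names, and `driver_factory` stands for any zero-argument callable that returns a Selenium driver.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver_factory):
    """Create a driver and make sure quit() runs, even if rendering fails."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # quit() shuts down the browser process behind the driver.
        driver.quit()


def render_page(url, driver_factory):
    """Load `url` in a fresh driver and return the rendered page source."""
    with managed_driver(driver_factory) as driver:
        driver.get(url)
        return driver.page_source
```

Inside `process_request`, the returned source could be encoded and wrapped in an `HtmlResponse` exactly as before, with the factory being something like `lambda: webdriver.PhantomJS(executable_path=settings.PHANTOM_JS_PATH)`.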

I found solutions on SO for the JavaScript problem, but they were spider-based. This bothered me because it meant every request had to be made once in the downloader handler and again in the spider. That is why I decided to implement mine as a downloader middleware.

Answer

Eric Hartford · Oct 20, 2014

Try using WebDriver to interface with PhantomJS, for example via scrapy-webdriver: https://github.com/brandicted/scrapy-webdriver