How to crawl with php Goutte and Guzzle if data is loaded by Javascript?

Batman · Apr 17, 2016 · Viewed 14.2k times

Many times when crawling, we run into problems where content rendered on the page is generated with JavaScript, and therefore Goutte is unable to crawl it (e.g. AJAX requests, jQuery).

Answer

Alejandro Moreno · Jul 20, 2017

You want to have a look at PhantomJS. There is this PHP implementation:

http://jonnnnyw.github.io/php-phantomjs/

if you need to have it working with PHP, of course.

You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like searching for contents, etc.). That would depend on your needs; maybe you can simply use the DOM, like this:

How to get element by class name?
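For instance, here is a minimal sketch of that DOM approach using Symfony's DomCrawler (the component Goutte is built on). It assumes the symfony/dom-crawler and symfony/css-selector packages are installed via Composer, and the `$html` string is a stand-in for the markup a headless browser would return:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Stand-in for the rendered markup returned by the headless browser.
$html = '<div><p class="headline">Hello</p><p class="body">World</p></div>';

$crawler = new Crawler($html);

// Select an element by class name with a CSS selector.
$headline = $crawler->filter('p.headline')->text();

echo $headline, "\n"; // prints "Hello"
```

`filter()` accepts any CSS selector, so `.headline`, `p.headline`, or `div > p.headline` would all match the same node here.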

Here is some working code.

    use JonnyW\PhantomJs\Client as PhantomClient;

    $content = $this->getHeadlessResponse($url);
    $this->crawler->addContent($content);

    /**
     * Get response using a headless browser (PhantomJS in this case).
     *
     * @param string $url
     *   URL to fetch headless.
     *
     * @return string
     *   Response body, or an empty string on failure.
     */
    public function getHeadlessResponse($url) {
        // Fetch with PhantomJS
        $phantomClient = PhantomClient::getInstance();

        // and feed into the crawler.
        $request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');

        /**
         * @see JonnyW\PhantomJs\Http\Response
         **/
        $response = $phantomClient->getMessageFactory()->createResponse();

        // Send the request.
        $phantomClient->send($request, $response);

        if ($response->getStatus() === 200) {
            // Return the rendered page content.
            return $response->getContent();
        }

        return '';
    }

The only disadvantage of using PhantomJS is that it will be slower than Guzzle, but of course, you have to wait for all that pesky JS to be loaded.