Scrapy-like tool for Node.js?

user2422940 · Oct 30, 2014 · Viewed 8k times

I would like to know if there is something like Scrapy for Node.js. If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?

Answer

pguardiario · May 16, 2019

Scrapy is a Python crawling framework built around asynchronous I/O (it runs on Twisted). Node does not need an equivalent layer for that part, because all of its I/O is already asynchronous (unless you need it not to be).
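To make the asynchronous I/O point concrete, here is a minimal sketch that uses timers as a stand-in for network requests. The names fakeFetch and fetchAll are hypothetical and exist only for this illustration; no real HTTP call is made.

```javascript
// Hypothetical stand-in for a network request: resolves after `ms` milliseconds.
const fakeFetch = (name, ms) =>
  new Promise(resolve => setTimeout(() => resolve(name), ms))

// Start all three "requests" at once and wait for every one of them.
const fetchAll = async () => {
  const started = Date.now()
  const names = await Promise.all([
    fakeFetch('a', 100),
    fakeFetch('b', 100),
    fakeFetch('c', 100),
  ])
  return { names, elapsed: Date.now() - started }
}

fetchAll().then(({ names, elapsed }) => {
  // The waits overlap, so elapsed should be near 100 ms rather than 300 ms.
  console.log(names, `${elapsed} ms`)
})
```

Nothing here needs an extra library: the event loop keeps all three waits in flight at the same time, and the same applies to real HTTP requests.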

Here is what a Scrapy-style script might look like in Node. Note that the URLs are processed concurrently.

const cheerio = require('cheerio');
const axios = require('axios');

const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/'];

// Download a page and load it into cheerio. This might be called a "middleware" in Scrapy.
const get = async url => {
  const response = await axios.get(url);
  return cheerio.load(response.data);
};

// Emit a scraped item. In Scrapy this is roughly the job of an item pipeline.
const output = item => {
  console.log(item);
};

// parse plays the role of the initial Scrapy callback.
const parse = async url => {
  const $ = await get(url);
  output({ url, title: $('title').text() });
};

// Main execution, wrapped in an async function so that await can be used.
// Promise.all starts every parse call at once and rejects on the first failure.
(async function () {
  await Promise.all(
    startUrls.map(url => parse(url))
  );
})().catch(err => {
  console.error(err.message);
  process.exitCode = 1;
});
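
The script above starts every request at once, and Promise.all gives up as soon as one of them fails. Scrapy caps in-flight requests through settings such as CONCURRENT_REQUESTS; in Node you would have to do that yourself. The following is a rough sketch of one way to do it, not a definitive implementation. mapWithLimit is a hypothetical helper written for this example (it is not part of axios, cheerio, or Scrapy), and its result shape is only modelled on Promise.allSettled.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight at a time, collecting successes and failures alike.
const mapWithLimit = async (items, limit, worker) => {
  const results = new Array(items.length)
  let next = 0

  // Each runner claims the next unprocessed index until the list is exhausted.
  const runner = async () => {
    while (next < items.length) {
      const i = next++
      try {
        results[i] = { status: 'fulfilled', value: await worker(items[i]) }
      } catch (reason) {
        // Record the failure instead of aborting the whole batch.
        results[i] = { status: 'rejected', reason }
      }
    }
  }

  // Never start more runners than there are items, and always at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: runnerCount }, runner))
  return results
}
```

In the script above, the main block could then call mapWithLimit(startUrls, 2, parse) in place of Promise.all(...) and log any entries whose status is 'rejected'. Results come back in the same order as the input, whatever order the requests finish in.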