Top "Web-crawler" questions

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner.

Crawl links of sitemap.xml with the wget command

I'm trying to crawl all links of a sitemap.xml to re-cache a website, but the recursive option of wget …

wget web-crawler sitemap.xml
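As an alternative to wget's recursive mode, the same re-caching effect can be sketched in a few lines of Python: pull the `<loc>` entries out of the sitemap and request each one. This is an illustrative sketch, assuming the standard sitemaps.org namespace; the URLs and the cache-warming use are hypothetical.

```python
# Sketch: extract <loc> URLs from a sitemap.xml and re-request each one.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# The standard sitemap namespace defined by sitemaps.org.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def warm_cache(sitemap_xml):
    """Request each URL once so the server regenerates its cached page."""
    for url in sitemap_urls(sitemap_xml):
        with urlopen(url) as resp:  # body is discarded; the hit is what matters
            resp.read()
```

This only handles a flat `<urlset>`; a sitemap index (`<sitemapindex>` pointing at child sitemaps) would need one more level of recursion.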
Can Scrapy be replaced by pyspider?

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, …

python web-scraping scrapy web-crawler pyspider
How to get casper.js http.status code?

I have the simple code below: var casper = require("casper").create({ }), utils = require('utils'), http = require('http'), fs = require('fs'); casper.…

javascript node.js web-crawler phantomjs casperjs
Protecting email addresses from spam bots / web crawlers

How do you prevent email addresses from being gathered from web pages by email spiders? Does mailto: linking them increase the likelihood …

web-crawler spam spam-prevention email-spam
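One common mitigation is to encode the address as HTML character entities: browsers render it normally, while naive scrapers scanning for plain-text addresses miss it. A minimal sketch (the helper names are illustrative, and this does not stop scrapers that decode entities):

```python
# Sketch: HTML-entity-encode an email address for display and mailto: links.
def obfuscate_email(address):
    """Encode every character as a decimal HTML entity."""
    return "".join("&#%d;" % ord(ch) for ch in address)

def mailto_link(address, label=None):
    """Build a mailto: anchor with both the href and the link text encoded."""
    encoded = obfuscate_email(address)
    text = obfuscate_email(label) if label else encoded
    return '<a href="mailto:%s">%s</a>' % (encoded, text)
```

The `mailto:` scheme itself stays in plain text here, so determined bots can still find it; stronger defenses render the address with JavaScript or as an image.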
Google won't read my robots.txt on S3

Since Google is crawling our static content (stored on S3), we created a robots.txt in the root directory (of the …

amazon-s3 web-crawler robots.txt googlebot
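For context: crawlers only fetch robots.txt from the root of the exact hostname they are crawling, so the file must be publicly readable at the top level of the bucket's own endpoint. A minimal example of the file itself (the paths and sitemap URL are placeholders):

```
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
```

If the content is served as `bucket.s3.amazonaws.com/site/...`, a robots.txt inside the `site/` prefix will never be consulted; it has to live at `bucket.s3.amazonaws.com/robots.txt`.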
Google crawl error with HTTP_ACCEPT_LANGUAGE

In my CodeIgniter app I use $_SERVER['HTTP_ACCEPT_LANGUAGE'] to determine the user's browser language to set the app …

php web-crawler googlebot http-accept-language
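A likely cause of this class of crawl error is that bots such as Googlebot may omit the Accept-Language header entirely, so the lookup needs a fallback before any parsing. A language-agnostic sketch of that defensive pattern (function and parameter names are illustrative):

```python
# Sketch: pick the top language from an Accept-Language header, with a
# fallback for requests (e.g. from crawlers) that omit the header.
def preferred_language(environ, default="en"):
    """Return the highest-quality language tag, or `default` if absent."""
    header = environ.get("HTTP_ACCEPT_LANGUAGE")
    if not header:
        return default  # crawlers often send no Accept-Language at all
    # Parse e.g. "da, en-gb;q=0.8, en;q=0.7" into (quality, tag) pairs.
    choices = []
    for part in header.split(","):
        piece = part.strip().split(";q=")
        tag = piece[0].strip()
        quality = float(piece[1]) if len(piece) == 2 else 1.0
        choices.append((quality, tag))
    return max(choices)[1]
```

The same guard in PHP would be an `isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])` check before reading the value.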
Get outlinks from Nutch

I am using Nutch 1.3 to crawl a website. I want to get a list of URLs crawled, and URLs originating …

web-crawler nutch
Load HTML string into DOM tree with Javascript

I'm currently working with an automation framework that is pulling a webpage down for analysis, which is then presented as …

javascript dom web-crawler rhino web-scraping
Fast internet crawler

I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I …

python multithreading web-crawler web-mining
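The usual starting point for a fast Python crawler is a thread pool, since fetching is I/O-bound. A sketch with a pluggable fetch function so the pool logic can be exercised without the network (names here are illustrative, not from the question):

```python
# Sketch: fetch many URLs concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def crawl(urls, fetch, workers=8):
    """Apply `fetch` to every URL concurrently; returns {url: result}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

In real use, `fetch` would be something like `lambda url: urllib.request.urlopen(url).read()`; for truly large-scale crawling, an async event loop or a distributed frontier usually scales further than threads.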
Python Scrapy on offline (local) data

I have a 270MB dataset (10000 html files) on my computer. Can I use Scrapy to crawl this dataset locally? How?

python scrapy web-crawler
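Scrapy can be pointed at local files via `file://` URLs in `start_urls`, but for a directory of saved HTML pages a dependency-free pass is also possible. A stdlib sketch, assuming the files end in `.html` (the class and function names are illustrative):

```python
# Sketch: walk a directory of saved .html files and extract their links
# with the stdlib parser, as a local, network-free crawl.
from html.parser import HTMLParser
from pathlib import Path

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl_local(root):
    """Yield (file path, links found) for every .html file under root."""
    for path in Path(root).rglob("*.html"):
        parser = LinkParser()
        parser.feed(path.read_text(errors="ignore"))
        yield path, parser.links
```

For 10,000 files this runs entirely from disk, so it sidesteps Scrapy's download machinery; the trade-off is that you give up Scrapy's selectors, pipelines, and item exports.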