Top "Web-crawler" questions

A web crawler (also known as a web spider) is a program that browses the World Wide Web in a methodical, automated manner.

unknown command: crawl error

I am a newbie to Python. I am running the 32-bit version of Python 2.7.3 on a 64-bit OS. (I tried 64-bit but it …

python scrapy web-crawler
Writing items to a MySQL database in Scrapy

I am new to Scrapy. I have the spider code class Example_spider(BaseSpider): name = "example" allowed_domains = ["www.example.…

mysql scrapy pipeline web-crawler
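A minimal sketch of such an item pipeline. The table and field names here are assumptions, and sqlite3 (stdlib) stands in for MySQL so the example is self-contained; in practice you would swap the connect call for a MySQL driver such as pymysql and register the class under ITEM_PIPELINES in the Scrapy settings.

```python
import sqlite3


class SQLStorePipeline:
    """Scrapy-style item pipeline that inserts each scraped item into a
    database. sqlite3 is a stand-in here; a real deployment would open a
    MySQL connection instead (e.g. with pymysql.connect)."""

    def __init__(self, db_path="items.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, price TEXT)")

    def process_item(self, item, spider):
        # Parameterized query: never interpolate scraped text into SQL.
        self.conn.execute("INSERT INTO items VALUES (?, ?)",
                          (item["title"], item["price"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```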
Syntax error, insert "... VariableDeclaratorId" to complete FormalParameterList

I am facing some issues with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.…

java web-crawler crawler4j
Crawler vs. scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?

web-crawler terminology scraper
Detect Search Crawlers via JavaScript

I am wondering how I would go about detecting search crawlers. The reason I ask is that I want …

javascript web-crawler bots
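The question asks about JavaScript, but the core check, matching the user-agent string against known bot tokens, looks the same in any language. A Python sketch (the token list is illustrative and far from exhaustive; a real deployment would use a maintained list):

```python
import re

# A few well-known crawler tokens; real bot lists are much longer
# and change over time.
BOT_PATTERN = re.compile(
    r"googlebot|bingbot|slurp|duckduckbot|baiduspider|yandexbot",
    re.IGNORECASE,
)


def is_search_crawler(user_agent):
    """Return True if the user-agent string matches a known bot token."""
    return bool(BOT_PATTERN.search(user_agent or ""))
```

Note that user agents are trivially spoofable, so this only identifies crawlers that choose to announce themselves.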
How to crawl Facebook based on friendship information?

I'm a graduate student whose research is on complex networks. I am working on a project that involves analyzing connections between …

facebook social-networking web-crawler
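Independent of Facebook's API specifics, crawling by friendship information is a breadth-first traversal of the friendship graph. A sketch, where `get_friends` is a hypothetical callback standing in for whatever API or page-parsing step supplies a user's friend list:

```python
from collections import deque


def crawl_friend_graph(start, get_friends, max_nodes=1000):
    """Breadth-first traversal of a friendship graph.

    get_friends(user) must return an iterable of that user's friends
    (stubbed here; a real crawler would call an API and respect its
    rate limits and terms of service). Returns the set of visited
    users and the list of observed friendship edges.
    """
    seen = {start}
    edges = []
    queue = deque([start])
    while queue and len(seen) < max_nodes:
        user = queue.popleft()
        for friend in get_friends(user):
            edges.append((user, friend))
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return seen, edges
```

The `max_nodes` cap matters in practice: social graphs grow so quickly per hop that an unbounded traversal is rarely what you want.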
How to stop page loading in Firefox programmatically?

I am running several tests with WebDriver and Firefox. I'm running into a problem with the following command: WebDriver.get(…

firefox selenium web-crawler ghostdriver
Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. …

format web-crawler robots.txt agents
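For reference, the usual layout is one `User-agent` group per crawler, each with its own `Crawl-delay` line (values in seconds; the agent names and delays below are illustrative). Note that `Crawl-delay` is not part of the original robots.txt standard: some major crawlers such as Bingbot and Yandex honor it, while Googlebot ignores it.

```
User-agent: bingbot
Crawl-delay: 10

User-agent: Yandex
Crawl-delay: 5

# Fallback for all other crawlers
User-agent: *
Crawl-delay: 30
```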
Automated link-checker for system testing

I often have to work with fragile legacy websites that break in unexpected ways when logic or configuration is updated. …

automated-tests web-crawler system-testing
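The core of such a link checker is extracting the anchors from each fetched page and probing each target URL. A minimal stdlib-only sketch (the fetching helper is shown but, being a network call, would be wired into a crawl loop in a real tool):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url=""):
    """Return absolute URLs for every anchor in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def check_link(url, timeout=10):
    """Return the HTTP status code for url, or None on any failure.

    Network call; in a full checker this runs over every extracted
    link and anything that is not a 2xx/3xx gets reported.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None
```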
Ban robots from website

My website is often down because a spider is accessing too many resources. This is what my hosting provider told me. …

bots robots.txt web-crawler
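A robots.txt exclusion is the usual first step; the bot name below is a placeholder for whatever user agent shows up in the server logs. Well-behaved crawlers obey it, but abusive bots typically ignore robots.txt entirely, in which case blocking the user agent or IP at the web-server or firewall level is the fallback.

```
# Block the offending crawler entirely (name is a placeholder)
User-agent: BadBot
Disallow: /

# Slow everyone else down (non-standard directive; honored by
# some crawlers, ignored by others such as Googlebot)
User-agent: *
Crawl-delay: 10
```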