Top "Web-crawler" questions

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner.

Save a complete web page (incl. CSS, images) using Python/Selenium

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page …

python selenium web-scraping web-crawler bioinformatics
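
A common starting point (a minimal sketch, not the asker's code; the URL is a placeholder) is to let Selenium render the page and write page_source to disk. Note this captures the HTML only; CSS and images are separate resources that would have to be fetched individually.

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com/results")  # placeholder URL

# page_source holds the rendered HTML of the current page;
# linked CSS/images are not inlined and must be saved separately
with open("page.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)

driver.quit()
```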
Robots.txt not working

I have used robots.txt to restrict one of the folders on my site. The folder consists of the sites …

robots.txt web-crawler
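
For reference, the usual way to block crawlers from a single folder looks like the rule below (the directory name is a placeholder). Keep in mind that robots.txt is honored voluntarily, so it restricts well-behaved bots only.

```
User-agent: *
Disallow: /private/
```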
Passing arguments to process.crawl in Scrapy (Python)

I would like to get the same result as this command line : scrapy crawl linkedin_anonymous -a first=James -a …

python web-crawler scrapy scrapy-spider google-crawlers
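
The programmatic equivalent of `-a` flags is to pass keyword arguments to process.crawl, which Scrapy forwards to the spider's constructor. A minimal sketch, assuming it runs inside a Scrapy project containing the spider named in the question's command line:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# keyword arguments after the spider name are forwarded to the
# spider's __init__, just like -a first=James on the command line
process.crawl("linkedin_anonymous", first="James")
process.start()
```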
Getting Started with Python: AttributeError

I am new to Python and just downloaded it today. I am using it to work on a web spider, …

python web-crawler attributeerror chilkat
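
For context, Python raises AttributeError when code references an attribute or method an object doesn't have; a typo in a method name is the classic cause. A generic illustration (the asker's actual code isn't shown):

```python
class Spider:
    def crawl(self):
        return "crawling"

s = Spider()
s.crawl()   # fine
s.crawll()  # AttributeError: 'Spider' object has no attribute 'crawll'
```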
Nutch regex-urlfilter syntax

I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax …

regex web-crawler nutch
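
For orientation, Nutch's conf/regex-urlfilter.txt is evaluated top to bottom: each line is a `+` (accept) or `-` (reject) followed by a regex, and the first match wins. A typical layout (the domain is a placeholder):

```
# skip common binary/asset extensions
-\.(gif|jpg|png|css|js|zip)$
# accept anything under the target site
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```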
Are Meta Keywords Case Sensitive?

Is <meta name="keywords" content="mykeyword, Mykeyword"> the same thing as <meta name="keywords" content="mykeyword"> …

html seo web-crawler meta-tags
Python urllib2 and [Errno 10054] An existing connection was forcibly closed by the remote host, and a few other urllib2 problems

I've written a crawler that uses urllib2 to fetch URLs. Every few requests I get some weird behavior; I've tried …

python exception web-crawler urllib2 errno
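
Errno 10054 is Windows' "connection reset by peer"; servers often drop clients that request too quickly or look automated. A common mitigation (a sketch under that assumption, not a fix for the asker's unshown code) is to catch the error and retry with a backoff:

```python
import socket
import time
import urllib2

def fetch(url, retries=3, backoff=2):
    """Fetch a URL, retrying when the remote host resets the connection."""
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except (urllib2.URLError, socket.error):
            # errno 10054 is usually transient; wait and try again
            time.sleep(backoff * (attempt + 1))
    raise IOError("giving up on %s after %d attempts" % (url, retries))
```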
Exclude bots and spiders from a View counter in PHP

I have built a pretty basic advertisement manager for a website in PHP. I say basic because it's not complex …

php ads web-crawler
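
The usual technique, whatever the language, is to skip counting when the User-Agent header matches known crawler signatures. Sketched in Python for illustration (the question is PHP, but the check translates directly); the signature list is non-exhaustive:

```python
import re

# common substrings found in crawler User-Agent strings (non-exhaustive)
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp|archiver", re.I)

def is_bot(user_agent):
    """Heuristic: empty or bot-like User-Agents don't count as views."""
    return not user_agent or bool(BOT_PATTERN.search(user_agent))

def record_view(user_agent, counts, ad_id):
    # only increment the counter for what looks like a real browser
    if not is_bot(user_agent):
        counts[ad_id] = counts.get(ad_id, 0) + 1
```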
Ruby on Rails: How to determine if a request was made by a robot or search engine spider?

I have a Rails app that records an IP address for every request to a specific URL, but in my IP database I've found …

ruby-on-rails ruby-on-rails-3 search-engine web-crawler
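
User-Agent matching (as in the previous sketch) catches honest bots, but spoofed agents need verification. Major crawlers such as Googlebot can be confirmed by a reverse DNS lookup followed by a forward-confirming lookup; a Python sketch of that check:

```python
import socket

def is_verified_googlebot(ip):
    """Confirm a claimed Googlebot via reverse + forward DNS."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # forward-confirm: the hostname must resolve back to the same IP
        return ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False
```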
How to disable robots.txt when launching Scrapy shell?

I use the Scrapy shell without problems on several websites, but I run into problems when robots.txt does not …

python scrapy web-crawler robots.txt scrapy-shell
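
Scrapy checks robots.txt because of its ROBOTSTXT_OBEY setting, and any setting can be overridden for a single invocation with `-s`. To launch the shell without the robots.txt check (URL is a placeholder):

```
scrapy shell -s ROBOTSTXT_OBEY=False "https://example.com"
```

Setting ROBOTSTXT_OBEY = False in the project's settings.py disables the check project-wide; crawl politely if you do.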