What options are there to detect web-crawlers that do not want to be detected?
(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)
I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp. I consider a bot nice if it identifies itself in its user-agent string and reads (and obeys) robots.txt.
I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.
There are some trapdoors that can be constructed; an updated list (thanks Chris, gs) includes, for example: a link to a directory that is only listed (and disallowed) in robots.txt, links that are invisible to human visitors, and URLs with deliberately mixed capitalisation.
Some traps would be triggered by both 'good' and 'bad' bots; you could combine those with a whitelist (a sketch of the combination follows this list):
1. It triggered a trap
2. Did it request robots.txt?
3. It did not trigger another trap, because it obeyed robots.txt
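For illustration, here is a minimal Python sketch of that combination. It is my own example rather than anything from the question: TRAP_PATH is a hypothetical URL that would only be reachable via links a human never sees and would be listed as Disallow in robots.txt, and the sketch keeps only the simplest case (a client that hits the trap without ever having fetched robots.txt).

```python
# Minimal sketch (assumption: TRAP_PATH is disallowed in robots.txt and only
# reachable via links a human would not see). A polite bot reads robots.txt and
# stays away; a stealth crawler that ignores robots.txt may request the trap.

TRAP_PATH = "/private/do-not-follow/"   # hypothetical trap directory

requested_robots_txt: set[str] = set()  # IPs that fetched /robots.txt
flagged: set[str] = set()               # IPs considered stealth crawlers

def on_request(ip: str, path: str) -> None:
    """Feed every request from the access log through this function."""
    if path == "/robots.txt":
        requested_robots_txt.add(ip)    # candidate for the 'good bot' whitelist
    elif path.startswith(TRAP_PATH):
        # A bot that read (and obeyed) robots.txt should never land here, so
        # only clients that skipped robots.txt are flagged in this sketch.
        if ip not in requested_robots_txt:
            flagged.add(ip)
```

A real implementation would also need to decide what to do with a client that fetches robots.txt and then hits the trap anyway; this sketch deliberately leaves that case out.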
One other important thing: please consider blind people using screen readers. Give people a way to contact you, or a (non-image) captcha to solve so they can continue browsing.
What methods are there to automatically detect web crawlers that try to mask themselves as normal human visitors?
Update
The question is not: how do I catch every crawler? The question is: how can I maximize the chance of detecting a crawler?
Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc.
I have no illusions: I won't be able to beat them.
You would, however, be surprised how stupid some crawlers are, with the best example of stupidity (in my opinion) being a crawler that casts all URLs to lower case before requesting them.
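That particular mistake can itself be turned into a trap. As a sketch (my own illustration, not something from the question): only ever publish mixed-case URLs, then flag any client that requests the lowercased variant. KNOWN_PATHS below is a hypothetical set of the paths actually linked on the site.

```python
# Minimal sketch: flag clients that request the all-lowercase form of a path we
# only ever publish with mixed case, i.e. the client probably rewrote the URL.

KNOWN_PATHS = {"/Articles/Detecting-Crawlers.html", "/Images/SiteLogo.png"}

# Map lowercased path -> canonical path, so "normalised" requests stand out.
LOWER_TO_CANONICAL = {p.lower(): p for p in KNOWN_PATHS}

def looks_like_lowercasing_bot(requested_path: str) -> bool:
    """True if the request only matches a known URL after lowercasing."""
    canonical = LOWER_TO_CANONICAL.get(requested_path.lower())
    return canonical is not None and requested_path != canonical
```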
And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.
A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were updated automatically as needed by checking claimed user-agent strings; if a client claimed to be a legitimate spider but was not on the whitelist, the system performed DNS/reverse-DNS lookups to verify that the source IP address actually corresponded to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to blacklist or whitelist the address in case of an incorrect assessment.
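For illustration, a forward-confirmed reverse-DNS check along those lines might look like the Python sketch below. This is my own approximation, not the code from that system, and the bot-to-domain mapping is an assumption based on commonly published values rather than an authoritative list.

```python
# Minimal sketch of a forward-confirmed reverse-DNS check (my illustration).
# Idea: a client claiming to be a known spider must (a) reverse-resolve to a
# hostname in that spider's domain and (b) have that hostname resolve forward
# back to the same IP address.

import socket

CLAIMED_BOT_DOMAINS = {          # assumed mapping, not authoritative
    "Googlebot": ("googlebot.com", "google.com"),
    "Slurp": ("crawl.yahoo.net",),
}

def verify_claimed_bot(ip: str, claimed_bot: str) -> bool:
    """True only if reverse and forward DNS both confirm the claimed identity."""
    domains = CLAIMED_BOT_DOMAINS.get(claimed_bot)
    if not domains:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)             # reverse lookup
        if not hostname.endswith(tuple("." + d for d in domains)):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]    # forward lookup
        return ip in forward_ips
    except OSError:                                           # lookup failed
        return False
```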
I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.
Side point: If you're thinking about doing a similar detection system based on hit-rate-limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. I see a lot of people talking about these kinds of schemes who want to block anyone who tops 5-10 hits in a second, which may generate false positives on image-heavy pages (unless images are excluded from the tally) and will generate false positives when someone like me finds an interesting site that he wants to read all of, so he opens up all the links in tabs to load in the background while he reads the first one.
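As a rough sketch of that suggestion (my own Python illustration; the five-minute window, the threshold, and the extension list are assumptions, not values from the answer), a per-IP sliding window that excludes image requests might look like this:

```python
# Count page hits per IP over a longer window, ignore image requests, and flag
# IPs that exceed a threshold. Window, threshold and extensions are illustrative.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300             # five-minute totals, not per-second spikes
MAX_HITS_PER_WINDOW = 300        # hypothetical threshold
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")

hits_by_ip = defaultdict(deque)  # ip -> timestamps of recent non-image hits

def record_hit(ip: str, path: str) -> bool:
    """Record one request; return True if this IP should be flagged/blocked."""
    if path.lower().endswith(IMAGE_EXTENSIONS):
        return False                         # images excluded from the tally
    now = time.time()
    window = hits_by_ip[ip]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()                     # drop hits older than the window
    return len(window) > MAX_HITS_PER_WINDOW
```

Counting over the full window rather than per second is what keeps the tab-opening reader described above from being flagged.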