Stopping scripters from slamming your website

scripting e-commerce bots detection

Dave Rutledge · Jan 16, 2009 · Viewed 88.5k times

I've accepted an answer, but sadly, I believe we're stuck with our original worst-case scenario: CAPTCHA everyone on every attempt to purchase the crap. Short explanation: caching and web farms make it impossible to track hits, and any workaround (sending a non-cached web beacon, writing to a unified table, etc.) slows the site down worse than the bots would. There is likely some pricey hardware from Cisco or the like that can help at a high level, but it's hard to justify the cost if CAPTCHA-ing everyone is an alternative. I'll attempt a fuller explanation later, and clean this up for future searchers (though others are welcome to try, as it's community wiki).

Situation

This is about the bag o' crap sales on woot.com. I'm the president of Woot Workshop, the subsidiary of Woot that does the design, writes the product descriptions, podcasts, blog posts, and moderates the forums. I work with CSS/HTML and am only barely familiar with other technologies. I work closely with the developers and have talked through all of the answers here (and many other ideas we've had).

Usability is a massive part of my job, and making the site exciting and fun is most of the rest of it. That's where the three goals below come from. CAPTCHA harms usability, and bots steal the fun and excitement out of our crap sales.

Bots are slamming our front page tens of times a second, screen scraping (and/or scanning our RSS) for the Random Crap sale. The moment they see it, a second stage of the program kicks in that logs in, clicks "I want One", fills out the form, and buys the crap.

Evaluation

lc: On stackoverflow and other sites that use this method, they're almost always dealing with authenticated (logged in) users, because the task being attempted requires that.

On Woot, anonymous (not logged in) users can view our home page. In other words, the slamming bots can be non-authenticated (and essentially non-trackable except by IP address).

So we're back to scanning for IPs, which a) is fairly useless in this age of cloud networking and spambot zombies and b) catches too many innocents given the number of businesses that come from one IP address (not to mention the issues with non-static IP ISPs and the potential performance hit from trying to track this).

Oh, and having people call us would be the worst possible scenario. Can we have them call you?

BradC: Ned Batchelder's methods look pretty cool, but they're pretty firmly designed to defeat bots built for a network of sites. Our problem is bots are built specifically to defeat our site. Some of these methods could likely work for a short time until the scripters evolved their bots to ignore the honeypot, screen-scrape for nearby label names instead of form ids, and use a javascript-capable browser control.

lc again: "Unless, of course, the hype is part of your marketing scheme." Yes, it definitely is. The surprise of when the item appears, as well as the excitement if you manage to get one, is probably as important as, or more important than, the crap you actually end up getting. Anything that eliminates first-come/first-serve is detrimental to the thrill of 'winning' the crap.

novatrust: And I, for one, welcome our new bot overlords. We actually do offer RSS feeds to allow 3rd party apps to scan our site for product info, but not ahead of the main site HTML. If I'm interpreting it right, your solution does help goal 2 (performance issues) by completely sacrificing goal 1, and just resigning ourselves to the fact that bots will be buying most of the crap. I up-voted your response, because the pessimism of your last paragraph feels accurate to me. There seems to be no silver bullet here.

The rest of the responses generally rely on IP tracking, which, again, seems to be both useless (with botnets/zombies/cloud networking) and detrimental (catching many innocents who come from same-IP destinations).

Any other approaches / ideas? My developers keep saying "let's just do CAPTCHA" but I'm hoping there are less intrusive methods for all the actual humans wanting some of our crap.

Original question

Say you're selling something cheap that has a very high perceived value, and you have a very limited amount. No one knows exactly when you will sell this item. And over a million people regularly come by to see what you're selling.

You end up with scripters and bots attempting to programmatically [a] figure out when you're selling said item, and [b] make sure they're among the first to buy it. This sucks for two reasons:

1. Your site is slammed by non-humans, slowing everything down for everyone.
2. The scripters end up 'winning' the product, causing the regulars to feel cheated.

A seemingly obvious solution is to create some hoops for your users to jump through before placing their order, but there are at least three problems with this:

1. The user experience sucks for humans, as they have to decipher CAPTCHA, pick out the cat, or solve a math problem.
2. If the perceived benefit is high enough, and the crowd large enough, some group will find their way around any tweak, leading to an arms race. (This is especially true the simpler the tweak is: a hidden 'comments' field, re-arranged form elements, mis-labeled fields, and hidden 'gotcha' text will all work once and then need to be changed again, because the bots target this specific form.)
3. Even if the scripters can't 'solve' your tweak, it doesn't prevent them from slamming your front page and then sounding an alarm for the scripter to fill out the order manually. Given they get the advantage from solving [a], they will likely still win [b] since they'll be the first humans reaching the order page. Additionally, the first reason above (the site being slammed) still applies, causing server errors and decreased performance for everyone.

Another solution is to watch for IPs hitting too often, block them at the firewall, or otherwise prevent them from ordering. This could solve reason 2 and prevent [b], but the performance hit from scanning for IPs is massive and would likely cause more problems like reason 1 than the scripters were causing on their own. Additionally, the possibility of cloud networking and spambot zombies makes IP checking fairly useless.

A third idea, forcing the order form to be loaded for some time (say, half a second) would potentially slow the progress of the speedy orders, but again, the scripters would still be the first people in, at any speed not detrimental to actual users.
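For readers who want to picture the "minimum time on the form" idea, here is a minimal sketch, not anything Woot actually runs. It assumes the server embeds a signed timestamp in the order form and rejects submissions that come back sooner than half a second. The names `SECRET_KEY`, `MIN_FILL_SECONDS`, `issue_form_token` and `submitted_too_fast` are hypothetical.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret
MIN_FILL_SECONDS = 0.5                      # the "half a second" from the text


def _sign(issued: str) -> str:
    """HMAC-SHA256 signature over the issue time, so clients cannot forge it."""
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()


def issue_form_token(now=None) -> str:
    """Create a token to embed in the order form as a hidden field."""
    now = time.time() if now is None else now
    issued = f"{now:.3f}"
    return f"{issued}:{_sign(issued)}"


def submitted_too_fast(token: str, now=None) -> bool:
    """True if the token is invalid or the form came back too quickly."""
    now = time.time() if now is None else now
    try:
        issued, signature = token.split(":", 1)
        if not hmac.compare_digest(signature, _sign(issued)):
            return True  # tampered or forged token
        return now - float(issued) < MIN_FILL_SECONDS
    except (ValueError, TypeError):
        return True  # malformed token
```

As the paragraph above points out, this only enforces the delay. A bot can simply wait out the half second and still be first in line.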

Goals

1. Sell the item to non-scripting humans.
2. Keep the site running at a speed not slowed by bots.
3. Don't hassle the 'normal' users with any tasks to complete to prove they're human.

Answer

How about implementing something like SO does with the CAPTCHAs?

If you're using the site normally, you'll probably never see one. If you reload the same page too often, post successive comments too quickly, or do something else that triggers an alarm, you're asked to prove you're human. In your case, the triggers would probably be constant reloads of the same page, following every link on a page quickly, or filling in an order form too fast to be human.
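A rough sketch of such a trigger, assuming a sliding time window per client and page. The class name `ReloadAlarm` and the thresholds are made up for illustration, and the counters live in one process's memory, which is an assumption that would not hold across a cached web farm without shared storage.

```python
import time
from collections import defaultdict, deque


class ReloadAlarm:
    """Flags a client that requests the same page too often in a short window."""

    def __init__(self, max_hits=10, window_seconds=5.0):
        self.max_hits = max_hits              # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self._hits = defaultdict(deque)       # (client_id, page) -> hit times

    def should_challenge(self, client_id, page, now=None):
        """Record one hit and return True if a human check should be shown."""
        now = time.monotonic() if now is None else now
        hits = self._hits[(client_id, page)]
        hits.append(now)
        # Drop hits that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_hits
```

Here `client_id` could be an IP address or a session id. A normal visitor stays under the threshold and never sees the check.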

If they fail the check x times in a row (say, 2 or 3), give that IP a timeout or other such measure. Then at the end of the timeout, dump them back to the check again.
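One way this could look, as a sketch only: count consecutive failures per IP, block for a fixed period once the limit is hit, and start the count fresh afterwards so the client lands back on the check. `HumanCheckGate` and its defaults are hypothetical.

```python
import time


class HumanCheckGate:
    """Blocks an IP for a while after too many consecutive failed human checks."""

    def __init__(self, max_failures=3, timeout_seconds=300.0):
        self.max_failures = max_failures        # the "x times in a row"
        self.timeout_seconds = timeout_seconds  # hypothetical timeout length
        self._failures = {}       # ip -> consecutive failed checks
        self._blocked_until = {}  # ip -> time at which the block ends

    def is_blocked(self, ip, now=None):
        now = time.monotonic() if now is None else now
        until = self._blocked_until.get(ip)
        return until is not None and now < until

    def record_result(self, ip, passed, now=None):
        """Record the outcome of one human check for this IP."""
        now = time.monotonic() if now is None else now
        if passed:
            self._failures.pop(ip, None)  # a pass resets the streak
            return
        count = self._failures.get(ip, 0) + 1
        if count >= self.max_failures:
            self._blocked_until[ip] = now + self.timeout_seconds
            self._failures.pop(ip, None)  # fresh count after the timeout
        else:
            self._failures[ip] = count
```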

Since you have unregistered users accessing the site, you do have only IPs to go on. You can issue sessions to each browser and track that way if you wish. And, of course, throw up a human-check if too many sessions are being (re-)created in succession (in case a bot keeps deleting the cookie).

As far as catching too many innocents, you can put up a disclaimer on the human-check page: "This page may also appear if too many anonymous users are viewing our site from the same location. We encourage you to register or login to avoid this." (Adjust the wording appropriately.)

Besides, what are the odds that X people are loading the same page(s) at the same time from one IP? If they're high, maybe you need a different trigger mechanism for your bot alarm.

Edit: Another option is if they fail too many times, and you're confident about the product's demand, to block them and make them personally CALL you to remove the block.

Having people call does seem like an asinine measure, but it makes sure there's a human somewhere behind the computer. The key is to have the block only be in place for a condition which should almost never happen unless it's a bot (e.g. fail the check multiple times in a row). Then it FORCES human interaction - to pick up the phone.

In response to the comment about having them call me, there's obviously a tradeoff here. Are you worried enough about ensuring your users are human to accept a couple of phone calls when the product goes on sale? If I were that concerned about a product getting to human users, I'd have to make this decision, perhaps sacrificing a (small) bit of my time in the process.

Since it seems like you're determined to not let bots get the upper hand/slam your site, I believe the phone may be a good option. Since I don't make a profit off your product, I have no interest in receiving these calls. Were you to share some of that profit, however, I may become interested. As this is your product, you have to decide how much you care and implement accordingly.

The other ways of releasing the block just aren't as effective: a timeout (but they'd get to slam your site again after, rinse-repeat), a long timeout (if it was really a human trying to buy your product, they'd be SOL and punished for failing the check), email (easily done by bots), fax (same), or snail mail (takes too long).

You could, of course, instead have the timeout period increase per IP for each time they get a timeout. Just make sure you're not punishing true humans inadvertently.
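A small sketch of that escalation, assuming the timeout doubles with each strike and is capped so a real human is never locked out for too long. The class name, base, factor and cap are illustrative values, not recommendations.

```python
class EscalatingTimeout:
    """Hands out a longer timeout each time the same IP earns another one."""

    def __init__(self, base_seconds=60.0, factor=2.0, max_seconds=86400.0):
        self.base_seconds = base_seconds  # first timeout
        self.factor = factor              # growth per repeat offence
        self.max_seconds = max_seconds    # cap to limit harm to real humans
        self._strikes = {}                # ip -> timeouts issued so far

    def next_timeout(self, ip):
        """Return the timeout (in seconds) for this IP and record the strike."""
        strikes = self._strikes.get(ip, 0)
        self._strikes[ip] = strikes + 1
        return min(self.base_seconds * self.factor ** strikes, self.max_seconds)
```

With these defaults an IP would wait 60 seconds the first time, then 120, then 240, and so on up to the cap.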