The way to detect web scraping

aks · Mar 20, 2011 · Viewed 10.1k times

I need to detect scraping of information on my website. I tried detection based on behavior patterns, and it seems promising, although relatively computationally heavy.

The basic idea is to collect the request timestamps of a particular client and compare that client's behavior pattern with a common pattern or a precomputed pattern.

To be more precise, I collect the time intervals between requests into an array, indexed by a function of the interval:

i = (integer)(ln(interval + 1) / ln(N + 1) * N) + 1
Y[i]++ for the common data
X[i]++ for the current client

where N is the time (count) limit; intervals greater than N are dropped. Initially, X and Y are filled with ones.
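The binning step above can be sketched in Python as follows. This is only an illustration under assumptions: the names `bin_index`, `new_histogram`, and `record` are hypothetical, intervals are taken to be non-negative numbers in the same units as N, and bins 1..N+1 are stored at list offsets 0..N.

```python
import math
from typing import List, Optional


def bin_index(interval: float, n: int) -> Optional[int]:
    """Return the 1-based logarithmic bin for an interval, or None if it is dropped."""
    if interval < 0 or interval > n:
        return None  # intervals greater than N are dropped
    return int(math.log(interval + 1) / math.log(n + 1) * n) + 1


def new_histogram(n: int) -> List[int]:
    """Bins 1..N+1, initially filled with ones (stored at offsets 0..N)."""
    return [1] * (n + 1)


def record(x: List[int], y: List[int], interval: float, n: int) -> bool:
    """Count one interval in the client histogram x and the common histogram y."""
    i = bin_index(interval, n)
    if i is None:
        return False
    y[i - 1] += 1  # Y[i]++ (common data)
    x[i - 1] += 1  # X[i]++ (current client)
    return True
```

With this indexing, an interval of 0 lands in bin 1 and an interval equal to N lands in bin N + 1, so short intervals get much finer resolution than long ones.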

Then, once enough intervals have been collected in X and Y, it's time to make a decision. The criterion is the parameter C:

C = sqrt(sum((X[i]/norm(X) - Y[i]/norm(Y))^2)/k)

where X is the data for a particular client, Y is the common data, norm() is a calibration function, and k is a normalization coefficient that depends on the type of norm(). There are 3 types:

  1. norm(X) = sum(X)/count(X), k = 2
  2. norm(X) = sqrt(sum(X[i]^2)), k = 2
  3. norm(X) = max(X[i]), k is the square root of the number of non-empty elements of X

C is in the range (0..1): 0 means there is no behavior deviation and 1 is the maximum deviation.

Calibration of type 1 is best for repeating requests, type 2 for repeating requests with few intervals, and type 3 for non-constant request intervals.
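For reference, the criterion C could be computed roughly like this. The function names (`norm_mean`, `norm_l2`, `norm_max`, `deviation`) are hypothetical, and k is passed in by the caller, because the exact meaning of "non-empty elements" for type 3 is not spelled out above.

```python
import math
from typing import Callable, Sequence


def norm_mean(v: Sequence[float]) -> float:
    """Type 1: sum(X) / count(X)."""
    return sum(v) / len(v)


def norm_l2(v: Sequence[float]) -> float:
    """Type 2: sqrt(sum(X[i]^2))."""
    return math.sqrt(sum(a * a for a in v))


def norm_max(v: Sequence[float]) -> float:
    """Type 3: max(X[i])."""
    return max(v)


def deviation(x: Sequence[float], y: Sequence[float],
              norm: Callable[[Sequence[float]], float], k: float) -> float:
    """C = sqrt(sum((X[i]/norm(X) - Y[i]/norm(Y))^2) / k)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of bins")
    nx, ny = norm(x), norm(y)
    total = sum((a / nx - b / ny) ** 2 for a, b in zip(x, y))
    return math.sqrt(total / k)
```

For example, `deviation(x, y, norm_l2, 2)` gives the type 2 value: identical histograms yield 0, and two histograms with no overlapping bins yield 1.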

What do you think? I'd appreciate it if you tried this on your services.

Answer

rook · Mar 21, 2011

To be honest, your approach is completely worthless because it is trivial to bypass. An attacker doesn't even have to write a line of code to bypass it. Proxy servers are free, and you can boot up a new machine with a new IP address on Amazon EC2 for 2 cents an hour.

A better approach is Roboo, which uses cookie techniques to foil robots. The vast majority of robots can't run JavaScript or Flash, and this can be used to your advantage.
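To illustrate the general idea of a cookie challenge (this is not Roboo's actual code, just a minimal sketch), the server can hand out a page whose JavaScript sets a cookie and reloads; clients that never run the script never present the cookie. The secret, the cookie name `challenge`, and all function names here are assumptions.

```python
import hashlib
import hmac
from typing import Dict

SECRET = b"change-me"  # hypothetical server-side secret


def challenge_token(client_ip: str) -> str:
    """Token tied to the client IP, so it cannot be reused from another address."""
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()


def challenge_page(client_ip: str) -> str:
    """HTML served to unverified clients; only a JavaScript-capable client sets the cookie."""
    token = challenge_token(client_ip)
    return ("<script>document.cookie = 'challenge=" + token
            + "; path=/'; location.reload();</script>")


def is_verified(client_ip: str, cookies: Dict[str, str]) -> bool:
    """True if the request carries the cookie the challenge script would have set."""
    presented = cookies.get("challenge", "")
    return hmac.compare_digest(presented, challenge_token(client_ip))
```

A scraper that parses the token out of the page can still set the cookie by hand, so this only filters clients that do not execute scripts at all.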

However, all of this is "(in)security through obscurity", and the ONLY REASON it might work is that your data isn't worth a programmer spending 5 minutes on it (Roboo included).