I use Scrapy and I try to scrape this site that uses Incapsula
<meta name="robots" content="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
</script>
I had already asked a Question about this issue 2 years ago, but this method (Incapsula-Cracker) does not work anymore.
I tried to understand How Incapsula works and I tried this for bypass it
def start_requests(self):
yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92}, dont_filter=True, callback = self.init_shop)
def init_shop(self,response) :
result_content = response.body
RE_ENCODED_FUNCTION = re.compile('var b="(.*?)"', re.DOTALL)
RE_INCAPSULA = re.compile('(_Incapsula_Resource\?SWHANEDL=.*?)"')
INCAPSULA_URL = 'https://courses-en-ligne.carrefour.fr/%s'
encoded_func = RE_ENCODED_FUNCTION.search(result_content).group(1)
decoded_func = ''.join([chr(int(encoded_func[i:i+2], 16)) for i in xrange(0, len(encoded_func), 2)])
incapsula_params = RE_INCAPSULA.search(decoded_func).group(1)
incap_url = INCAPSULA_URL % incapsula_params
yield Request(incap_url)
def parse(self):
print response.body
But i'm redirected to RE-Captcha Page
<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
</iframe>
</body>
</html>
So first of all there is no fool proof solutions to such problems. I as a actual user end-up having to solve captcha while answering on StackOverflow. Which means a bot will definitely get captchas.
Now there are few rules which I try and follow to decrease the chances of an captcha
TOR
is a big NO
Chrome
+ Selenium
+ Proxy
existing profile
. I prefer to have profiles which have browsing history with different websites, cookies from many other sites and trackers and going back month. You don't know how the evaluation of a user/bot difference may happen. So you want to look more like a real userThis is a cat and mouse game, where you don't know what the other party has as a defense. So you try to play nice and easy