I am trying to crawl website, which is sophisticated enough to stop bots, I mean it is permitting only a few requests, after that Scrapy hangs.
Question 1: is there a way, if Scrapy hangs I can restart my crawling process from the same point. To get rid of this problem, I wrote my settings file like this
BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'
SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'
This is my program:
class ypSpider(CrawlSpider):
name = "yp"
start_urls = [
SOME URL
]
rules=(
#These are some rules
)
def parse_item(self, response):
####################################################################
#cleaning the html page by removing scripts html tags
#######################################################
hxs=HtmlXPathSelector(response)
The question is where I could write the http proxies and shall i have to import any tor related classes, I am new to Scrapy because of this group I learned so much, Now I am trying to learn "how to use ip rotation or tor'
As one of our member suggested, I started tor and I set HTTP_PROXY to
set http_proxy=http://localhost:8118
but it is throwing some errors,
failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError' Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.
So i changed http_proxy to
set http_proxy=http://localhost:9051
Now the error is
failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.
I checked firefox network settings, there I couldn't see any http proxies but instead of that Its using SOCKSV5, there it is showing 127.0.0.1:9051. (before TOR it works with no proxies)Please help me I am still not understanding how to use TOR through Scrapy. Which bundle of TOR I am supposed to use and how? I hope that both of my questions will be resolved
TOR by itself is not an http proxy, the port 8118 and the connection refused error suggest that you don't have privoxy[1] running properly. Try setting up privoxy correctly and then try again using the environment variable http_proxy=http://localhost:8118
.
I have done crawling through TOR using privoxy with scrapy successfully.