Python requests vs. robots.txt

Question 1

python beautifulsoup python-requests robots.txt

Austin · Nov 10, 2013 · Viewed 7.5k times · Source

Answer

What is most likely happening is that the server checks the User-Agent header and denies access to the default user agents that bots send.

For example, requests sets the user agent to something like python-requests/2.9.1.

You can specify the headers yourself:

import random

import requests

url = "https://google.com"

# Pool of browser user-agent strings to choose from.
UAS = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
       "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
       "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

# Pick one user agent at random for this request.
ua = random.choice(UAS)

# Send the chosen user agent instead of the python-requests default.
headers = {'user-agent': ua}
r = requests.get(url, headers=headers)

Question 2

I have a script meant for personal use that scrapes some websites for information. Until recently it worked just fine, but it seems one of the websites has beefed up its security and I can no longer get access to its contents.

I'm using Python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I get the following:

'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
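Before handing a response like this to BeautifulSoup, it can help to check whether the body is a block page at all. The snippet below is only a rough sketch: it assumes the two marker strings visible in the response above (`_Incapsula_Resource` and `Incapsula incident ID`) are reliable indicators, and the names `BLOCK_MARKERS` and `looks_blocked` are made up for illustration.

```python
# Marker strings taken from the blocked response shown above.
# Assumption: a real page from the site would not contain them.
BLOCK_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def looks_blocked(html: str) -> bool:
    """Return True if the page body contains any block-page marker."""
    return any(marker in html for marker in BLOCK_MARKERS)


if __name__ == "__main__":
    # Hypothetical, shortened sample body for demonstration only.
    sample = (
        '<html><iframe src="/_Incapsula_Resource?id=example">'
        "Request unsuccessful.</iframe></html>"
    )
    if looks_blocked(sample):
        print("Blocked: skip parsing and retry later or change headers.")
```

With requests, the same check would be applied to `r.text` right after the call to `requests.get`.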

I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html

Is there any way I can convince the website that I'm not a malicious robot? This is a script I run about once a day on a single bit of source, so I'm not a burden on their servers by any means. Just someone with a script to make things easier :)

EDIT: Tried switching to mechanize and ignoring robots.txt that way, but I'm now getting a 403 Forbidden response. I suppose they have changed their stance on scraping and have not updated their TOS yet. Time to go to Plan B and stop using the website, unless anyone has other ideas.
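For reference, robots.txt is only an advisory file that well-behaved clients consult themselves; a server-side 403 is a separate mechanism. A minimal sketch of checking robots.txt rules with the standard library's `urllib.robotparser` follows. The rules, the `my-script` agent name, and the example.com URLs are all hypothetical and are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory (no network access).
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given user agent may fetch a given URL under these rules.
allowed_public = parser.can_fetch("my-script", "https://example.com/public/index.html")
allowed_private = parser.can_fetch("my-script", "https://example.com/private/data.html")

print(allowed_public, allowed_private)
```

Against a live site, `set_url()` followed by `read()` would load the real file instead of the inline rules used here.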
