I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off a betting website and put that data into Excel.
Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is sometimes live, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.
Now, my experience with dynamic web content is low, so this is something I'm having trouble getting my head around.
I think Java or JavaScript is the key; these terms pop up often.
The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.
I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data, so that I can get these betting odds in real time?
Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.
All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...):
When I analyze the source code of the page I can't see these messages, because the web page uses AJAX technology. But I can use Firebug in Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP requests that generate the messages on the web page:
It doesn't reload the whole page, only the parts of the page that contain the messages. To trigger this, I click an arbitrary page number at the bottom:
And I observe the HTTP request that is responsible for the message body:
When it finishes, I analyze the headers of the request (I should note that I'll extract this URL from the page source, from the var section; see the code below):
And the form data content of the request (the HTTP method is "POST"):
And the content of the response, which is a JSON file:
Which presents all the information I'm looking for.
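Before moving this into scrapy, it can be useful to replay the same POST request outside the browser to confirm that the endpoint returns this JSON directly. Here is a minimal sketch using the requests library; the endpoint path below is a hypothetical placeholder, since the real URL is extracted from the page source (see the spider further down):

import requests

# Hypothetical endpoint path; on the real site it is extracted from the page
# source (the url_list_gb_messages variable).
ajax_url = 'http://www.rubin-kazan.ru/some/guestbook/endpoint'

resp = requests.post(ajax_url, data={'page': '1', 'uid': ''})
print(resp.json())  # the same JSON the browser receives via AJAX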
Now I must implement all this knowledge in scrapy. Let's define the spider for this purpose:
import re
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class spider(BaseSpider):
    name = 'RubiGuesst'
    start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

    def parse(self, response):
        # The AJAX URL is stored in a JavaScript variable in the page source.
        url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
        # Replay the POST request that loads the messages (here, the first page).
        yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                          formdata={'page': '1', 'uid': ''})

    def RubiGuessItem(self, response):
        json_file = response.body
In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.
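From there, the callback only needs to decode the JSON and turn it into items. A minimal sketch of how RubiGuessItem could continue; the key names here are hypothetical, since they depend on the site's actual JSON structure (with Scrapy 1.0+ a plain dict can be yielded as an item):

import json

def RubiGuessItem(self, response):
    data = json.loads(response.body)
    # Key names below are hypothetical; adjust them to the real JSON structure.
    for message in data.get('messages', []):
        yield {
            'author': message.get('author'),
            'date': message.get('date'),
            'text': message.get('text'),
        }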