I'm trying to scrape data from the public site asx.com.au
The page http://www.asx.com.au/asx/research/company.do#!/ACB/details contains a div
with class 'view-content', which has the information I need:
But when I try to view this page via Python's urllib2.urlopen, that div is empty:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.asx.com.au/asx/research/company.do#!/ACB/details'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
contentDiv = soup.find("div", {"class": "view-content"})
print(contentDiv)
# the result is an empty div:
# <div class="view-content" ui-view=""></div>
Is it possible to access the contents of that div programmatically?
Edit: as per the comment, it appears that the content is rendered via Angular.js. Is it possible to trigger the rendering of that content via Python?
This page uses JavaScript to read data from the server and fill the page.
I see you use the developer tools in Chrome: look in the Network tab at the XHR or JS requests.
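As a small illustration of why the downloaded HTML is only an empty shell: the part of the question's URL after `#` is a fragment, which by HTTP convention is handled by the browser (here, by the page's JavaScript) and is not part of the request sent to the server. This is a minimal sketch using only the standard library in Python 3; it does not contact the site.

```python
from urllib.parse import urldefrag

# URL from the question; everything after '#' is the fragment.
url = "http://www.asx.com.au/asx/research/company.do#!/ACB/details"

# urldefrag splits the URL into the part that identifies the resource
# on the server and the fragment that only the client-side code sees.
base, fragment = urldefrag(url)

print(base)      # http://www.asx.com.au/asx/research/company.do
print(fragment)  # !/ACB/details
```

So the server has no way to know which company was requested; the page's JavaScript reads the fragment and then asks for the data separately.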
I found this url:
This URL returns all the data in almost-JSON format (the JSON is wrapped in a callback).
But if you use this link without &callback=angular.callbacks._0
then you get the data in pure JSON format, and you can use the json
module to convert it to a Python dictionary.
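If the callback parameter cannot be dropped, the wrapper can also be removed in Python before parsing. The following is only a sketch: the helper name `strip_jsonp` and the sample payload are made up for illustration and are not real responses from the site.

```python
import json
import re


def strip_jsonp(text):
    """Return the JSON inside a wrapper such as callback({...}); (hypothetical helper)."""
    match = re.match(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    # If the text is not wrapped in a callback, return it unchanged.
    return match.group(1) if match else text


# Invented sample that only imitates the shape of a wrapped response.
sample = 'angular.callbacks._0({"code": "ACB", "example_field": "some text"});'

data = json.loads(strip_jsonp(sample))
print(data["code"])
```

Requesting the URL without the callback parameter is still simpler when the server allows it.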
EDIT: working code
import urllib2
import json
# new url
url = 'http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices'
# read all data
page = urllib2.urlopen(url).read()
# convert json text to python dictionary
data = json.loads(page)
print(data['principal_activities'])
Output:
Mineral exploration in Botswana, China and Australia.
EDIT (2020.12.23)
This answer is almost 5 years old and was written for Python 2. In Python 3 it would need urllib.request.urlopen() or requests.get(), but the real problem is that in those 5 years the page has changed its structure and technology. The URLs in the question and in the answer no longer exist. The page needs a new analysis and a new method.
The question used the URL
http://www.asx.com.au/asx/research/company.do#!/ACB/details
but the page currently uses the URL
https://www2.asx.com.au/markets/company/acb
And it uses different URLs for its AJAX/XHR requests:
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/announcements
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/key-statistics
etc.
You can find more URLs using DevTools in Chrome/Firefox (tab: Network, filter: XHR).
import urllib.request
import json
# new url
url = 'https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about'
# read all data
page = urllib.request.urlopen(url).read()
# convert json text to python dictionary
data = json.loads(page)
print(data['data']['description'])
Output:
Minerals exploration & development
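The endpoints listed above differ only in the company code and the last path segment, so they can be built with a small helper. This is a sketch under assumptions: the names `build_url` and `fetch_json` are hypothetical, the User-Agent header is a guess (some servers reject the default Python one), and whether these endpoints still respond has not been verified.

```python
import json
import urllib.request

# Base path taken from the URLs listed above.
BASE = "https://asx.api.markitdigital.com/asx-research/1.0/companies"


def build_url(ticker, section):
    """Build an endpoint URL such as .../companies/acb/about (hypothetical helper)."""
    return "{}/{}/{}".format(BASE, ticker.lower(), section)


def fetch_json(url, timeout=10):
    """Download a URL and parse the body as JSON (hypothetical helper)."""
    # Assumption: a browser-like User-Agent may be needed; adjust or remove as required.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

It could be used as `fetch_json(build_url("ACB", "about"))["data"]["description"]`, which mirrors the code above.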