I have written my first bit of Python code to scrape a website.
import csv
import urllib2
from BeautifulSoup import BeautifulSoup
c = csv.writer(open("data.csv", "wb"))
soup = BeautifulSoup(urllib2.urlopen('http://www.kitco.com/kitco-gold-index.html').read())
table = soup.find('table', id="datatable_main")
rows = table.findAll('tr')[1:]
for tr in rows:
    cols = tr.findAll('td')
    text = []
    for td in cols:
        text.append(td.find(text=True))
    c.writerow(text)
When I test it locally in my IDE (PyCharm) it works fine, but when I run it on my server, which runs CentOS, I get the following error:
domainname.com [~/public_html/livegold]# python scraper.py
Traceback (most recent call last):
  File "scraper.py", line 8, in <module>
    rows = table.findAll('tr')[:]
AttributeError: 'NoneType' object has no attribute 'findAll'
I'm guessing a module isn't installed on the server. I've been stuck on this for two days, so any help would be greatly appreciated! :)
You are ignoring any errors that could occur in urllib2.urlopen. If, for some reason, the request from your server fails or returns something different from what you get when testing locally, you are effectively passing an empty string ('') or a page you don't expect (such as an error page) to BeautifulSoup. That in turn makes soup.find('table', id="datatable_main") return None, since the document does not contain the table you are looking for, and calling findAll on None raises the AttributeError in your traceback.
You should either make sure your server can actually fetch the page you are after, or handle exceptions properly.
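As a rough illustration of what handling this could look like, here is a minimal sketch. It assumes Python 3 with the third-party bs4 package (BeautifulSoup 4) rather than the urllib2 and BeautifulSoup 3 modules used in the question. The function names fetch_html and table_rows, the opener parameter, and the 30-second timeout are made up for this example. Under Python 2 the same idea should apply by catching urllib2.HTTPError and urllib2.URLError.

```python
import csv
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

URL = "http://www.kitco.com/kitco-gold-index.html"


def fetch_html(url, opener=urlopen):
    """Return the page body, or raise RuntimeError with a readable reason.

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network; by default it is urllib's urlopen.
    """
    try:
        return opener(url, timeout=30).read()
    except HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        # HTTPError is a subclass of URLError, so it must be caught first.
        raise RuntimeError("HTTP %d for %s" % (exc.code, url))
    except URLError as exc:
        # No usable answer at all (DNS failure, refused connection, ...).
        raise RuntimeError("could not reach %s: %s" % (url, exc.reason))


def table_rows(soup, table_id="datatable_main"):
    """Return the rows of the table, skipping the first (header) row."""
    table = soup.find("table", id=table_id)
    if table is None:
        # The page was fetched, but it is not the page we expected.
        raise RuntimeError("no <table id=%r> in the fetched page" % table_id)
    return table.find_all("tr")[1:]


def main():
    # Assumption: BeautifulSoup 4 is installed (pip install beautifulsoup4).
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(fetch_html(URL), "html.parser")
    with open("data.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        for tr in table_rows(soup):
            # First text node of each cell, as in the original script.
            writer.writerow([td.find(string=True) for td in tr.find_all("td")])
```

Calling main() on the server should then fail with a message that says either that the request itself went wrong or that the page came back without the expected table. It can also help to print or save the first few hundred bytes of what fetch_html returns on the server and compare that with what you see locally.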