import urllib2
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = getwebsite.read()
print html
So far so good.
But I want only href links from the plain text HTML. How can I solve this problem?
Try with Beautifulsoup:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
print link.get('href')
In case you just want links starting with http://
, you should use:
soup.findAll('a', attrs={'href': re.compile("^http://")})
In Python 3 with BS4 it should be:
from bs4 import BeautifulSoup
import urllib.request
html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
print(link.get('href'))