Extract data from a static HTML file using Python 3.5

user73324 · Jan 3, 2017 · Viewed 7k times

I have a static HTML page saved on my local machine. I tried both a plain file open and BeautifulSoup. The plain file open doesn't read the entire HTML file because of a Unicode error, and the BeautifulSoup version only works for live websites.

#with beautifulSoup
from bs4 import BeautifulSoup
import urllib.request
url="Stack Overflow.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.find_all('a',class_='institution')
for university in universities:
    print(university['href']+","+university.string)


#Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

After reading the HTML, I want to extract data from the ul and li elements that don't have any attributes. Any recommendations are welcome.

Answer

yumere · Jan 3, 2017

I'm not sure exactly what you mean, but I understand that you want to read the whole HTML file from local storage and parse parts of its DOM with bs4. Is that right?

Here is some code I'd suggest:

from bs4 import BeautifulSoup

# Read the saved page straight from disk; urllib is not needed for a local file.
with open("Stack Overflow.html", encoding="utf-8") as f:
    data = f.read()

# Name the parser explicitly so bs4 does not have to pick one itself.
soup = BeautifulSoup(data, 'html.parser')
# universities = soup.find_all('a', class_='institution')
# for university in universities:
#     print(university['href'] + "," + university.string)

# Print the text of every attribute-free <li> inside an attribute-free <ul>.
for ul in soup.select("ul"):
    if not ul.attrs:
        for li in ul.select("li"):
            if not li.attrs:
                print(li.get_text().strip())
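
The question mentions a Unicode error when the file is opened in text mode. One possible workaround, sketched here as an assumption rather than a confirmed fix for that particular file, is to read the file as bytes and let Beautiful Soup work out the encoding itself. The helper names `plain_list_items` and `plain_list_items_from_file` are made up for this illustration. Separately, the `urllib.request.urlopen("Stack Overflow.html")` attempt in the question fails because `urlopen` expects a URL with a scheme rather than a bare file name, so a plain `open` is the simpler route for a local file.

```python
from bs4 import BeautifulSoup


def plain_list_items(markup):
    """Return the stripped text of each attribute-free <li> found inside
    an attribute-free <ul>. `markup` may be str or bytes; when it is bytes,
    Beautiful Soup tries to detect the encoding on its own."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for ul in soup.find_all("ul"):
        if ul.attrs:
            continue  # skip lists that carry any attribute (class, id, ...)
        for li in ul.find_all("li"):
            if not li.attrs:
                items.append(li.get_text().strip())
    return items


def plain_list_items_from_file(path):
    """Hypothetical convenience wrapper: read the file in binary mode so
    Python does not decode it, then hand the raw bytes to the parser."""
    with open(path, "rb") as f:
        return plain_list_items(f.read())
```

Calling `plain_list_items_from_file("Stack Overflow.html")` would then return a list of strings instead of printing them. If the decoded text still looks wrong, passing an explicit `from_encoding=` argument to `BeautifulSoup` is another option to try.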