I'm trying to parse my PDF files. One way to do that is to convert them to HTML and then extract the headings along with their paragraphs. So I tried pdf2htmlEX, and it converted my PDF to HTML without disturbing the layout. I ran the conversion with these commands:
>> import subprocess
>> path = "/home/administrator/Documents/pdf_file.pdf"
>> subprocess.call(["pdf2htmlEX" , path])
But when I opened the HTML file, it contained a lot of unnecessary markup along with my text and, more importantly, the text has no heading tags, just a bunch of divs and spans.
>> f = open('/home/administrator/Documents/pdf_file.html','r')
>> f = f.read()
>> print f
I even tried to access it using BeautifulSoup
>> from bs4 import BeautifulSoup as bs
>> soup = bs(f)
>> soup.find('div', attrs={'class': 'site-content'}).h1
It didn't give me anything because there are no such tags. I have also tried HTMLParser:
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class myhtmlparser(HTMLParser):
    def __init__(self):
        self.reset()
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

parser = myhtmlparser()
parser.feed(f)

# Extract data from parser
tags = parser.NEWTAGS
attrs = parser.NEWATTRS
data = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
#print tags
print data
But none of these does what I need. All I want is to extract each heading along with its paragraphs from that HTML file. Is that too much to ask? :p I have searched almost every site and read almost everything on this, but all my effort has been in vain. Please guide me.
If you are on Python 3, the conversion call should be:
outputFilename = outputDir + filename.replace(".pdf",".html")
subprocess.run(["pdf2htmlEX",file,outputFilename])
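As for the headings themselves: since the converted HTML has no h1/h2 tags, one possible workaround is to guess headings from font size. The sketch below is only a heuristic and rests on assumptions about the pdf2htmlEX output that you should check against your own file: that each text line is a div whose class list contains `t` and a size class like `fs0`, and that the stylesheet has rules of the form `.fs0{font-size:48.000000px;}`. The names `TextLineParser`, `split_sections` and `min_ratio` are made up for this example. It treats the size that carries the most text as body text and anything noticeably larger as a heading.

```python
import re
from html.parser import HTMLParser

# Assumed shape of the size rules in the generated stylesheet.
FS_RULE = re.compile(r"\.(fs[0-9a-f]+)\{font-size:([0-9.]+)px;\}")


class TextLineParser(HTMLParser):
    """Collect (font_size_class, text) for every assumed text-line div."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (fs_class, text) pairs
        self._depth = 0    # > 0 while inside a text-line div
        self._fs = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            if tag == "div":
                self._depth += 1  # track nested divs so we close correctly
        elif tag == "div" and "t" in classes:
            self._depth = 1
            self._fs = next((c for c in classes if c.startswith("fs")), None)
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                if text:
                    self.lines.append((self._fs, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def split_sections(html, min_ratio=1.2):
    """Return [(heading, paragraph_text), ...] using font size as a hint."""
    sizes = {name: float(px) for name, px in FS_RULE.findall(html)}
    parser = TextLineParser()
    parser.feed(html)

    # Body size = the size class that carries the most characters.
    counts = {}
    for fs, text in parser.lines:
        counts[fs] = counts.get(fs, 0) + len(text)
    if not counts:
        return []
    body_px = sizes.get(max(counts, key=counts.get), 0.0)

    sections = []
    for fs, text in parser.lines:
        is_heading = body_px > 0 and sizes.get(fs, 0.0) >= body_px * min_ratio
        if is_heading:
            sections.append((text, []))
        elif sections:
            sections[-1][1].append(text)
        else:
            sections.append(("", [text]))  # text before the first heading
    return [(heading, " ".join(body)) for heading, body in sections]
```

You would call it as `split_sections(open(outputFilename, encoding="utf-8").read())`. If your PDF marks headings with a bold font rather than a larger size, the same idea should work with the `ff…` font-family classes instead, but that is something to verify on your own output.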