Transforming PDF to HTML in Python

Falcon · Dec 21, 2016 · Viewed 16.6k times

Python 2.6

I'm trying to parse my PDF files, and one way to do that is to convert them to HTML and then extract the headings along with their paragraphs. So I tried pdf2htmlEX, and it converted my PDF to HTML without disturbing the layout. I ran the conversion like this:

>>> import subprocess
>>> path = "/home/administrator/Documents/pdf_file.pdf"
>>> subprocess.call(["pdf2htmlEX", path])

So far I was happy, but when I opened the HTML file it contained a lot of unnecessary markup along with my text and, more importantly, the text has no heading tags, just a bunch of divs and spans.

>>> f = open('/home/administrator/Documents/pdf_file.html', 'r')
>>> f = f.read()
>>> print f

I even tried to access the headings using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(f, 'html.parser')
>>> soup.find('div', attrs={'class': 'site-content'}).h1

It didn't give me anything because there were no such tags. I have also tried HTMLParser:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
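class myhtmlparser(HTMLParser):
    def __init__(self):
        # HTMLParser is an old-style class in Python 2, so call the base
        # initializer directly; it also calls reset() for us.
        HTMLParser.__init__(self)
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        # Record every opening tag and its attributes, in document order.
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        # Record every run of text between tags.
        self.HTMLDATA.append(data)

    def clean(self):
        # Drop everything collected so far so the parser can be reused.
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []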

parser = myhtmlparser()
parser.feed(f)

# Extract data from parser
tags  = parser.NEWTAGS
attrs = parser.NEWATTRS
data  = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
#print tags
print data

But none of these does what I need. All I want is to extract each heading along with its paragraphs from that HTML file. Is that too much to ask? :p I have searched almost every site and read almost everything on this, but all my efforts have been in vain. Please guide me on this.
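One possible direction, as a rough sketch rather than a tested recipe: since the converted file has no heading tags, headings can only be guessed from styling. The code below assumes (this is not confirmed anywhere above) that pdf2htmlEX marks each text line as a div carrying a font-size class such as fs0, and that the stylesheet defines those classes with rules like .fs0{font-size:48.000000px;}. It treats any line clearly larger than the most common size as a heading. The names TextLineParser, split_sections and min_ratio are made up for illustration, and the code targets Python 3 (html.parser), not the Python 2.6 used above.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed shape of the generated CSS: .fs0{font-size:48.000000px;}
FONT_SIZE_RULE = re.compile(r"\.(fs[0-9a-f]+)\s*\{\s*font-size:\s*([0-9.]+)px")
FS_CLASS = re.compile(r"fs[0-9a-f]+")


class TextLineParser(HTMLParser):
    """Collect (fs_class, text) for every <div> that carries an fs* class."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished (fs_class, text) pairs
        self._fs_class = None  # fs* class of the div currently being read
        self._depth = 0        # nested <div> depth inside that div
        self._chunks = []      # text pieces of the current div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._fs_class is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        fs = next((c for c in classes if FS_CLASS.fullmatch(c)), None)
        if fs:
            self._fs_class, self._depth, self._chunks = fs, 0, []

    def handle_endtag(self, tag):
        if tag != "div" or self._fs_class is None:
            return
        if self._depth:
            self._depth -= 1
            return
        text = "".join(self._chunks).strip()
        if text:
            self.lines.append((self._fs_class, text))
        self._fs_class = None

    def handle_data(self, data):
        if self._fs_class is not None:
            self._chunks.append(data)


def split_sections(html, min_ratio=1.2):
    """Return [(heading, [lines...]), ...] using font size as the only cue."""
    sizes = {name: float(px) for name, px in FONT_SIZE_RULE.findall(html)}
    parser = TextLineParser()
    parser.feed(html)
    known = [(sizes[fs], text) for fs, text in parser.lines if fs in sizes]
    if not known:
        return []
    # The size that covers the most characters is taken to be body text.
    weight = Counter()
    for size, text in known:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    sections = []
    for size, text in known:
        if size >= body_size * min_ratio:
            sections.append((text, []))      # start a new section
        elif sections:
            sections[-1][1].append(text)     # attach line to current heading
        # Lines that appear before the first heading are skipped.
    return sections
```

Calling split_sections(open(path).read()) on a converted file would give a list of (heading, lines) pairs. Font size alone is a weak signal, so bold headings set at body size would be missed, and the 1.2 ratio is an arbitrary starting point to tune per document.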

Answer

Rachel Liu · Apr 13, 2019

On Python 3 (3.5 and later, where subprocess.run is available), it should be:

outputFilename = outputDir + filename.replace(".pdf",".html")
subprocess.run(["pdf2htmlEX",file,outputFilename])
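The two lines above leave file, filename and outputDir undefined, so here is a slightly fuller sketch of the same idea. It assumes Python 3.5+ and that pdf2htmlEX is on the PATH. The --dest-dir option is used as I understand it from pdf2htmlEX's command-line help (treat it as an assumption and check pdf2htmlEX --help on your install). build_command and convert are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir):
    """Return the pdf2htmlEX argument list and the expected HTML path."""
    pdf_path = Path(pdf_path)
    # with_suffix only changes the extension, unlike str.replace(".pdf", ...),
    # which would also rewrite ".pdf" appearing elsewhere in the name.
    html_name = pdf_path.with_suffix(".html").name
    # Assumed CLI shape: pdf2htmlEX --dest-dir <dir> <input.pdf> <output.html>
    cmd = ["pdf2htmlEX", "--dest-dir", str(output_dir), str(pdf_path), html_name]
    return cmd, Path(output_dir) / html_name


def convert(pdf_path, output_dir):
    """Run pdf2htmlEX and return the path of the HTML file it should write."""
    cmd, html_path = build_command(pdf_path, output_dir)
    # check=True raises CalledProcessError on a non-zero exit status, so a
    # failed conversion is not silently ignored.
    subprocess.run(cmd, check=True)
    return html_path
```

For example, convert("/home/administrator/Documents/pdf_file.pdf", "/home/administrator/Documents") would be expected to produce pdf_file.html in the same directory.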