What is the best way to extract data from a PDF?

e.iluf · Sep 14, 2019 · Viewed 14.9k times

I have thousands of PDF files that I need to extract data from. This is an example PDF. I want to extract this information from the example PDF.

I am open to Node.js, Python, or any other effective method. I have little knowledge of Python or Node.js. I attempted this in Python with this code

but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the PDF?

Answer

ZHAJOR · Sep 15, 2019

I did something similar to scrape my grades a long time ago. The easiest (if not the prettiest) solution I found was to convert the PDF to HTML, then parse the HTML.

To do so, I used pdf2text/pdf2html (https://pypi.org/project/pdf-tools/) and the html module from lxml.
I also used codecs, though I don't remember exactly why; in the code below it opens the converted file as UTF-8.
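
On the codecs question: in Python 3, the built-in open() with an explicit encoding decodes a UTF-8 file much like codecs.open does, so either should work for reading the converted file. A minimal sketch to check this, where the helper names and the sample content are made up for illustration:

```python
import codecs
import tempfile
from pathlib import Path


def read_with_codecs(path):
    """Read a UTF-8 file via codecs.open, as the answer does."""
    with codecs.open(str(path), "r", "utf-8") as handle:
        return handle.read()


def read_with_open(path):
    """Read the same file with the built-in open() and an explicit encoding."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def demo():
    # Hypothetical sample content; the euro sign cannot be stored as ASCII.
    sample = "<span>Unit price: 12 €</span>\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.html"
        # Write raw bytes so the file content is identical on every platform.
        path.write_bytes(sample.encode("utf-8"))
        return read_with_codecs(path), read_with_open(path)


if __name__ == "__main__":
    via_codecs, via_open = demo()
    print(via_codecs == via_open)
```

One difference to keep in mind: the built-in open() translates line endings in text mode, while codecs.open reads the underlying file in binary mode and leaves them alone.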

A quick and dirty summary:

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file as UTF-8 and read it; the with block closes it afterwards
with codecs.open("file.html", "r", "utf-8") as file:
    data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span></div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns a list of strings, let's process it
for elm in extracted_data:
    # Do things with each string; printing is only a placeholder
    print(elm)

Inspect the output of pdf2text or pdf2html to see how the text is nested in the markup, then write an XPath expression that selects the elements you need.
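
Since the question mentions thousands of files, the conversion step can be run in a loop. This is only a sketch: it assumes the `pdf2txt -o file.html file.pdf` invocation from the code above works on your machine, and `html_path_for`, `build_command`, and `convert_all` are hypothetical helper names, not part of any library.

```python
import subprocess
from pathlib import Path


def html_path_for(pdf_path, out_dir):
    """Return the output path: same base name as the PDF, with .html."""
    return Path(out_dir) / (Path(pdf_path).stem + ".html")


def build_command(pdf_path, html_path):
    """Build the same command line the answer runs with os.system."""
    return ["pdf2txt", "-o", str(html_path), str(pdf_path)]


def convert_all(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir; return the PDFs whose conversion failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        html_path = html_path_for(pdf_path, out_dir)
        # Raises FileNotFoundError if the pdf2txt command is not installed.
        result = subprocess.run(build_command(pdf_path, html_path))
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Calling `convert_all("pdfs", "html")` would then leave one HTML file per PDF in the `html` directory (both directory names are placeholders), and the returned list tells you which files to look at by hand.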

I hope it helps!

EDIT: commented the code

EDIT2: The following code prints your data

# Assuming you only provide page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")

from lxml import html
import codecs
import os  # only needed if you uncomment the conversion line above

# Open the converted page as UTF-8; the with block closes it afterwards
with codecs.open("test-page4.html", "r", "utf-8") as file:
    data = file.read()
html_file = html.fromstring(data)
# I updated the xpath to your needs
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
    line_elements = elm.split()
    # I observed that the lines you need start with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)
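
The "starts with a number" filter can also be pulled out into a small function so it is easy to try on a few strings before running it on real files. A minimal sketch: `rows_starting_with_number` is a hypothetical name, and the sample strings are invented for illustration, not taken from the example PDF.

```python
def rows_starting_with_number(text_nodes):
    """Keep only the strings whose first whitespace-separated field is all digits."""
    rows = []
    for text in text_nodes:
        fields = text.split()
        # Empty or whitespace-only strings split into an empty list and are skipped.
        if fields and fields[0].isdigit():
            rows.append(fields)
    return rows


# Invented sample input, standing in for the list returned by xpath().
sample_nodes = [
    "Procurement History",
    "   ",
    "1 2019-01-15 100 EA",
    "2 2018-07-02 250 EA",
    "1,500 total",
]

if __name__ == "__main__":
    for row in rows_starting_with_number(sample_nodes):
        print(row)
```

Note that str.isdigit() is strict: a first field such as "1,500" or "3." is rejected, so this check may need loosening if the real rows are numbered that way.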