what is the best way to extract data from pdf

Question 1

what is the best way to extract data from pdf

python node.js pdf pdf-scraping

e.iluf · Sep 14, 2019 · Viewed 14.9k times · Source

Answer

Answer

I did something similar to scrape my grades a long time ago. The easiest (not pretty) solution I found was to convert the pdf to html, then parse the html.

To do so I used pdf2text/pdf2html (https://pypi.org/project/pdf-tools/) and html.
I also used codecs but don't remember exactly the why behind this.

A quick and dirty summary:

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span><div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns an array of elements, let's process it
for elm in extracted_data:
    # Do things
file.close()

Just check the result of pdf2text or pdf2html, then using xpath you should extract your information easily.

I hope it helps!

EDIT: comment code

EDIT2: The following code is printing your data

# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")

from lxml import html
import codecs
import os

file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
    line_elements = elm.split()
    # Just observed that what you need starts with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)
file.close();

Question 2

I have thousands of pdf file that I need to extract data from.This is an example pdf. I want to extract this information from the example pdf.

I am open to nodejs, python or any other effective method. I have little knowledge in python and nodejs. I attempted using python with this code

but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the pdf?

what is the best way to extract data from pdf

Answer

Related questions