I have thousands of pdf file that I need to extract data from.This is an example pdf. I want to extract this information from the example pdf.
I am open to nodejs, python or any other effective method. I have little knowledge in python and nodejs. I attempted using python with this code
but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the pdf?
I did something similar to scrape my grades a long time ago. The easiest (not pretty) solution I found was to convert the pdf to html, then parse the html.
To do so I used pdf2text/pdf2html (https://pypi.org/project/pdf-tools/) and html.
I also used codecs but don't remember exactly the why behind this.
A quick and dirty summary:
from lxml import html
import codecs
import os
# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")
# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()
# We know we're dealing with html, let's load it
html_file = html.fromstring(data)
# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span><div>
extracted_data = html_file.xpath('//div//span/text()')
# It returns an array of elements, let's process it
for elm in extracted_data:
# Do things
file.close()
Just check the result of pdf2text or pdf2html, then using xpath you should extract your information easily.
I hope it helps!
EDIT: comment code
EDIT2: The following code is printing your data
# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")
from lxml import html
import codecs
import os
file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)
# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')
for elm in extracted_data:
line_elements = elm.split()
# Just observed that what you need starts with a number
if len(line_elements) > 0 and line_elements[0].isdigit():
print(line_elements)
file.close();