How to convert the extracted text from PDF to JSON or XML format in Python?

Avi · Oct 6, 2018 · Viewed 10.6k times

I am using PyPDF2 to extract the data from a PDF file and convert it to text.

The content of the PDF looks like this:

Name : John 
Address: 123street , USA 
Phone No:  123456
Gender: Male 

Name : Jim 
Address:  456street , USA 
Phone No:  456899
Gender: Male 

In Python I am using this code:

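import PyPDF2

# Legacy PyPDF2 (< 3.0) API; newer releases use PdfReader, reader.pages and extract_text().
with open('C:\\Users\\Desktop\\Sampletest.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)  # first page only
    page_content = page.extractText()
page_content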

This is the output I get from page_content:

 'Name : John \n \nAddress: 123street , USA \n \nPhone No:  123456\n \nGender: Male \n \n \nName : Jim \n \nAddress:  456street , USA \n \nPhone No:  456899\n \nGender: Male \n \n \n'

How do I convert it to JSON or XML so that I can load the extracted data into a SQL Server database?

I also tried this approach:

import json
data = json.dumps(page_content)
formatj = json.loads(data)
print (formatj)

Output:

Name : John 
Address: 123street , USA 
Phone No:  123456
Gender: Male 

Name : Jim 
Address:  456street , USA 
Phone No:  456899
Gender: Male 

This is the same output that I have in my Word file, but I don't think it is in JSON format.
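
The round trip above returns the original text because json.dumps on a plain string only wraps it in a JSON string literal, and json.loads unwraps it again; no keys or values are ever identified. A minimal sketch of that behaviour, using a shortened sample string as a stand-in for page_content:

```python
import json

# Shortened stand-in for the extracted page text
text = 'Name : John \n \nAddress: 123street , USA \n'

encoded = json.dumps(text)     # a JSON string literal: quoted, with newlines escaped
decoded = json.loads(encoded)  # decoding gives back the original Python str

print(encoded)
print(decoded == text)
```

To get structured JSON, the text first has to be parsed into a dictionary or a list of dictionaries.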

Answer

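UtahJarhead · Oct 6, 2018

It's not pretty, but I think this would get the job done. You get a dictionary, which json.dumps then serializes as indented JSON. Note that a plain dictionary holds one value per key, so if the page contains several records with the same keys (as in the sample above), later values overwrite earlier ones.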

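import json

def get_data(page_content):
    _dict = {}
    page_content_list = page_content.splitlines()
    for line in page_content_list:
        # Skip blank lines and anything that is not a "key: value" pair
        if ':' not in line:
            continue
        # Split on the first colon only, so values containing ':' stay intact
        key, value = line.split(':', 1)
        _dict[key.strip()] = value.strip()
    return _dict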

page_data = get_data(page_content)
json_data = json.dumps(page_data, indent=4)
print(json_data)

Or, instead of those last three lines, just do this:

print(json.dumps(get_data(page_content), indent=4))
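
Since the sample page holds two records with the same keys, one possible extension is to collect a list of dictionaries and serialize that to either JSON or XML. The sketch below is only an illustration: parse_records and records_to_xml are hypothetical helper names, the XML tag names (people, person) are made up, and it assumes that a repeated key marks the start of the next record.

```python
import json
import xml.etree.ElementTree as ET


def parse_records(page_content):
    """Split 'key: value' lines into a list of dicts, one per record."""
    records = []
    current = {}
    for line in page_content.splitlines():
        if ':' not in line:
            continue
        key, value = (part.strip() for part in line.split(':', 1))
        # Assumption: a key that repeats marks the start of the next record.
        if key in current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def records_to_xml(records):
    """Build an XML string from the records (tag names are hypothetical)."""
    root = ET.Element('people')
    for record in records:
        person = ET.SubElement(root, 'person')
        for key, value in record.items():
            # XML tag names cannot contain spaces, so 'Phone No' becomes 'Phone_No'.
            child = ET.SubElement(person, key.replace(' ', '_'))
            child.text = value
    return ET.tostring(root, encoding='unicode')


# Same text as the page_content shown in the question
page_content = (
    'Name : John \n \nAddress: 123street , USA \n \nPhone No:  123456\n \n'
    'Gender: Male \n \n \nName : Jim \n \nAddress:  456street , USA \n \n'
    'Phone No:  456899\n \nGender: Male \n \n \n'
)

records = parse_records(page_content)
print(json.dumps(records, indent=4))
print(records_to_xml(records))
```

Each dictionary in records maps naturally to one row, which may make it easier to load into a database table than a single flat dictionary.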