How to extract table as text from the PDF using Python?

venkat picture venkat · Nov 28, 2017 · Viewed 120.9k times · Source

I have a PDF which contains Tables, text and some images. I want to extract the table wherever tables are there in the PDF.

Right now am doing manually to find the Table from the page. From there I am capturing that page and saving into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf" #filename of your PDF/directory where your PDF is stored

pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

pg4 = pfr.getPage(126) #extract pg 127

writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
#add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf" #filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream) #write pages to new PDF

My goal is to extract the table from the whole PDF document.

Please have a look at the sample image of a page in PDF

Answer

Himanshu Poddar picture Himanshu Poddar · Oct 29, 2018
  • I would suggest you to extract the table using tabula.
  • Pass your pdf as an argument to the tabula api and it will return you the table in the form of dataframe.
  • Each table in your pdf is returned as one dataframe.
  • The table will be returned in a list of dataframea, for working with dataframe you need pandas.

This is my code for extracting pdf.

import pandas as pd
import tabula
file = "filename.pdf"
path = 'enter your directory path here'  + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)

Please refer to this repo of mine for more details.