Working with tables in a PDF using Python

sam · Mar 20, 2012 · Viewed 12.4k times

I am working with a PDF file that contains a number of tables.
I want to fetch the data from a given table, identified by the table name shown in the PDF, using Python.

I have worked with HTML and XML parsing, but never with PDF.
Can anyone tell me how to fetch tables from a PDF using Python?

Answer

Sandro Munda · Mar 21, 2012

I think you need a Python PDF parsing library. The best known is PDFMiner.

According to the documentation:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
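Because PDFMiner reports the exact position of each piece of text, one way to approximate table extraction is to group text lines that share roughly the same vertical position into rows and order them left to right. The following is only a rough sketch under several assumptions: it uses the maintained pdfminer.six fork (its `extract_pages` helper did not exist in the original 2012 PDFMiner), the helper names `group_into_rows`, `rows_after_heading`, and `iter_text_fragments` are made up for illustration, and the 3-point `y_tolerance` is an arbitrary guess that will need tuning for a real document. PDFMiner itself has no notion of a "table", so this heuristic will not handle merged cells or multi-line cells.

```python
def group_into_rows(fragments, y_tolerance=3.0):
    """Group (x0, y0, text) fragments into table-like rows.

    Fragments whose y0 lies within y_tolerance of the first fragment in
    the current row are treated as the same row. Rows are returned top
    to bottom, cells left to right.
    """
    rows = []  # each entry: (row_y, [(x0, text), ...])
    # PDF coordinates grow upward, so sort by descending y to read top-down.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rows_after_heading(rows, heading):
    """Return the rows following the first row that contains `heading`."""
    for i, row in enumerate(rows):
        if any(heading in cell for cell in row):
            return rows[i + 1:]
    return []


def iter_text_fragments(pdf_path, page_number=0):
    """Yield (x0, y0, text) for each text line on one page (zero-indexed)."""
    # Imported here so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        text = line.get_text().strip()
                        if text:
                            yield (line.x0, line.y0, text)
```

With a hypothetical file `report.pdf` containing a table captioned "Table 1", usage might look like `rows = group_into_rows(iter_text_fragments("report.pdf"))` followed by `rows_after_heading(rows, "Table 1")`. Deciding where the table ends (for example, at the next caption) is left to the caller.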