Extract table data from PDF

Rajneesh · May 6, 2014 · Viewed 12k times

Is there any consistent way to extract tables from PDF files? Any tools?

What I have done so far:

  • I have tried the pdftotext tool. It has an option to convert to an HTML layout.

What is the problem with this:

  • The table information is not preserved in HTML output
  • I expected <table> tags, but everything was under <p> tags.

Are there any markers in a PDF document that indicate table structures, like <table>, <tr> and <td> in HTML?

If "yes", any pointers would be helpful. If "no", a definite statement of that fact would also be helpful.

Answer

user281681 · Jun 9, 2014

What you could do, however, is run pdftotext -layout input.pdf output.txt. This writes the PDF's text to a plain text file and keeps the original physical layout. There are no tags, but with a bit of scripting (Perl, PHP, whatever) you can recover the data from the tables.
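As a rough illustration of that kind of scripting, here is a minimal Python sketch. It assumes that columns in the -layout output are separated by two or more spaces and that cell text only ever contains single spaces; that will not hold for every PDF. The function names and the sample text are made up for the example.

```python
import re


def split_layout_line(line):
    """Split one line of layout-preserving text into cells.

    Assumption: two or more whitespace characters separate columns,
    while a single space belongs to the cell text.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_layout_table(text):
    """Return a list of rows (lists of cells), skipping blank lines."""
    return [split_layout_line(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample, shaped like what `pdftotext -layout` might write.
sample = (
    "Name          Qty     Price\n"
    "Blue widget     3      4.50\n"
    "\n"
    "Red gadget     12     19.99\n"
)

rows = parse_layout_table(sample)
for row in rows:
    print(row)
```

In practice you would read output.txt instead of the inline sample. Rows with empty cells will come out misaligned with this approach, because an empty cell just looks like a wider gap.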

If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work through hundreds or thousands of pages, it's about the best you can get. I've been looking around for a long time and haven't found a better PDF-to-text tool than pdftotext.

The output is somewhat inconsistent: similar-looking PDF tables do not always produce similar-looking text output, but that makes your scripting a little more interesting.
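One possible way to cope with that inconsistency is to infer column boundaries from the table as a whole: treat a character position as a gap only if it is blank on every line, then cut each line at those gaps. The sketch below is an assumption about what might work, not a tested recipe for any particular PDF; the helper names, the min_gap threshold and the sample rows are all hypothetical.

```python
def find_column_spans(lines, min_gap=2):
    """Return (start, end) character spans, end exclusive, one per column.

    A position is "occupied" if any line has a non-space character there.
    A run of at least `min_gap` unoccupied positions ends a column.
    """
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    occupied = [any(row[i] != " " for row in padded) for i in range(width)]

    spans = []
    start = None  # start index of the column currently being scanned
    gap = 0       # length of the current run of unoccupied positions
    for i, used in enumerate(occupied):
        if used:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def slice_rows(lines, spans):
    """Cut every line at the given spans and strip each cell."""
    return [[line[a:b].strip() for a, b in spans] for line in lines]


# Hypothetical fixed-width rows; the last one has an empty middle cell.
fmt = "{:<14}{:>3}{:>10}"
lines = [
    fmt.format("Name", "Qty", "Price"),
    fmt.format("Blue widget", "3", "4.50"),
    fmt.format("Red gadget", "", "19.99"),
]

spans = find_column_spans(lines)
table = slice_rows(lines, spans)
for row in table:
    print(row)
```

Because the boundaries come from all lines together, an empty cell stays in its column as an empty string. The trade-off is that the lines passed in must belong to one table; mixing in surrounding paragraphs will fill the gaps and merge columns.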