PDF table extraction

Vilius picture Vilius · Apr 24, 2012 · Viewed 23.1k times · Source

I have (same) data saved as a GIF image file and as a PDF file and I want to parse it to HTML or XML. The data is actually the menu for my university's cafeteria. That means that there is a new version of the file that has to be parsed each week! In General, the files contain some header and footer text, as well as a table full of other data in between. I have read some posts on stackoverflow and I also had started some attempts to parse out the table data as HTML/XML:

PDF

  • PDFBox || iText (Java)
  • Google Docs Import
  • PDF2HTML || PDF2Table

GIF

  • Tesseract-OCR

I have got the best result from parsing the PDF-file with PDFBox, but still (as the menu changes weekly), it is not reliable enough. The HTML that I receive includes sometimes more, sometimes less "paragraphs" (<p>), so that I am not able to parse the data precice enough.

That is why I would like to know if there is an other way to do it?

Answer

thadk picture thadk · Jan 29, 2014

Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.