Extract table from a PDF

python pdf pdf-parsing

meadhikari · Jul 11, 2013 · Viewed 9.5k times · Source

I am trying to extract a table from a pdf document

I tried the route of pdf -> html -> extract table. The pdf that I mentioned above when converted to html produces garbage, maybe because of the font, the document is not in english.

Extracting the pdf using x and y coordinate is not an option as this solution needs to work for future pdf from the url mention above which will have the table but not always in the same position.

Please help,

Thanks in advance.

Answer

The PDF does not contain explicit table data. It only contains lines and character glyphs which we tend to interpret as tables. Thus your task involves putting our human table recognition capabilities into code which is quite a task.

Generally speaking, if you are sure enough future PDFs will be generated by the same software in a very similar manner, it might be worth the time to investigate the file for some easy to follow hints to recognize the contents of individual fields.

Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.

This is due to the fact that all fonts in the document claim that they use WinAnsiEncoding even though the characters referenced this way definitively are not from the WinAnsi character selection.

Thus reliable text extraction from your document without OCR is impossible after all!

(Trying copy&paste from Adobe Reader generally is a good first test whether text extraction is feasible at all; the text extraction methods of the Reader have been developed for many many years and, therefore, have become quite good. If you cannot extract anything sensible with Acrobat Reader, text extraction will be a very difficult task indeed.)

Extract table from a PDF

Answer

Related questions