Programmatically Extract PDF Tables

markdigi · Aug 6, 2010 · Viewed 18.9k times

I have a bunch of PDF docs with tabular data in them which I need to extract into a more readable format to store in a spreadsheet, database or whatever.

Is there anything out there (preferably free) that can pull tabular data out of PDFs in bulk and into a more readable format, whether it is built into an app, run from the command line, or called in a loop from code (.NET)?

The output can be any format really (doc, html), just as long as the table structure is preserved.

Everything I've found so far is either a one-off (it only does one doc at a time, and I have hundreds, so that isn't happening) or does not maintain the table structure.

If you have any ideas, please post.

Answer

andersoj · Oct 15, 2010

This is a giant hassle. In general, extracting the text content of a PDF file is running against the grain of what PDF wants you to do.

Start by trying to get the text out. This may be more or less successful, depending on how the PDF is built. One place to start is Ghostscript or pstotext. If that fails you, this guy has a list of text extraction tools. Once you have the text stream, you could then try to reassemble the tabular structure programmatically.
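To make the "reassemble the tabular structure" step a bit more concrete, here is a minimal sketch in Python. It assumes the text you extracted keeps columns apart with runs of two or more spaces and that single spaces only occur inside a cell; that will not hold for every PDF. The function names and the sample text are made up for illustration and do not come from any particular tool.

```python
import csv
import io
import re

# Assumption: columns in the extracted text are separated by runs of
# two or more spaces, and single spaces occur only inside a cell.
COLUMN_GAP = re.compile(r" {2,}")


def rows_from_text(text):
    """Split each non-blank line of extracted text into a list of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text so a spreadsheet can open the result."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample of what a text extraction tool might emit for one table.
sample = """
Region        Units sold    Revenue
North         120           1,400.50
South West    75            980.00
"""

table = rows_from_text(sample)
csv_text = rows_to_csv(table)
print(csv_text)
```

Running the same two functions over every extracted text file in a folder would cover the bulk case. Tables with empty cells or wrapped lines will break the gap-based split, and would probably need fixed column positions worked out from your specific documents instead.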

Finally, if you are in seriously bad shape and the PDFs don't cooperate, you could do the OCR thing. The right long-term solution is to get the data into the right format at the outset, either by doing a single, massive, painful, and probably partially manual conversion, or by going to the source and suggesting that the data be provided in a more usable form.

If you can give a more specific example PDF file, there may be a better or more precise answer... There is NO general solution to this; if it's possible at all, it will need to be tailored to your specific source data.

Note this rather pointed response to the general question. It doesn't help with the fact that you have the problem in front of you, but maybe it would provide useful top cover when explaining to your boss why there isn't an obvious answer? ;-)

A new SO question popped up, and referred to this library -- iTextSharp -- which looks possibly related. SO question: Best way to extract...