Extracting entire PDF data with Python pdfminer

sunil reddy · Jun 8, 2013 · Viewed 10.8k times

I am using pdfminer to extract data from PDF files with Python. I would like to extract all the data present in a PDF, regardless of whether it is an image, text, or anything else. Can we do that in a single line (or two if needed, without much work)? Any help is appreciated. Thanks in advance.

Answer

alexis · Jun 9, 2013

"Can we do that in a single line (or two if needed, without much work)?"

No, you cannot. Pdfminer is powerful but it's rather low-level.

Unfortunately, the documentation is not exactly exhaustive. I was able to find my way around it thanks to some code by Denis Papathanasiou. The code is discussed in his blog, and you can find the source here: layout_scanner.py

See also this answer, where I give a little more detail.
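
To give a rough idea of what "low-level" means in practice, here is a minimal sketch, not the code from layout_scanner.py. It assumes the pdfminer.six fork and its extract_pages helper, neither of which is mentioned above, and the function names collect_layout and extract_all are made up for illustration. The idea is to walk each page's layout tree and sort what you find into text and image objects:

```python
def collect_layout(elements, texts, images):
    """Recursively sort layout objects into text strings and image objects.

    Duck typing stands in for pdfminer's classes (an assumption of this sketch):
    text objects expose get_text(), image objects expose a .stream attribute,
    and containers such as pages and figures are iterable.
    """
    for element in elements:
        if hasattr(element, "get_text"):
            # Text boxes, text lines and single characters
            texts.append(element.get_text())
        elif hasattr(element, "stream"):
            # Embedded images; the raw bytes live in element.stream
            images.append(element)
        else:
            try:
                children = iter(element)
            except TypeError:
                # Lines, rectangles, curves: nothing to extract here
                continue
            collect_layout(children, texts, images)


def extract_all(pdf_path):
    """Return (texts, images) for every page of the PDF at pdf_path."""
    # Imported here so collect_layout stays usable without pdfminer installed
    from pdfminer.high_level import extract_pages

    texts, images = [], []
    for page in extract_pages(pdf_path):
        collect_layout(page, texts, images)
    return texts, images
```

You would call it as texts, images = extract_all("sample.pdf"), where sample.pdf is a placeholder file name. Note that this only collects the image objects. Getting usable image files out of them is extra work, since the bytes in each stream are still encoded with whatever filter the PDF used, so even this is more than "a line or two".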