How to extract formatted text content from a PDF

hoju · Feb 4, 2010

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

Answer

Etienne · Feb 4, 2010

To extract the text from the PDF and get its position, you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
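As a rough sketch of both approaches, assuming the maintained `pdfminer.six` fork (`pip install pdfminer.six`) rather than the original PDFMiner release: the helper names `text_blocks` and `pdf_to_html` are made up for illustration, while `extract_pages`, `extract_text_to_fp`, `LTTextContainer` and `LAParams` come from the library.

```python
from io import BytesIO


def text_blocks(pdf_path):
    """Yield (page_number, bbox, text) for each text block in the PDF.

    bbox is (x0, y0, x1, y1) in PDF points, origin at the bottom-left
    corner of the page.
    """
    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Skip figures, lines, rectangles and other non-text objects.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def pdf_to_html(pdf_path):
    """Return the PDF rendered as HTML with positioned text."""
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out = BytesIO()
    with open(pdf_path, "rb") as pdf_file:
        extract_text_to_fp(
            pdf_file,
            out,
            output_type="html",
            codec="utf-8",
            laparams=LAParams(),  # enables layout analysis
        )
    return out.getvalue().decode("utf-8")
```

Usage would look something like `for page, bbox, text in text_blocks("input.pdf"): print(page, bbox, text)`, where `input.pdf` is a placeholder path. The same HTML export is also available from the command line via the `pdf2txt.py` script that ships with the package.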

I don't know your use case, but you can run into a lot of problems when doing this, because PDF is presentation-oriented rather than content-oriented: the text flow is not continuous. So if you want the text to be editable, it will not be an easy task.
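To give an idea of what "not continuous" means in practice: text fragments come with page coordinates, not in reading order, so you have to rebuild the flow yourself. Here is a deliberately naive sketch in plain Python. The function name `reading_order`, the `(bbox, text)` input shape and the `line_tolerance` value are all assumptions for illustration, and the heuristic will break on multi-column pages, tables and rotated text.

```python
def reading_order(blocks, line_tolerance=3.0):
    """Rebuild lines of text from positioned fragments.

    blocks: iterable of (bbox, text), where bbox is (x0, y0, x1, y1)
    with the origin at the bottom-left of the page (PDF convention).
    Returns a list of strings, one per reconstructed line.
    """
    # A larger y1 means higher on the page, so sort by top edge, descending.
    ordered = sorted(blocks, key=lambda block: -block[0][3])

    lines = []  # each entry: (top_edge, [(x0, text), ...])
    for bbox, text in ordered:
        x0, top = bbox[0], bbox[3]
        # Fragments whose top edges nearly match belong to the same line.
        if lines and abs(lines[-1][0] - top) <= line_tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append((top, [(x0, text)]))

    # Within a line, read left to right.
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]


fragments = [
    ((300, 700, 400, 712), "world"),
    ((72, 680, 200, 692), "Second line"),
    ((72, 700, 120, 712.5), "Hello"),
]
print(reading_order(fragments))  # ['Hello world', 'Second line']
```

Even this simple case needs a tolerance to decide what counts as "the same line", which is why turning a PDF back into cleanly editable text gets hard quickly.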