how to extract formatted text content from PDF

python pdf text extract google-docs

hoju · Feb 4, 2010 · Viewed 23k times · Source

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

Answer

To extract the text from the PDF and get its position, you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
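Here is a rough sketch of both approaches. It assumes the maintained fork, pdfminer.six (`pip install pdfminer.six`), rather than the original PDFMiner release, whose API was different. The names `TextBox`, `boxes_on_page`, `extract_text_boxes`, and `export_html` are made up for this example, and the file paths are placeholders.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TextBox(NamedTuple):
    """One block of text and its bounding box on a page (hypothetical helper type)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def boxes_on_page(page_number: int, elements: Iterable) -> Iterator[TextBox]:
    """Yield a TextBox for every layout element that carries non-empty text."""
    for element in elements:
        # Duck typing: text containers expose get_text() and a bbox tuple.
        # Elements without get_text() (lines, rectangles, images) are skipped.
        if not hasattr(element, "get_text"):
            continue
        text = element.get_text().strip()
        if text:
            x0, y0, x1, y1 = element.bbox
            yield TextBox(page_number, x0, y0, x1, y1, text)


def extract_text_boxes(pdf_path: str) -> List[TextBox]:
    """Return every text block in the PDF together with its position."""
    # Imported here so boxes_on_page() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    boxes: List[TextBox] = []
    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        boxes.extend(boxes_on_page(page_number, page_layout))
    return boxes


def export_html(pdf_path: str, html_path: str) -> None:
    """Write an HTML rendering of the PDF that keeps text roughly in place."""
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    with open(pdf_path, "rb") as src, open(html_path, "wb") as dst:
        extract_text_to_fp(
            src, dst, output_type="html", laparams=LAParams(), codec="utf-8"
        )
```

Calling `extract_text_boxes("input.pdf")` gives you a flat list you can sort or filter by position. Coordinates are in PDF points with the origin at the bottom-left of the page, so a larger `y` means higher up.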

I don't know your use case, but you can run into a lot of problems doing this, because PDF is presentation-oriented rather than content-oriented: the text flow is not continuous. So if you want the text to be editable, it will not be an easy task.
