Python module for converting PDF to text

python pdf text-extraction pdf-scraping

cnu · Aug 25, 2008 · Viewed 366.7k times

Is there a Python module that converts PDF files into text? I tried a piece of code from ActiveState that uses pypdf, but the generated text had no spaces between words and was unusable.

Answer

Try PDFMiner. It can extract text from PDF files in HTML, SGML or "Tagged PDF" format.

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
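The tag-stripping step could look roughly like the sketch below. It uses only the standard library and a simple regular expression, so it assumes the markup has no stray "<" or ">" characters inside the text itself. The sample string is made up for illustration and is not real PDFMiner output, and `strip_tags` is a hypothetical helper name.

```python
import html
import re

# Matches anything that looks like a tag, e.g. <P>, </Span>, <Figure/>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Remove tag-like spans, then decode entities such as &amp;."""
    text = TAG_RE.sub("", markup)
    return html.unescape(text)


# Hypothetical sample markup, not actual PDFMiner output.
sample = "<P>Hello <Span>world</Span> &amp; more</P>"
print(strip_tags(sample))  # Hello world & more
```

For messy or deeply nested markup, a real XML or HTML parser would probably be more robust than a regular expression.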

A Python 3 version, pdfminer.six, is available at:

https://github.com/pdfminer/pdfminer.six
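With pdfminer.six installed (`pip install pdfminer.six`), plain text can also be pulled out directly from Python. The following is a minimal sketch, assuming the `extract_text` helper in `pdfminer.high_level`; the wrapper name `pdf_to_text` and the file name in the usage note are placeholders, not part of the library.

```python
def pdf_to_text(path: str) -> str:
    """Return the text of the PDF at `path` as a single string.

    pdfminer.six is imported inside the function so that this module
    can still be loaded on a machine where the package is missing.
    """
    from pdfminer.high_level import extract_text

    # extract_text opens the file, runs layout analysis and returns
    # the text of all pages concatenated.
    return extract_text(path)
```

Calling `print(pdf_to_text("example.pdf"))` would then print the extracted text, where "example.pdf" stands for whatever file you want to convert. How well word spacing comes out depends on how the PDF was produced.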
