Python module for converting PDF to text

cnu · Aug 25, 2008 · Viewed 366.7k times

Is there a Python module that converts PDF files to text? I tried a piece of code from ActiveState that uses pypdf, but the generated text had no spaces between words and was of no use.

Answer

David Crow · Aug 25, 2008

Try PDFMiner. It can extract text from PDF files in HTML, SGML, or "Tagged PDF" format.

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
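A minimal sketch of the tag-stripping step, using only the standard library. The helper name `strip_tags` and the sample markup are made up for illustration; they are not part of PDFMiner, and real tagged output may look different. The sketch assumes the markup contains no literal `>` inside attribute values.

```python
import html
import re

# Matches anything that looks like an XML/HTML tag, e.g. <P> or </Span>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Remove tag-like spans, decode entities, and tidy leftover spaces.

    Hypothetical helper for illustration only.
    """
    text = TAG_RE.sub("", markup)          # drop the tags themselves
    text = html.unescape(text)             # turn &amp; into &, etc.
    return re.sub(r"[ \t]+", " ", text).strip()  # collapse runs of spaces/tabs


# Invented sample input, not actual PDFMiner output.
sample = "<Document><P>Hello <Span>PDF</Span> world &amp; more</P></Document>"
print(strip_tags(sample))  # Hello PDF world & more
```

A regular expression is used here rather than an XML parser so that the sketch still works if the extracted markup is not perfectly well-formed.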

A Python 3 version is available under: