Python pdftotext ShellError Using textract

bsheehy · Apr 8, 2015 · Viewed 8.8k times

When I run the below Python script on a directory that contains a PDF file, I keep getting this error:

ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1
------------- stdout -------------
------------- stderr -------------
'pdftotext' is not recognized as an internal or external command, operable program or batch file.

I have verified that pdf2text and PDFMiner are installed properly. This is my first time using textract and it works great on all other file types (Word docs, PowerPoint docs, Excel docs, etc.). Why is the process calling pdftotext when pdf2text is the actual library?

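import os
import textract

pdf_path = 'path/to/pdf/'

for fname in os.listdir(pdf_path):
    full_path = os.path.join(pdf_path, fname)
    # Skip subdirectories; only regular files are passed to textract
    if os.path.isfile(full_path):
        # textract.process returns the extracted text as a byte string
        text = textract.process(full_path)
        if b'string' in text:
            print(fname)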

Thanks!

Answer

AGerdom · Jul 2, 2015

I just finished dealing with this issue myself. From what I understand, the confusion is that pdftotext is a command-line utility that is popular on Linux, whereas pdf2text is a wrapper for the PDFMiner package. My Windows binary for poppler and pdftotext is from an archive.org link, so I don't feel right linking to it here, but I found a link to a Windows binary on the Wikipedia page. From what I've been able to tell, pdftotext tends to give better output than PDFMiner. The issue that was generating the same error for me was that, although pdftotext.exe was installed and on my PATH, I would still get the error unless I started the Python script from the command line.
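One way to check whether the PATH is the culprit is to ask Python itself where it sees pdftotext. The sketch below is only an illustration, assuming Python 3.3+ (for shutil.which); the helper names find_tool and choose_pdf_method are made up for this example. The fallback relies on the method argument that textract's documentation describes for PDFs (for example method='pdfminer'), so treat that part as an assumption to verify against your installed version.

```python
import os
import shutil


def find_tool(name="pdftotext"):
    """Return the full path of `name` as seen by this Python process, or None."""
    return shutil.which(name)


def choose_pdf_method(tool="pdftotext"):
    """Pick 'pdftotext' when the tool is on PATH, otherwise 'pdfminer'."""
    return "pdftotext" if find_tool(tool) else "pdfminer"


if __name__ == "__main__":
    location = find_tool()
    if location:
        print("pdftotext found at: " + location)
    else:
        # Show the PATH entries this process actually inherited
        print("pdftotext is not on the PATH this process sees:")
        for entry in os.environ.get("PATH", "").split(os.pathsep):
            print("  " + entry)
    # Assumed usage with textract (not run here):
    #   textract.process(path, method=choose_pdf_method())
```

If the script reports a location from a command prompt but not from your IDE or file association, the two environments are inheriting different PATH values.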

If you end up downloading it, it comes with some other nice utilities like pdftohtml and pdftops. My personal favorite, though, is the command pdftotext -layout whatever.pdf - (the trailing dash sends the output to stdout), which prints a PDF as plain text with everything in place.
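The -layout invocation can also be driven directly from Python, bypassing textract. This is a rough sketch rather than a recommended replacement: it assumes Python 3.3+ and that pdftotext is on the PATH, and the function names build_pdftotext_command and pdf_to_text are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")
    command.extend([pdf_path, "-"])
    return command


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return its output as bytes."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    return subprocess.check_output(build_pdftotext_command(pdf_path, layout))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.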

tl;dr: Try opening a command prompt and running the program from there. If you still get the error, you might (1) install a Windows binary (assuming you're on Windows) or (2) try updating textract with

pip install textract --upgrade

Hopefully that helps!