When I run the below Python script on a directory that contains a PDF file, I keep getting this error:
ShellError: The command
pdftotext "path/to/pdf/title.pdf" -
failed with exit code 1
------------- stdout -------------
------------- stderr -------------
'pdftotext' is not recognized as an internal or external command, operable program or batch file.
I have verified that pdf2text and PDFMiner are installed properly. This is my first time using textract, and it works great on all other file types (Word docs, PowerPoint docs, Excel docs, etc.). Why is the process calling pdftotext when pdf2text is the actual library?
import os
import textract

pdf_path = 'path/to/pdf/'

for fname in os.listdir(pdf_path):
    full_path = os.path.join(pdf_path, fname)
    if os.path.isfile(full_path):
        # textract.process returns the extracted text as a byte string
        text = textract.process(full_path)
        if b'string' in text:
            print(fname)
Thanks!
I just finished dealing with this issue myself. From what I understand, the confusion is that pdftotext is a command-line utility that is popular on Linux, whereas pdf2text is a wrapper for the PDFMiner package. As your error message shows, textract is shelling out to pdftotext for the PDF, so that executable has to be reachable from the process running your script. My Windows binary for poppler and pdftotext is from an archive.org link, so I don't feel right linking to it here, but here's a link I found on the Wikipedia page for a Windows binary. From what I've been able to tell, pdftotext tends to give better output than PDFMiner. The issue that was generating the same error you are receiving was that pdftotext.exe was installed and on my PATH, but I would still get the error unless I started the Python script from the command line.
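If you want to check whether this is what's happening to you, here is a minimal sketch that asks Python itself whether it can see pdftotext. The helper names locate_tool and describe_path are made up for illustration, and shutil.which assumes Python 3.3 or newer. Run it the same way you normally launch your script (IDE, double-click, etc.) and then again from a command prompt, and compare the results.

```python
import os
import shutil


def locate_tool(name="pdftotext"):
    """Return the full path of `name` as this Python process sees it, or None."""
    # shutil.which searches the PATH of the *current* process, which can
    # differ from the PATH you see in a freshly opened command prompt.
    return shutil.which(name)


def describe_path():
    """Return the non-empty PATH directories visible to this Python process."""
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]


if __name__ == "__main__":
    found = locate_tool()
    if found:
        print("pdftotext found at: " + found)
    else:
        print("pdftotext not found. Directories searched:")
        for directory in describe_path():
            print("  " + directory)
```

If the folder containing pdftotext.exe is missing from the list in one environment but present in the other, the PATH seen by that launcher is the likely culprit rather than textract itself.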
If you end up downloading it, it comes with some other nice utilities like pdftohtml and pdftops. My personal favorite, though, is pdftotext -layout whatever.pdf - which prints a PDF to stdout as plain text with the original layout preserved (the trailing - is what sends the output to stdout instead of a .txt file).
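In case you'd rather call pdftotext directly instead of going through textract, this is roughly how I'd wrap that command with subprocess. It's only a sketch: layout_command and pdf_to_text are hypothetical helper names, whatever.pdf is a placeholder, and it assumes pdftotext is reachable on the PATH of the Python process.

```python
import subprocess


def layout_command(pdf_file):
    """Build the argument list for `pdftotext -layout <pdf_file> -`."""
    # The final "-" tells pdftotext to write the text to stdout.
    return ["pdftotext", "-layout", pdf_file, "-"]


def pdf_to_text(pdf_file):
    """Run pdftotext and return its stdout as a byte string."""
    # check_output raises CalledProcessError on a non-zero exit code and
    # OSError if the pdftotext executable cannot be found at all.
    return subprocess.check_output(layout_command(pdf_file))
```

Calling pdf_to_text("whatever.pdf") should give you the same bytes you'd see on the console. Passing the arguments as a list avoids quoting problems with paths that contain spaces.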
tl;dr: Try opening a command prompt and running the script from there. If you still get the error, you might try (1) installing a Windows binary of pdftotext (assuming you're on Windows) or (2) updating textract with
pip install textract --upgrade
Hopefully that helps!