How to use pdfminer.six's pdf2txt.py in a Python script instead of from the command line?

Ashley Liu · Sep 20, 2018 · Viewed 10.9k times

I know how to use pdfminer.six's pdf2txt.py tool from the command line; however, I have many PDF files to convert to txt files and I can't just do them one by one on the command line. I haven't found out how to use this tool from an actual Python script. Any ideas?

Answer

pseudoku · Sep 20, 2018

The good news is that you can use the pdfminer.six library directly to reproduce any of the options you would pass to pdf2txt.py on the command line. Below is a basic example that I use:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_text(path):
    """Return the text of the PDF at `path` as a str."""
    manager = PDFResourceManager()
    # Collect output in a text buffer so getvalue() returns str rather than bytes
    retstr = StringIO()
    layout = LAParams(all_texts=True)  # also run layout analysis on text inside figures
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)

    # The with block closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


if __name__ == "__main__":
    text = pdf_to_text("yourfile.pdf")
    print(text)
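
Since the question is about converting many files, here is a minimal sketch of a batch driver built only on the standard library. The name convert_folder and its parameters are made up for illustration; `extract` is assumed to be any function that takes a PDF path and returns its text, such as pdf_to_text above.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, extract):
    """Write one .txt file per *.pdf in src_dir and return the paths written.

    `extract` is a caller-supplied function (hypothetical here) that maps
    a PDF path to its text.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    # Note: the "*.pdf" pattern is case-sensitive on most non-Windows systems
    for pdf in sorted(src.glob("*.pdf")):
        try:
            text = extract(str(pdf))
            # Tolerate extractors that hand back bytes instead of str
            if isinstance(text, bytes):
                text = text.decode("utf-8", "ignore")
        except Exception as exc:
            # One unreadable PDF should not stop the whole batch
            print(f"skipped {pdf.name}: {exc}")
            continue
        out = dst / (pdf.stem + ".txt")
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Calling convert_folder("pdfs", "txt", pdf_to_text) would then write txt/report.txt for pdfs/report.pdf, and so on; the folder names are placeholders.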

If you need to restrict the conversion to certain pages or supply a password, those are optional parameters of PDFPage.get_pages (pagenos and password). Likewise, if you need to change layout settings such as all_texts or the margin values (char_margin, line_margin, word_margin), these are optional arguments to the LAParams initializer.
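
As a rough sketch of how those options might be wired in: the function and parameter names below (to_pagenos, pdf_to_text_with_options, pages) are invented for illustration, while pagenos, password, check_extractable, all_texts and char_margin are the pdfminer.six keyword names. It assumes a pdfminer.six version that accepts a StringIO output buffer, and that pagenos counts pages from zero, which is why the helper subtracts one.

```python
from io import StringIO


def to_pagenos(one_based_pages):
    """Convert page numbers as a reader counts them (1, 2, ...) to a zero-based set."""
    return {p - 1 for p in one_based_pages}


def pdf_to_text_with_options(path, pages=None, password="", char_margin=2.0):
    # Imported inside the function so this sketch can be loaded
    # even on a machine where pdfminer.six is not installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    buffer = StringIO()
    layout = LAParams(all_texts=True, char_margin=char_margin)
    device = TextConverter(manager, buffer, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)

    # An empty or missing page list means "all pages"
    pagenos = to_pagenos(pages) if pages else None
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp, pagenos=pagenos, password=password,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = buffer.getvalue()
    device.close()
    return text
```

For example, pdf_to_text_with_options("yourfile.pdf", pages=[1, 3], password="secret") would extract only the first and third pages of an encrypted file; the file name and password are placeholders.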