I know how to use pdfminer.six's pdf2txt.py tool on the command line; however, I have many PDF files to convert to txt files, and I can't do them one by one on the command line. I haven't found out how to use this tool from an actual Python script. Any ideas?
The good news is that you can use the PDFMiner library to reproduce any options you might pass to pdf2txt.py on the command line. Below is a basic example I use:
from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)
    # Open the PDF in binary mode and feed each page to the interpreter.
    with open(path, 'rb') as filepath:
        for page in PDFPage.get_pages(filepath, check_extractable=True):
            interpreter.process_page(page)
    # TextConverter writes encoded bytes into a BytesIO buffer
    # (UTF-8 by default), so decode before returning.
    text = retstr.getvalue().decode('utf-8')
    device.close()
    retstr.close()
    return text


if __name__ == "__main__":
    text = pdf_to_text("yourfile.pdf")
    print(text)
If you need to restrict the page numbers or supply a password, those are optional parameters of PDFPage.get_pages (pagenos and password). Likewise, if you need to make layout changes such as all_texts or the margin sizes (char_margin, line_margin, word_margin), those are optional arguments to the LAParams initializer.
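To cover the many-files part of the question, a function like the one above can be driven by a loop over a folder. This is only a minimal sketch and not part of pdfminer: convert_folder, src_dir, dst_dir and convert are names made up for illustration, and it assumes convert is a callable such as pdf_to_text that takes a file path and returns the extracted text (str, or UTF-8 bytes).

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, convert):
    """Call convert(path) for every *.pdf in src_dir and write one .txt each.

    convert is a hypothetical stand-in for any path -> text function,
    for example the pdf_to_text function from the answer above.
    Returns the list of .txt paths that were written.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() only makes the processing order predictable.
    for pdf_path in sorted(src.glob("*.pdf")):
        try:
            text = convert(str(pdf_path))
        except Exception as exc:
            # One unreadable PDF should not stop the whole batch.
            print(f"skipped {pdf_path.name}: {exc}")
            continue
        # Accept converters that return raw UTF-8 bytes as well as str.
        if isinstance(text, bytes):
            text = text.decode("utf-8")
        out_path = dst / (pdf_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

For example, convert_folder("pdfs", "txt", pdf_to_text) would write txt/report.txt for a file pdfs/report.pdf; the folder and file names here are placeholders. Note that the "*.pdf" pattern only looks at the top level of the folder and may miss files named .PDF on case-sensitive file systems.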