pdfminer3k has no method named create_pages in PDFPage

Jack_of_All_Trades · Oct 16, 2014 · Viewed 8.7k times

Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have changed everything. Their change logs do not reflect the changes they have made, and I have had no success parsing PDFs with pdfminer3k. For example:

They have moved PDFDocument into pdfparser (sorry if I have misspelled that). PDFPage used to have a create_pages method, which is gone now. All I can see inside PDFPage are internal methods. Does anybody have a working example of pdfminer3k? There seems to be no new documentation reflecting any of the changes.

Answer

CPB · Jan 2, 2015

If you are interested in reading text from a PDF file, the following code works with pdfminer3k on Python 3.4. Note that pages come from doc.get_pages() here, not from PDFPage.create_pages.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

# Open the PDF in binary mode; the context manager closes it when done.
with open('file.pdf', 'rb') as fp:
    # Link the parser and the document object to each other.
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')  # empty password
    # Set up the layout-analysis pipeline.
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                print(lt_obj.get_text())
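
If you would rather get the text back as a string than print it, the same steps can be wrapped in a function. This is only a sketch built on the calls shown above: the names iter_text and extract_text are my own (hypothetical, not part of pdfminer3k), and the pdfminer imports sit inside the function purely so that the filtering helper can be tried without the library installed.

```python
def iter_text(layout, text_types):
    """Yield get_text() for each layout object that is one of text_types.

    iter_text is a hypothetical helper name, not a pdfminer3k function.
    """
    for lt_obj in layout:
        if isinstance(lt_obj, text_types):
            yield lt_obj.get_text()


def extract_text(path, password=''):
    """Return the text of the PDF at `path` as one string.

    extract_text is a hypothetical wrapper around the pdfminer3k calls
    used in the answer above; it assumes pdfminer3k is installed.
    """
    # Imported lazily so this module loads even where pdfminer3k is absent.
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    chunks = []
    with open(path, 'rb') as fp:
        # Link the parser and the document object to each other.
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize(password)
        # Set up the layout-analysis pipeline.
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Collect the text objects of every page, in page order.
        for page in doc.get_pages():
            interpreter.process_page(page)
            layout = device.get_result()
            chunks.extend(iter_text(layout, (LTTextBox, LTTextLine)))
    return ''.join(chunks)
```

Calling text = extract_text('file.pdf') should then give you the whole document's text in one string; pass password='...' for an encrypted file.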