Python pdfminer extract image produces multiple images per page (should be single image)

Erik · Jul 12, 2016

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin).

I am able to parse the text out from page 1, but when I try to get the images I get 3 images per image page. I cannot determine the image type, which makes saving them difficult. Additionally, trying to save each page's 3 pictures as a single image produces no usable result (the file cannot be opened via Finder on OS X).

Sample:

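from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTImage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def find_images_in_thing(outer_layout):
    # Look one level down: images are often wrapped in an LTFigure container.
    for thing in outer_layout:
        if isinstance(thing, LTImage):
            save_image(thing)  # save_image is my own helper, described below


fp = open('the_file.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    pdf_item = device.get_result()  # layout tree for the page just processed
    for thing in pdf_item:
        if isinstance(thing, LTImage):
            save_image(thing)
        elif isinstance(thing, LTFigure):
            find_images_in_thing(thing)

fp.close()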

save_image either writes a file per image in pageNum_imgNum format in 'wb' mode or a single image per page in 'a' mode. I have tried numerous file extensions with no luck.
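One way to pin down the image type without guessing extensions is to look at the first bytes of each image's stream. The snippet below is only a minimal sketch: guess_extension and MAGIC_NUMBERS are hypothetical names, and it assumes you can get the raw, still-encoded bytes of each image (with pdfminer, probably something like thing.stream.get_rawdata(), which I have not verified against every version). The signatures themselves are the standard ones for each format.

```python
# Sketch: guess a file extension from the leading bytes of an image stream.
# Assumption: `data` holds the raw, still-encoded bytes of ONE image.

MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", ".jpg"),                    # JPEG (PDF filter DCTDecode)
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", ".jp2"),  # JPEG 2000 file (JPXDecode)
    (b"\xff\x4f\xff\x51", ".j2k"),                # bare JPEG 2000 codestream
    (b"\x89PNG\r\n\x1a\n", ".png"),               # PNG
    (b"II*\x00", ".tif"),                         # TIFF, little-endian
    (b"MM\x00*", ".tif"),                         # TIFF, big-endian
]


def guess_extension(data, default=".bin"):
    """Return an extension for a known signature, else `default`."""
    for magic, extension in MAGIC_NUMBERS:
        if data.startswith(magic):
            return extension
    return default
```

If this returns the default for every stream, the images are likely stored as bare pixel data (for example behind FlateDecode) or as CCITT fax data, neither of which has a file header, so no extension will make the bytes openable on their own. Appending three encoded images to one file in 'a' mode also does not produce one valid image; the pieces have to be decoded and composed.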

Resources I've looked into:

http://denis.papathanasiou.org/posts/2010.08.04.post.html (outdated pdfminer version)
http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Answer

Nikhil Shinday · Aug 23, 2017

It's been a while since this question was asked, but I'll contribute for the sake of the community, and potentially for your benefit :)

I've been using an image extractor called pdfimages, which ships with the Poppler PDF utilities. It also outputs several files per image. It seems to be relatively common for PDF generators to 'tile' or 'strip' an image into multiple images, which appear entirely intact when viewing the PDF but need to be pieced back together when scraping. The formats/file extensions that I have seen through pdfimages and elsewhere are: png, tiff, jp2, jpg, ccitt. Have you tried all of those?
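To illustrate the "piecing together" step, here is a rough sketch that stacks horizontal strips back into one image using only the standard library. It assumes the strips were saved as binary PPM (P6) files, which is what pdfimages writes by default for images that are not JPEG as far as I know, that all strips have the same width, and that they are passed in top-to-bottom order. read_ppm and stack_ppm_strips are hypothetical helper names, not part of Poppler or pdfminer.

```python
def read_ppm(data):
    """Parse a binary PPM (P6) file; return (width, height, maxval, pixels)."""
    fields = []
    pos = 0
    while len(fields) < 4:
        # Skip whitespace and '#' comment lines between header fields.
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            pos = data.index(b"\n", pos) + 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        fields.append(data[start:pos])
    if fields[0] != b"P6":
        raise ValueError("not a binary PPM (P6) file")
    width, height, maxval = (int(field) for field in fields[1:])
    bytes_per_sample = 1 if maxval < 256 else 2
    size = width * height * 3 * bytes_per_sample
    # Exactly one whitespace byte separates the header from the pixels.
    return width, height, maxval, data[pos + 1:pos + 1 + size]


def stack_ppm_strips(strips):
    """Stack P6 images vertically (first strip on top); return P6 bytes."""
    if not strips:
        raise ValueError("no strips given")
    parsed = [read_ppm(strip) for strip in strips]
    width, _, maxval, _ = parsed[0]
    if any(p[0] != width or p[2] != maxval for p in parsed):
        raise ValueError("strips differ in width or maxval")
    total_height = sum(p[1] for p in parsed)
    header = b"P6\n%d %d\n%d\n" % (width, total_height, maxval)
    return header + b"".join(p[3] for p in parsed)
```

Usage would be along the lines of reading the strip files for one page in order, calling stack_ppm_strips on their contents, and writing the result to a single .ppm file in 'wb' mode. Strips in other formats (jpg, jp2, ccitt) would need to be decoded first, for example with an imaging library.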