How to convert multipage PDF to list of image objects in Python?

Hendrik · Mar 28, 2017

I'd like to turn a multipage PDF document into a list of image objects in Python without saving the images to disk (I'd like to process them with PIL Image). So far, I can only do this by writing the images to files first:

from wand.image import Image

with Image(filename='source.pdf') as img:

    with img.convert('png') as converted:
        converted.save(filename='pyout/page.png')

But how can I turn the img objects above directly into a list of PIL.Image objects?

Answer

Bryant Kou · Jul 21, 2017

new answer:

pip install pdf2image

from pdf2image import convert_from_path, convert_from_bytes
images = convert_from_path('/path/to/my.pdf')

You may need to install Pillow as well. pdf2image relies on the poppler command-line tools, so those must be installed too. This might only work on Linux.

https://github.com/Belval/pdf2image

Results may differ between this method and the old answer below.
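If the PDF is already in memory, the convert_from_bytes function imported above avoids touching the disk for the input as well. A minimal sketch, assuming pdf2image and poppler are installed; the helper name pdf_bytes_to_images and the 200 DPI default are illustrative choices, not part of the library:

```python
def pdf_bytes_to_images(pdf_bytes, dpi=200):
    """Return a list with one PIL image per page of an in-memory PDF."""
    # Imported lazily so the helper can be defined even where pdf2image
    # is not installed; calling it still requires pdf2image and poppler.
    from pdf2image import convert_from_bytes

    return convert_from_bytes(pdf_bytes, dpi=dpi)


# Example usage (the path is a placeholder):
# with open("/path/to/my.pdf", "rb") as f:
#     images = pdf_bytes_to_images(f.read())
# first_page = images[0]
```

Each element of the returned list is a regular PIL image, so it can be passed straight to any PIL-based processing.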

old answer:

Python 3.4:

from PIL import Image
from wand.image import Image as wimage
import os
import io

if __name__ == "__main__":
    filepath = "fill this in"
    assert os.path.exists(filepath)
    page_images = []
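    # Render every page of the PDF at 200 DPI.
    with wimage(filename=filepath, resolution=200) as img:
        for page_wand_image_seq in img.sequence:
            # Copy the single page into its own wand image.
            with wimage(image=page_wand_image_seq) as page_wand_image:
                # Encode the page as JPEG in memory (no temporary file).
                page_jpeg_bytes = page_wand_image.make_blob(format="jpeg")
            # Let PIL decode the in-memory JPEG.
            page_jpeg_data = io.BytesIO(page_jpeg_bytes)
            page_image = Image.open(page_jpeg_data)
            page_images.append(page_image)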
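The last steps of the loop above (encoded bytes, then io.BytesIO, then Image.open) can be pulled into a small helper that does not depend on wand. This is only a sketch: blobs_to_pil_images and _demo_blob are hypothetical names, and the demo builds tiny PNG blobs with Pillow in place of real make_blob output.

```python
import io

from PIL import Image


def blobs_to_pil_images(blobs):
    """Decode an iterable of encoded image bytes into a list of PIL images."""
    images = []
    for blob in blobs:
        image = Image.open(io.BytesIO(blob))
        image.load()  # Image.open is lazy; force decoding right away
        images.append(image)
    return images


def _demo_blob(size=(4, 3)):
    """Stand-in for make_blob(): encode a blank image as PNG bytes."""
    buffer = io.BytesIO()
    Image.new("RGB", size, "white").save(buffer, format="PNG")
    return buffer.getvalue()


pages = blobs_to_pil_images([_demo_blob(), _demo_blob(size=(8, 6))])
print([page.size for page in pages])  # [(4, 3), (8, 6)]
```

Since JPEG is lossy, passing format="png" to make_blob may be preferable when the pages need to be preserved exactly.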

Lastly, you can make a system call to mogrify, but that can be more complicated as you need to manage temporary files.
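A rough sketch of the mogrify route, with the temporary files handled by tempfile. It assumes ImageMagick's mogrify is on the PATH (on ImageMagick 7 the command may be "magick mogrify" instead) and that multipage output is named with a trailing page index such as source-0.png; the function names are hypothetical.

```python
import glob
import os
import re
import subprocess
import tempfile

from PIL import Image


def build_mogrify_command(pdf_path, out_dir, density=200):
    """Build the mogrify argument list (passed without a shell)."""
    return ["mogrify", "-density", str(density), "-format", "png",
            "-path", out_dir, pdf_path]


def page_number(path):
    """Extract the trailing page index from names like 'doc-12.png'."""
    match = re.search(r"-(\d+)\.png$", path)
    return int(match.group(1)) if match else 0


def pdf_to_images_via_mogrify(pdf_path, density=200):
    """Convert a PDF to PIL images via mogrify and a temporary directory."""
    images = []
    with tempfile.TemporaryDirectory() as out_dir:
        subprocess.run(build_mogrify_command(pdf_path, out_dir, density),
                       check=True)
        # Sort numerically so page 10 does not come before page 2.
        paths = sorted(glob.glob(os.path.join(out_dir, "*.png")),
                       key=page_number)
        for path in paths:
            with Image.open(path) as image:
                # copy() loads the pixels before the temp file is removed.
                images.append(image.copy())
    return images
```

The temporary directory is deleted when the with block exits, which is why each page is copied into memory first.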