Extract images from PDF using python PyPDF2

Max Kamenkov picture Max Kamenkov · Dec 2, 2013 · Viewed 13.8k times · Source

Is there any way to extract images as stream from pdf document (using PyPDF2 library)? Also is it possible to replace some images to another (generated with PIL for example or loaded from file)?

I'm able to get EncodedStreamObject from pdf objects tree and get encoded stream (by calling getData() method), but looks like it just raw content w/o any image headers and other meta information.

>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')

I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.

Any code samples and links will be very helpful.

Answer

jainam shah picture jainam shah · May 30, 2019
import fitz
doc = fitz.open(filePath)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None