Extract images from PDF using python PyPDF2

Question 1

Extract images from PDF using python PyPDF2

python pdf image-processing reportlab pypdf

Max Kamenkov · Dec 2, 2013 · Viewed 13.8k times · Source

Answer

Answer

import fitz
doc = fitz.open(filePath)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

Question 2

Is there any way to extract images as stream from pdf document (using PyPDF2 library)? Also is it possible to replace some images to another (generated with PIL for example or loaded from file)?

I'm able to get EncodedStreamObject from pdf objects tree and get encoded stream (by calling getData() method), but looks like it just raw content w/o any image headers and other meta information.

>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')

I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.

Any code samples and links will be very helpful.

Extract images from PDF using python PyPDF2

Answer

Related questions