How might one extract all images from a pdf document, at native resolution and format? (Meaning extract tiff as tiff, jpeg as jpeg, etc. and without resampling). Layout is unimportant, I don't care were the source image is located on the page.
I'm using python 2.7 but can use 3.x if required.
You can use the module PyMuPDF. This outputs all images as .png files, but worked out of the box and is fast.
import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
for img in doc.getPageImageList(i):
xref = img[0]
pix = fitz.Pixmap(doc, xref)
if pix.n < 5: # this is GRAY or RGB
pix.writePNG("p%s-%s.png" % (i, xref))
else: # CMYK: convert to RGB first
pix1 = fitz.Pixmap(fitz.csRGB, pix)
pix1.writePNG("p%s-%s.png" % (i, xref))
pix1 = None
pix = None