extract images from pdf using pdfbox

Pradyut Bhattacharya picture Pradyut Bhattacharya · Jan 2, 2012 · Viewed 43.1k times · Source

I m trying to extract images from a pdf using pdfbox. The example pdf here

But i m getting blank images only.

The code i m trying:-

public static void main(String[] args) {
   PDFImageExtract obj = new PDFImageExtract();
    try {
        obj.read_pdf();
    } catch (IOException ex) {
        System.out.println("" + ex);
    }

}

 void read_pdf() throws IOException {
    PDDocument document = null; 
    try {
        document = PDDocument.load("C:\\Users\\Pradyut\\Documents\\MCS-034.pdf");
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator(); 
    int i =1;
    String name = null;

    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map pageImages = resources.getImages();
        if (pageImages != null) { 
            Iterator imageIter = pageImages.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = (String) imageIter.next();
                PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
                image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
                i ++;
            }
        }
    }

}

Thanks

Answer

Matt picture Matt · Jun 6, 2016

Here is code using PDFBox 2.0.1 that will get a list of all images from the PDF. This is different than the other code in that it will recurse through the document instead of trying to get the images from the top level.

public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException {
        List<RenderedImage> images = new ArrayList<>();
    for (PDPage page : document.getPages()) {
        images.addAll(getImagesFromResources(page.getResources()));
    }

    return images;
}

private List<RenderedImage> getImagesFromResources(PDResources resources) throws IOException {
    List<RenderedImage> images = new ArrayList<>();

    for (COSName xObjectName : resources.getXObjectNames()) {
        PDXObject xObject = resources.getXObject(xObjectName);

        if (xObject instanceof PDFormXObject) {
            images.addAll(getImagesFromResources(((PDFormXObject) xObject).getResources()));
        } else if (xObject instanceof PDImageXObject) {
            images.add(((PDImageXObject) xObject).getImage());
        }
    }

    return images;
}