I am working with some very large PDFs, some over 7 GB in size. The PDFs have up to 20,000 pages and many full-page color images. I'd like to use PDFBox to work with them, but because of their size I get an OutOfMemoryError when I attempt to open them.
I'm using pdfbox-app-1.6.0 on Windows 7 with IntelliJ and Java 6.
First I tried writing a simple program that just opened the PDF in a PDDocument and copied each page over to another PDDocument: http://ideone.com/arKhB
Next I tried using the PDFBox CopyDoc example.
Both examples run out of memory.
I'm assuming this is because PDFBox is trying to read the whole document into memory. Is there a way to have it only open 1 page at a time? I know it would be slower processing, but at the moment I can't process anything.
In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering so that only temporary files are used (no main memory), with no restriction on their size.
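As a rough illustration of how this could be applied to the page-copying case from the question, here is a minimal sketch assuming PDFBox 2.0.x is on the classpath. The class name `LowMemoryCopy` and the method `copyPages` are hypothetical names chosen for this example. Both the source and the target document are set up to buffer in temporary files.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

public class LowMemoryCopy {

    /**
     * Copies pages [from, toExclusive) of the source PDF into a new file.
     * Hypothetical helper: page indices are zero-based.
     */
    static void copyPages(File source, File target, int from, int toExclusive)
            throws IOException {
        // Buffer both documents in temporary files instead of main memory.
        try (PDDocument in = PDDocument.load(source, MemoryUsageSetting.setupTempFileOnly());
             PDDocument out = new PDDocument(MemoryUsageSetting.setupTempFileOnly())) {
            int end = Math.min(toExclusive, in.getNumberOfPages());
            for (int i = from; i < end; i++) {
                out.importPage(in.getPage(i));
            }
            // Save while the source is still open, since imported pages
            // may still refer to its data.
            out.save(target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Example: copy the first 100 pages of args[0] into args[1].
        copyPages(new File(args[0]), new File(args[1]), 0, 100);
    }
}
```

Copying a very large document in several smaller ranges, each into its own output file, may keep the working set smaller than copying everything in one pass, at the cost of more temporary file I/O.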
Update 17.4.2018: More tricks to save memory are described in the FAQ. Not yet described there, but available since 2.0.9, is subsampling (skipping pixel rows and columns) when rendering, enabled with PDFRenderer.setSubsamplingAllowed(true). This saves memory for PDF files that contain huge images.
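The subsampling option could be combined with temp-file buffering roughly as follows. This is a sketch assuming PDFBox 2.0.9 or later; the class name `SubsampledRender`, the method `renderPage`, and the choice of PNG output are illustrative assumptions, not part of the original answer.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class SubsampledRender {

    /** Renders one zero-based page to a PNG file (hypothetical helper). */
    static void renderPage(File pdf, int pageIndex, float dpi, File png)
            throws IOException {
        try (PDDocument doc = PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly())) {
            PDFRenderer renderer = new PDFRenderer(doc);
            // Let PDFBox decode large images at reduced resolution when the
            // output size does not need every source pixel.
            renderer.setSubsamplingAllowed(true);
            BufferedImage image = renderer.renderImageWithDPI(pageIndex, dpi);
            ImageIO.write(image, "png", png);
        }
    }

    public static void main(String[] args) throws IOException {
        // Example: render the first page of args[0] at 72 DPI into args[1].
        renderPage(new File(args[0]), 0, 72f, new File(args[1]));
    }
}
```

A low DPI value keeps the rendered image itself small, which matters as much as subsampling when pages contain full-page images.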