At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine the order it should go in (e.g. if someone split up a book into 20 10-page PDFs and I want to put them back together).
I have two questions - 1.) I know that sometimes the page number is stored in the document data somewhere, as I've seen PDFs that render on Adobe as something like [1243] (10 of 150), but I've read documents of this sort into pyPDF and I can't find any information indicating the page number - where is this stored?
2.) If avenue #1 isn't available, I think I could iterate through the objects on a given page to try to find a page number - likely it would be its own object that has a single number in it. However, I can't seem to find any clear way to determine the contents of objects. If I run:
pdf.getPage(0).getContents()
This usually either returns:
{'/Filter': '/FlateDecode'}
or it returns a list of IndirectObject(num, num) objects. I don't really know what to do with either of these and there's no real documentation on it as far as I can tell. Is anyone familiar with this kind of thing that could point me in the right direction?
The following worked for me:
from PyPDF2 import PdfFileReader
pdf = PdfFileReader(open('path/to/file.pdf','rb'))
pdf.getNumPages()