I have a large number of PDF files. Some of them are scanned images saved as PDF, and some are full or partial text PDFs.
Is there a way to check these files so that we only process the ones that are scanned images, and skip those that are full or partial text PDFs?
Environment: Python 3.6
The code below works on both searchable and non-searchable PDFs and extracts whatever text they contain.
import fitz  # provided by the PyMuPDF package

text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
    # Newer PyMuPDF releases name this method page.get_text()
    text += page.getText()
If you don't have the fitz module, you need to install it:
pip install --upgrade pymupdf
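To get from text extraction to the check the question asks for, one option is to treat a file as scanned when none of its pages yields any extractable text. The following is only a sketch under that assumption: the names looks_scanned, pdf_looks_scanned and min_chars are made up for illustration, and a scanned PDF that already carries an OCR text layer would be reported as a text PDF.

```python
def looks_scanned(page_texts, min_chars=1):
    """Return True if no page has at least min_chars non-whitespace characters.

    page_texts is an iterable with one extracted-text string per page.
    """
    return all(len("".join(t.split())) < min_chars for t in page_texts)


def pdf_looks_scanned(path, min_chars=1):
    """Open a PDF with PyMuPDF and apply looks_scanned() to its pages."""
    import fitz  # imported here so looks_scanned() is usable without PyMuPDF

    doc = fitz.open(path)
    try:
        texts = []
        for page in doc:
            # Use get_text() when available, otherwise fall back to getText()
            extract = getattr(page, "get_text", None) or page.getText
            texts.append(extract())
        return looks_scanned(texts, min_chars)
    finally:
        doc.close()
```

Calling pdf_looks_scanned("Your_scanned_or_partial_scanned.pdf") returns True only when every page is empty of text, so partial text PDFs are filtered out along with full text ones. Raising min_chars may help if some scans contain a few stray characters, such as a stamped page number.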