I want to read in a PDF file using PyMuPDF. All I need is plain text (no need to extract info on color, fonts, tables etc.).
I have tried the following:
import fitz
from fitz import TextPage
ifile = "C:\\user\\docs\\aPDFfile.pdf"
doc = TextPage(ifile)
>>> TypeError: in method 'new_TextPage', argument 1 of type 'struct fz_rect_s *'
That doesn't work, so then I tried:
doc = fitz.Document(ifile)
t = TextPage.extractText(doc)
>>> AttributeError: 'Document' object has no attribute '_extractText'
That doesn't work either.
Then I found a great blog post by one of the authors of PyMuPDF which has detailed code for extracting text in the order it is read from the file. But every time I run this code with a different PDF I get KeyError: 'lines' (line 81 in the code) or KeyError: "bbox" (line 60 in the code).
I can't post the PDFs here because they are confidential, though I appreciate they would be useful to have. Is there any way I can do the simplest task PyMuPDF is meant for: extracting plain text from a PDF, unordered or otherwise (I don't mind much)?
Following your example, extracting plain text with PyMuPDF looks like this:
import fitz  # PyMuPDF

filepath = "C:\\user\\docs\\aPDFfile.pdf"
text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        # plain text of one page; older PyMuPDF releases name this method getText()
        text += page.get_text()
print(text)
The blog post you followed is great but a little outdated; some of the methods it uses are deprecated.
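Because the method names have changed between releases, here is a minimal sketch of a version-tolerant variant. It assumes only that a page object exposes either get_text() (newer releases) or getText() (older releases). The helper names page_text, document_text and read_pdf are made up for illustration and are not part of PyMuPDF.

```python
def page_text(page):
    """Return the plain text of one page, using whichever method name exists."""
    # Assumption: newer releases offer get_text(), older ones getText().
    extract = getattr(page, "get_text", None) or getattr(page, "getText", None)
    if extract is None:
        raise AttributeError("page object has neither get_text() nor getText()")
    return extract()


def document_text(doc):
    """Concatenate the plain text of every page in an iterable of pages."""
    return "".join(page_text(page) for page in doc)


def read_pdf(path):
    """Open the PDF at `path` with PyMuPDF and return its text as one string."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(path) as doc:
        return document_text(doc)
```

With this in place, print(read_pdf("C:\\user\\docs\\aPDFfile.pdf")) should print the same text as the loop above. Joining the page strings once also avoids repeated string concatenation on large documents.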