Issues with PyMuPDF extracting plain text

PyRsquared · Jun 4, 2018 · Viewed 8.2k times

I want to read in a PDF file using PyMuPDF. All I need is plain text (no need to extract info on color, fonts, tables etc.).

I have tried the following:

import fitz
from fitz import TextPage
ifile = "C:\\user\\docs\\aPDFfile.pdf"
doc = TextPage(ifile)
>>> TypeError: in method 'new_TextPage', argument 1 of type 'struct fz_rect_s *'

That doesn't work, so then I tried:

doc = fitz.Document(ifile)
t = TextPage.extractText(doc)
>>> AttributeError: 'Document' object has no attribute '_extractText'

which again doesn't work.

Then I found a great blog post by one of the authors of PyMuPDF, which has detailed code for extracting text in the order it is read from the file. But every time I run this code with a different PDF, I get KeyError: 'lines' (line 81 in the code) or KeyError: "bbox" (line 60 in the code).

I can't post the PDFs here because they are confidential, though I appreciate they would be useful to have. But is there any way I can just do the simplest task PyMuPDF is meant to do: extract plain text from a PDF, unordered or otherwise (I don't mind much)?

Answer

Vasko · Jan 14, 2019

Following your example, plain text can be extracted with PyMuPDF like this:

import fitz  # PyMuPDF

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        # get_text() is the current name; older PyMuPDF releases call it getText()
        text += page.get_text()
print(text)

The blog post you followed is great but a little outdated; some of its methods are deprecated.
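
As for the KeyError: 'lines' in the question, a likely cause is image blocks: in PyMuPDF's dictionary output, text blocks carry a "lines" list but image blocks do not, so indexing block["lines"] fails on any page that contains an image. Below is a minimal sketch of a tolerant walker, assuming the dictionary has the "blocks" / "lines" / "spans" / "text" layout. The helper name text_from_page_dict and the sample dictionary are made up for illustration and are not part of PyMuPDF.

```python
def text_from_page_dict(page_dict):
    """Collect the text of every line, skipping blocks that have no "lines"."""
    lines_out = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            lines_out.append("".join(span.get("text", "") for span in spans))
    return "\n".join(lines_out)


# Hand-written stand-in for a page dictionary (hypothetical data, not real output).
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Hello "}, {"text": "world"}]},
            {"spans": [{"text": "Second line"}]},
        ]},
        {"type": 1, "image": b""},  # image block: no "lines" key
    ]
}

print(text_from_page_dict(sample))
```

With a real document you would pass in page.get_text("dict") for each page instead of the sample. If all you need is plain text, the loop in the answer above is simpler.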