How to extract a PDF's text using pdfrw

Roman · Feb 7, 2017 · Viewed 8.8k times

Can pdfrw extract the text out of a document?

I was thinking something along the lines of

from pdfrw import PdfReader
doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something

Answer

maxTwo · Apr 22, 2018

The docs explain how to extract the text. However, what you get back is just a bytestream. You could iterate over the pages and decode them individually.

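import zlib
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
for page in doc.pages:
    # This is a str holding one byte per character, not a bytes object
    bytestream = page.Contents.stream
    # Assumes a single Flate-compressed content stream; other filters need other handling
    raw = zlib.decompress(bytestream.encode('latin-1'))
    # raw now holds PDF drawing operators, not plain text; do something with it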

Edit: It may be worth noting that, according to the author, pdfrw does not yet support text decompression because of its complexity.
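To make the "somehow decode" step a bit more concrete, here is a rough sketch using only the standard library. It assumes the stream is Flate-compressed and that the text is written as plain literal strings with the `Tj` and `TJ` operators. The helper names `stream_to_bytes` and `rough_text` are made up for this example and are not part of pdfrw. Hex strings, custom font encodings, and nested parentheses are not handled, so treat the output as a best guess rather than a real extraction.

```python
import re
import zlib

# A literal string "( ... )" with backslash escapes; nested unescaped parens are not handled.
_LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)")
# Text-showing operators: "(text) Tj" and "[(te) -20 (xt)] TJ".
_SHOW = re.compile(
    rb"(\((?:\\.|[^\\()])*\))\s*Tj|\[((?:\\.|[^\]\\])*)\]\s*TJ", re.S
)


def stream_to_bytes(stream, compressed=True):
    """Turn a str with one byte per character into bytes, inflating if needed."""
    data = stream.encode("latin-1")
    return zlib.decompress(data) if compressed else data


def _unescape(literal):
    """Strip the outer parentheses and undo the \\(, \\) and \\\\ escapes only."""
    return re.sub(rb"\\([()\\])", rb"\1", literal[1:-1])


def rough_text(content):
    """Collect the string operands of Tj/TJ operators from a content stream."""
    parts = []
    for match in _SHOW.finditer(content):
        if match.group(1) is not None:
            parts.append(_unescape(match.group(1)))
        else:
            parts.extend(_unescape(s) for s in _LITERAL.findall(match.group(2)))
    # Assumption: single-byte, Latin-1 compatible font encoding.
    return b"".join(parts).decode("latin-1")


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj [(Wor) -20 (ld)] TJ ET"
    packed = zlib.compress(sample).decode("latin-1")  # mimics a str-typed stream
    print(rough_text(stream_to_bytes(packed)))
```

With pdfrw, the idea would be to pass each `page.Contents.stream` from the loop above through `stream_to_bytes` and then `rough_text`. Pages whose `Contents` is an array of streams, or that use other filters, would need extra handling.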