Warnings on pdfminer

rodrigocf · Apr 21, 2015

I found this script on Stack Overflow and (slightly) modified it to work on Python 3.3:

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    string = retstr.getvalue()
    retstr.close()
    return string


print(convert_pdf('abc.pdf'))

It works fine; however, I seem to have two issues:

  • While running the script I get tons of warnings:

    WARNING:root:undefined: PDFCIDFont: basefont='LKOELN+Wingdings-Regular', cidcoding='Adobe-Identity', 139
    WARNING:root:undefined: PDFCIDFont: basefont='LKKPCF+Wingdings2', cidcoding='Adobe-Identity', 132

In the printed text these show up as (cid:139). How do I catch these warnings and replace that text with something else?

  • Note that I have a codec line. In the original script, codec is passed to TextConverter alongside rsrcmgr, retstr and laparams; however, when I do the same I get:

    Traceback (most recent call last):
      File "C:/Users/rodrigo/Desktop/csp_pdf/csp_pdf2.py", line 46, in <module>
        convert_pdf('abc.pdf')
      File "C:/Users/rodrigo/Desktop/csp_pdf/csp_pdf2.py", line 33, in convert_pdf
        device = TextConverter(rsrcmgr, retstr, codec = 'utf-8', laparams=laparams)
    TypeError: __init__() got an unexpected keyword argument 'codec'

Is this related to the first issue?

Thanks!

Answer

Pullie · Nov 25, 2015

Unfortunately, pdfminer3k logs to the Python root logger; IMHO PDFMiner should implement logging properly. As a result, you cannot silence it the usual way, by raising the level of a library-specific logger like this:

logging.getLogger("pdfminer").setLevel(logging.WARNING)

Bummer!
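
To see why the named-logger approach has no effect, here is a small sketch that needs no PDF at all. It assumes, as the WARNING:root: prefix in the question suggests, that the library calls the module-level logging.warning(). ListHandler and noisy_library_call are hypothetical stand-ins for illustration, not part of pdfminer.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collects log records as strings in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(
            f"{record.levelname}:{record.name}:{record.getMessage()}"
        )


def noisy_library_call():
    # Stand-in for the library (an assumption): module-level
    # logging.warning() creates its record on the root logger.
    logging.warning("undefined: PDFCIDFont")


root = logging.getLogger()
root.setLevel(logging.WARNING)  # the default level, set explicitly here
handler = ListHandler()
root.addHandler(handler)

# Raising the level of a logger named "pdfminer" does not help,
# because the record never passes through that logger.
logging.getLogger("pdfminer").setLevel(logging.ERROR)
noisy_library_call()

print(handler.messages)
```

The warning still arrives, and its logger name is root rather than pdfminer.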

I did this and it works™:

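    # propagate is a per-logger attribute, so set it on your own logger
    # ("myapp" is a placeholder name), not on the logging module itself.
    logging.getLogger("myapp").propagate = False
    logging.getLogger().setLevel(logging.ERROR)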

This sets the root logger's level to ERROR. That stops PDFMiner's warnings, since it logs to the root logger, but not your own logging (provided your own loggers set their own level; a logger without one inherits the root's).
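
A minimal sketch of that behaviour, again without any PDF involved. The logger names "myapp" and "other" and the ListHandler helper are placeholders chosen for illustration; the module-level logging.warning() call stands in for the library.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collects log records as strings in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(
            f"{record.levelname}:{record.name}:{record.getMessage()}"
        )


root = logging.getLogger()
handler = ListHandler()
root.addHandler(handler)

# Your own logger, with an explicit level of its own.
app_log = logging.getLogger("myapp")
app_log.setLevel(logging.INFO)

# The workaround: raise the root logger's level.
root.setLevel(logging.ERROR)

logging.warning("from the library")  # dropped: root level is ERROR
app_log.info("from my code")         # kept: "myapp" has its own level

# A logger without its own level inherits ERROR from the root.
logging.getLogger("other").warning("no level of its own")  # dropped

print(handler.messages)
```

Only the "myapp" record reaches the handler, which is why giving your own loggers an explicit level matters once the root level is raised.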

I also needed to set propagation to False because, after PDFMiner had run, my own log entries showed up twice. The duplicates came from the root logger.
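
A likely explanation for the duplicates, offered as an assumption rather than something confirmed for this setup: when the root logger has no handlers, the module-level logging.warning() calls logging.basicConfig(), which attaches a handler to the root logger. Records from a logger that has its own handler then propagate up and are emitted a second time. The sketch below imitates this with a shared ListHandler (a hypothetical helper) and a placeholder logger name "myapp".

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collects log records as strings in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(
            f"{record.levelname}:{record.name}:{record.getMessage()}"
        )


sink = ListHandler()  # shared sink, so duplicates are easy to count

# Stands in for the handler that basicConfig() would install on the root.
root = logging.getLogger()
root.addHandler(sink)

# Your own logger with its own handler and level.
app_log = logging.getLogger("myapp")
app_log.setLevel(logging.INFO)
app_log.addHandler(sink)

app_log.info("first")      # emitted by myapp's handler and by the root's

app_log.propagate = False  # stop handing records to the root logger
app_log.info("second")     # emitted once, by myapp's handler only

print(sink.messages)
```

Note that propagate is read from Logger objects, so it has to be set on the logger whose records are being duplicated.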