Reading pdf files line by line using python

Question 1

python pypdf

Rahul Pipalia · Jul 8, 2017 · Viewed 12.4k times · Source

Answer

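import re
import PyPDF2

# Legacy PyPDF2 names are used here; newer releases rename them to
# PdfReader, len(reader.pages), reader.pages[i] and extract_text().
# Open in binary mode; the with-block closes the file when done.
with open('E://drive-download-20171015T225604Z-001/test_case/test2/try/xyz.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num = pdfReader.numPages
    print("Number of pages: " + str(num))
    for i in range(num):
        pageObj = pdfReader.getPage(i)
        text = pageObj.extractText().lower()
        # splitlines() yields whole lines; iterating over the string itself
        # yields single characters, so "abc" could never match.
        for line in text.splitlines():
            if re.search("abc", line):
                print(line)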

I use this to iterate through a PDF page by page, search each page's text for key terms, and process the matching lines further.
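The line-matching step can also be separated from the PDF library, which makes it easy to check without a PDF file at hand. The following is a minimal sketch, not part of the original answer: the function name find_matching_lines and its parameters are hypothetical, and it assumes the text of each page has already been extracted into a plain string (for example with extractText() as above).

```python
import re
from typing import Iterable, Iterator, Tuple


def find_matching_lines(page_texts: Iterable[str], pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, line) for each line that matches pattern.

    page_texts holds one already-extracted text string per page
    (hypothetical input; any PDF library can produce it).
    Page numbers start at 1 and matching is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for page_number, text in enumerate(page_texts, start=1):
        # splitlines() splits on line boundaries, so each item is a whole line.
        for line in text.splitlines():
            if regex.search(line):
                yield page_number, line.strip()


if __name__ == "__main__":
    # Stand-in page texts; a real run would pass the text of each PDF page.
    sample_pages = ["Intro\nThe ABC rule applies here", "", "no match\nabc again"]
    for page_number, line in find_matching_lines(sample_pages, "abc"):
        print(page_number, line)
```

Compiling the pattern with re.IGNORECASE avoids lowercasing the whole page first, so the matched lines keep their original capitalization. An empty page string simply produces no lines, which also covers pages where extraction returns nothing.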

Question 2

I used the following code to read a PDF file, but it does not extract any text. What could be the reason?

>>> import os
>>> from PyPDF2 import PdfFileReader, PdfFileWriter
>>> path = "/Users/Rahul/Desktop/Dfiles/"
>>> dirs = os.listdir(path)
>>> directory = "/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf"
>>> f = open(directory, 'rb')
>>> reader = PdfFileReader(f)
>>> contents = reader.getPage(0).extractText().split('\n')
>>> f.close()
>>> print contents

The output is [u''] instead of the text of the page.
