PyPDF2 insists on removing all the spaces

Steve · Apr 28, 2016

I have read a number of other Stack Overflow answers and have yet to find a satisfactory answer to this, although it has been asked before. When I use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous string. Has anyone made any progress in figuring out how to avoid this? Below is the code:

    import PyPDF2
    import pandas as pd
    import struct
    from nltk import word_tokenize

    pdfFileObj = open("notes.pdf", 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # the page count reads fine
    print(type(pdfReader.numPages))

    # read in the first page
    pageObj = pdfReader.getPage(0)

    # the extracted text comes back with the spaces missing
    print(pageObj.extractText())

Below is a sample of the output:

2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability
instatisticaldecisionmaking.ExamplesoftheuseofProbabilityinStatistics.
3)Datasummarization(graphicalandnumerical)

4)Probabilityandrandomvariables

Answer

Steve · May 6, 2016

I never figured out how to stop PyPDF2 from removing the spaces; it is a very unwieldy library. The suggestion to use PDFMiner was the most helpful answer I found. It is easy to understand and has better documentation. Below is a link for anyone having the same issue.

http://survivalengineer.blogspot.ie/2014/04/parsing-pdfs-in-python.html
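
For anyone who wants a starting point without reading the whole post, here is a minimal sketch of the PDFMiner route. It is not the code from the linked post. It assumes the maintained pdfminer.six fork is installed (pip install pdfminer.six) and uses its high-level extract_text helper. The names extract_page_text and space_ratio are hypothetical helpers I made up for illustration; space_ratio is just a rough way to check whether the extracted text still has its spaces.

```python
def extract_page_text(path, page_index=0):
    """Return the text of a single page using pdfminer.six (third-party)."""
    # Imported inside the function so that space_ratio below can be used
    # even on a machine where pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    # page_numbers takes zero-based page indices.
    return extract_text(path, page_numbers=[page_index])


def space_ratio(text):
    """Fraction of characters (ignoring newlines) that are spaces.

    Hypothetical sanity check: a value of 0.0 means the words have been
    merged together, as in the PyPDF2 output shown in the question.
    """
    chars = [c for c in text if c != "\n"]
    if not chars:
        return 0.0
    return sum(c == " " for c in chars) / len(chars)
```

Calling print(extract_page_text("notes.pdf")) should print the first page, and space_ratio on the result gives a quick indication of whether the spaces survived. Whether they do still depends on how the PDF itself was produced, so treat this as something to try rather than a guaranteed fix.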