I have read a number of other Stack Overflow answers and have yet to find a satisfactory solution, although this has been asked before. When I use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous string. Has anyone made any progress in figuring out how to avoid this? Below is the code:
import PyPDF2
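# Open the PDF in binary mode; the with block closes the file when done
with open("notes.pdf", "rb") as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # The page count is read without problems
    print(type(pdfReader.numPages))
    # Read the first page and print its extracted text
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())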
Below is a sample of the output:
2)Explanationofthedifferencebetweenprobabilityandstatistics.Theroleofprobability
instatisticaldecisionmaking.ExamplesoftheuseofProbabilityinStatistics.
3)Datasummarization(graphicalandnumerical)
4)Probabilityandrandomvariables
I never figured out how to get PyPDF2 to keep the spaces between words; it is a very unwieldy library. I found the answer suggesting PDFMiner to be the most helpful. PDFMiner is easy to understand and has better documentation. Below is a link for anyone having the same issue as me.
http://survivalengineer.blogspot.ie/2014/04/parsing-pdfs-in-python.html
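For anyone who wants a quick starting point, here is a minimal sketch of the PDFMiner route. It assumes the maintained pdfminer.six fork is installed (pip install pdfminer.six) and uses its high-level extract_text function, which may differ from the lower-level approach in the linked post. The helper name extract_pdf_text is my own, not part of the library, and I have not checked how well it handles every PDF.

```python
def extract_pdf_text(path, page_numbers=None):
    """Return the text of a PDF as a string using pdfminer.six.

    extract_pdf_text is a hypothetical helper name; page_numbers is an
    optional list of zero-based page indices (None means all pages).
    """
    # Imported inside the function so this sketch can be loaded even
    # when pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=page_numbers)


# Usage (assumes notes.pdf exists in the working directory):
# print(extract_pdf_text("notes.pdf", page_numbers=[0]))
```

Passing page_numbers=[0] corresponds to getPage(0) in the PyPDF2 code above, so the two outputs can be compared page by page.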