I want to extract text from pdf file using Python and PYPDF package. This is my pdf fie and this is my code:
import PyPDF2
opened_pdf = PyPDF2.PdfFileReader('test.pdf', 'rb')
p=opened_pdf.getPage(0)
p_text= p.extractText()
# extract data line by line
P_lines=p_text.splitlines()
print P_lines
My problem is P_lines cannot extract data line by line and results in one giant string. I want to extract text line by line to analyze it. Any suggestion on how to improve it? Thanks! This is the string that code returns:
[u'Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)** Information is based on the maximum potential for concentration and thus the total may be over 100%* Total Water Volume sources may include fresh water, produced water, and/or recycled water0.01271%72.00%7732-18-5Water0.00071%4.00%1310-73-2Sodium Hydroxide0.00424%24.00%533-74-4DazomatBiocidePumpcoPlexcide 24L0.00828%75.00%Organic phosphonic acid salts0.00276%25.00%67-56-1Methyl AlcoholScale InhibitorPumpcoPlexaid 6730.00807%30.00%7732-18-5Water0.00188%7.00%Polyethoxylated alcohol surfactants0.00753%28.00%9003-06-9Ammonium Salts0.00941%35.00%64742-47-8Petroleum DistillateFriction ReducerPumpcoPlexslick 9210.05029%60.00%7732-18-5Water0.03353%40.00%7647-01-0Hydrogen ChlorideHydrochloric AcidPumpcoHCL9.84261%100.00%14808-60-7Crystaline SilicaProppantPumpcoSand90.01799%100.00%7732-18-5WaterCommentsMaximumIngredientConcentrationin HF Fluid(% by mass)**MaximumIngredientConcentrationin Additive(% by mass)**Chemical AbstractService Number(CAS #)IngredientsPurposeSupplierTrade NameHydraulic Fracturing Fluid Composition:2,608,032Total Water Volume (gal)*:7,595True Vertical Depth (TVD):GasProduction Type:NAD27Long/Lat Projection:32.558525Latitude:-97.215242Longitude:Ole Gieser Unit D 6HWell Name and Number:XTO EnergyOperator Name:42-439-35084API Number:TarrantCounty:TexasState:12/10/2010Fracture DateHydraulic Fracturing Fluid Product Component Information Disclosure']
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
retstr.close()
return text
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))
Output
Hydraulic Fracturing Fluid Product Component Information Disclosure
Fracture Date State: County: API Number: Operator Name: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type: True Vertical Depth (TVD): Total Water Volume (gal)*:
12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H -97.215242 32.558525 NAD27 Gas 7,595 2,608,032
Hydraulic Fracturing Fluid Composition:
Trade Name
Supplier
Purpose
Ingredients
Chemical Abstract Service Number
(CAS #)
Maximum Ingredient
Concentration
in Additive ( by mass)**
Comments
Maximum Ingredient
Concentration
in HF Fluid ( by mass)**
Water Sand HCL
Pumpco Pumpco
Proppant Hydrochloric Acid
Plexslick 921
Pumpco
Friction Reducer
Plexaid 673
Pumpco
Scale Inhibitor
Plexcide 24L
Pumpco
Biocide
Crystaline Silica
Hydrogen Chloride Water
Petroleum Distillate Ammonium Salts Polyethoxylated alcohol surfactants Water
Methyl Alcohol Organic phosphonic acid salts
Dazomat Sodium Hydroxide Water
7732-18-5 14808-60-7
7647-01-0 7732-18-5
64742-47-8 9003-06-9
7732-18-5
67-56-1
533-74-4 1310-73-2 7732-18-5
100.00 100.00
90.01799 9.84261
40.00 60.00
35.00 28.00 7.00 30.00
25.00 75.00
24.00 4.00 72.00
0.03353 0.05029
0.00941 0.00753 0.00188 0.00807
0.00276 0.00828
0.00424 0.00071 0.01271
- Total Water Volume sources may include fresh water, produced water, and/or recycled water ** Information is based on the maximum potential for concentration and thus the total may be over 100
Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)