I am trying to extract tables from a PDF, and Tabula has worked well for that.
The issue I am facing is that when a table spans multiple pages, Tabula treats the content on each new page as a separate table.
Is there any way or logic to overcome this?
from tabula import read_pdf
df = read_pdf("SampleTableFormat2pages.pdf", multiple_tables=True, pages="all")
print(len(df))
print(df)
[ 0 1 2 3 4
0 Label1 Label2 Label3 Label4 Label5
1 Row11 Row12 Row13 Row14 Row15
2 Row21 Row22 Row23 Row24 Row25
3 Row31 Row32 Row33 Row34 Row35, 0 1 2 3 4
0 Row41 Row42 Row43 Row44 Row45
1 Row51 Row52 Row53 Row54 Row55]
Is there any logic to make Tabula recognize table boundaries and tables that continue onto the next page?
Or is there any other library that can help with this?
I suggest reading one page at a time and concatenating the results into the final table. You can use this function to get the number of pages in the PDF:
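import re

def count_pdf_pages(file_path):
    # Bytes pattern, because the file is read in binary mode.
    # [^s] keeps "/Type /Pages" (the page-tree node) from being counted as a page.
    rxcountpages = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))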
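As a quick sanity check of the regular expression, it can be run against a hand-written byte string instead of a real PDF. The sample bytes and the helper name `count_pages_in_bytes` below are made up for illustration. Note that this kind of raw scan can miscount when page objects are stored inside compressed object streams, so treat the result as an estimate.

```python
import re

# Same pattern as in count_pdf_pages, written as a bytes pattern.
PAGE_RE = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)

def count_pages_in_bytes(data):
    """Count '/Type /Page' markers, skipping the '/Type /Pages' tree node."""
    return len(PAGE_RE.findall(data))

# Made-up fragment: one page-tree node and two page objects.
sample = (
    b"1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >> endobj\n"
    b"2 0 obj << /Type /Page /Parent 1 0 R >> endobj\n"
    b"3 0 obj << /Type/Page /Parent 1 0 R >> endobj\n"
)
print(count_pages_in_bytes(sample))
```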
Now loop through each of the pages that contain the table:
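import pandas as pd
import tabula
pages = count_pdf_pages("SampleTableFormat2pages.pdf")
frames = []
for pageiter in range(pages):
    # If you want to change the table by editing the columns, you can do that here.
    frames += tabula.read_pdf("SampleTableFormat2pages.pdf", pages=pageiter + 1, guess=False, multiple_tables=True)  # list of DataFrames for this page
df_combine = pd.concat(frames, ignore_index=True)  # keeps page order; again, you can choose between merge or concat as per your need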
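If the PDF contains more than one table, blindly concatenating every page would glue unrelated tables together. One possible heuristic, sketched below, is to treat a table as a continuation of the previous one only when both have the same number of columns. The function name `merge_continuations` is hypothetical, and the rule itself is an assumption that will not hold for every document. The sketch works on plain lists of rows so it runs without Tabula; a DataFrame can be converted with `df.values.tolist()`.

```python
def merge_continuations(tables):
    """Join each table to the previous one when it looks like a continuation.

    `tables` is a list of tables, each a list of rows (lists of cells).
    Assumption: a table continues the previous one if both have the
    same number of columns.
    """
    merged = []
    for rows in tables:
        if not rows:
            continue  # skip empty extractions
        if merged and len(merged[-1][0]) == len(rows[0]):
            merged[-1].extend(rows)    # same width: treat as a continuation
        else:
            merged.append(list(rows))  # different width: start a new table
    return merged

page1 = [["Label1", "Label2"], ["Row11", "Row12"]]
page2 = [["Row21", "Row22"]]
other = [["A", "B", "C"]]
result = merge_continuations([page1, page2, other])
print(len(result))
print(result[0])
```

Two adjacent but unrelated tables that happen to have the same width would be merged by this rule, so a stricter check (for example comparing column positions or data types) may be needed for real documents.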