Tabula-py is not splitting columns right

giga · Nov 17, 2017 · Viewed 10.3k times

I've just discovered the joy of tabula-py (and tabula-java, of course) for extracting tables from PDFs. I am now writing a script for my job that reads some data from a PDF table, cleans it a little, and then exports it to Excel. The PDF I am using has the same format every day, and the table is always in the same area. To detect the area, I use tabula.exe: I select the table, check the preview (which looks good), and then export the script to see the -a parameter that tabula.exe uses. I then use that value in my Python call:

    df = tabula.read_pdf(os.fsdecode(directory)+filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491",
                         pages=2, pandas_options={'header': None})

I am using the encoding parameter because the default utf-8 raises an error, and stream=True because the stream method is the one that gives a clean extracted table in tabula.exe. However, the resulting dataframe has a problem: the first 2 columns, which appear correctly as 2 separate columns in the tabula.exe preview, come out as a single column, so names and values get mixed together.
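For reference, the step of reading the -a value out of the exported script can be sketched as below. This is only a minimal sketch: the helper name area_from_script and the example command line are assumptions made for illustration, not something taken from the actual exported script. It looks for tabula-java's -a / --area option and returns the four coordinates as floats in (top, left, bottom, right) order.

```python
import shlex


def area_from_script(command):
    """Return the -a/--area coordinates of a tabula-java command as floats.

    Hypothetical helper: it only splits the command line and converts the
    comma-separated value that follows -a or --area.
    """
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token in ("-a", "--area") and i + 1 < len(tokens):
            values = [float(v) for v in tokens[i + 1].split(",")]
            if len(values) != 4:
                raise ValueError("expected top,left,bottom,right")
            return values
    raise ValueError("no -a/--area option found")


# Assumed example of the kind of line an exported script might contain.
example = 'java -jar tabula-java.jar -a 81.106,302.475,384.697,552.491 -p 2 "$1"'
print(area_from_script(example))
```

Joining the returned floats with commas gives back the same area string used in the call above.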

Any idea why the same area yields 2 different results in tabula-py and tabula.exe? Thank you very much!

Answer

giga · Nov 18, 2017

Figured it out on GitHub: tabula-py has the "guess" option set to True by default. So to correct the discrepancy, just add guess=False, and the output will be the same as in tabula.exe:

    df = tabula.read_pdf(os.fsdecode(directory)+filename, encoding='ISO-8859-1',
                         stream=True, area="81.106,302.475,384.697,552.491",
                         pages=2, guess=False, pandas_options={'header': None})
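Since the PDF has the same layout every day, the fixed options could be kept in one place so that guess=False is never forgotten. The following is a rough sketch, not a definitive implementation: build_read_options and read_daily_table are hypothetical names, and the option values are simply the ones from the call above. tabula is imported inside the function because it needs tabula-py and Java to be installed.

```python
def build_read_options(area, page):
    """Collect the read_pdf keyword arguments used for the daily PDF.

    Hypothetical helper; the values mirror the call shown above.
    """
    return {
        "encoding": "ISO-8859-1",
        "stream": True,
        "area": area,
        "pages": page,
        "guess": False,  # do not let tabula guess the table region
        "pandas_options": {"header": None},
    }


def read_daily_table(pdf_path, area, page):
    """Read one fixed-area table with guessing turned off."""
    import tabula  # requires tabula-py and a Java runtime

    return tabula.read_pdf(pdf_path, **build_read_options(area, page))
```

A call would then look like read_daily_table(path, "81.106,302.475,384.697,552.491", 2). Note that newer tabula-py releases return a list of dataframes from read_pdf by default, so the result may need indexing depending on the installed version.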