Transfer and write Parquet with python and pandas got timestamp error

Neil Su picture Neil Su · Dec 22, 2018 · Viewed 8.4k times · Source

I tried to concat() two parquet file with pandas in python .
It can work , but when I try to write and save the Data frame to a parquet file ,it display the error :

 ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:

I checked the doc. of pandas, it default the timestamp syntax in ms when write the parquet file.
How can I white the parquet file with used schema after concat?
Here is my code:

import pandas as pd

table1 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')
table2 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')

table = pd.concat([table1, table2], ignore_index=True) 
table.to_parquet('./file.gzip', compression='gzip')

Answer

Axel picture Axel · Jul 30, 2019

Pandas already forwards unknown kwargs to the underlying parquet-engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work - I verified it for pandas v0.25.0 and pyarrow 0.13.0. For more keywords see the pyarrow docs.