I tried to concat() two Parquet files with pandas in Python.
The concat itself works, but when I try to write the resulting DataFrame to a Parquet file, it raises this error:
ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:
I checked the pandas docs, and it seems timestamps default to ms precision when writing a Parquet file.
How can I write the Parquet file with the existing schema after the concat?
Here is my code:
import pandas as pd
table1 = pd.read_parquet('path.parquet', engine='pyarrow')
table2 = pd.read_parquet('path.parquet', engine='pyarrow')
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')
Pandas has forwarded unknown kwargs to the underlying Parquet engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work. I verified it with pandas v0.25.0 and pyarrow 0.13.0. For more keywords, see the pyarrow docs.