How to write a partitioned Parquet file using Pandas

Ivan · Oct 22, 2018

I'm trying to write a Pandas dataframe to a partitioned file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols=['partone', 'parttwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation, I expected that partition_cols would be passed as kwargs to the pyarrow library. How can a partitioned file be written to local disk using pandas?

Answer

ostrokach · Oct 22, 2018

Pandas' DataFrame.to_parquet is a thin wrapper around table = pa.Table.from_pandas(...) followed by pq.write_table(table, ...) (see pandas/io/parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A small example frame standing in for your data.
df = pd.DataFrame({
    'partone': ['a', 'a', 'b', 'b'],
    'parttwo': ['x', 'y', 'x', 'y'],
    'value': [1, 2, 3, 4],
})
table = pa.Table.from_pandas(df)

# Writes a directory tree: output.parquet/partone=a/parttwo=x/...
pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)

For more info, see the pyarrow documentation.
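
To sanity-check the result, the partition directory reads back as a single dataset. A minimal sketch, assuming the output.parquet directory written above:

import pandas as pd

# The partition columns are reconstructed from the directory
# names and come back as ordinary (dictionary-encoded) columns.
df_roundtrip = pd.read_parquet('output.parquet', engine='pyarrow')
print(df_roundtrip)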

In general, I would always use the PyArrow API directly when reading or writing Parquet files, since the Pandas wrapper is rather limited in what it can do.
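
As one example of that extra flexibility, a recent pyarrow can prune partitions at read time via the filters argument to pq.read_table, rather than loading everything and filtering in pandas. A minimal sketch, reusing the output.parquet dataset from above:

import pyarrow.parquet as pq

# Only the partone=a directories are read; the other
# partitions are skipped entirely.
table = pq.read_table(
    'output.parquet',
    filters=[('partone', '=', 'a')],
)
print(table.to_pandas())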