Assign schema to pa.Table.from_pandas()

Carlos P Ceballos · Mar 30, 2018

I'm getting this error when converting a pandas DataFrame to Parquet using pyArrow:

ArrowInvalid('Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer

To find out which column is the problem, I built a new DataFrame in a for loop, starting with the first column and adding one more column on each iteration. The error comes from a column of dtype object whose values start with 0s. I guess that is why pyArrow tries to convert the column to int, but it fails because other values are UUIDs.

I'm trying to pass a schema (not sure if this is the way to go):

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

where schema is: df.dtypes

Answer

Alexander · Mar 30, 2018

Carlos, have you tried converting the column to one of the pandas types listed in the pyarrow pandas documentation (https://arrow.apache.org/docs/python/pandas.html)?

Can you post the output of df.dtypes?

If changing the pandas column type doesn't help, you can define a pyarrow schema to pass in.

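import pyarrow as pa

# Declare the Arrow type of every column explicitly.
fields = [
    pa.field('id', pa.int64()),
    pa.field('secondaryid', pa.int64()),
    pa.field('date', pa.timestamp('ms')),
]

my_schema = pa.schema(fields)

# sample_df is the pandas DataFrame being converted.
table = pa.Table.from_pandas(sample_df, schema=my_schema, preserve_index=False)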

More information here:

https://arrow.apache.org/docs/python/data.html
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
https://arrow.apache.org/docs/python/generated/pyarrow.schema.html