I'm getting this error when converting a pandas DataFrame to Parquet using pyarrow:
ArrowInvalid('Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
To find out which column is the problem, I built a new DataFrame in a for loop, starting with the first column and adding one more column on each iteration. I found that the error comes from a column of dtype object whose first values are 0s. I guess that's why pyarrow tries to convert the column to int, but it fails because the other values are UUIDs.
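As an alternative to adding columns one at a time, the mixed column can be located directly by looking at the Python types stored in each object column. This is only a sketch: the helper name mixed_type_columns and the sample data are made up for illustration, and it assumes the offending values are plain Python ints and strs.

```python
import pandas as pd

def mixed_type_columns(df):
    """Map each object-dtype column holding more than one Python type to its type names."""
    mixed = {}
    for name in df.columns:
        if df[name].dtype != object:
            continue  # only object columns can mix Python types
        type_names = set(df[name].dropna().map(lambda v: type(v).__name__))
        if len(type_names) > 1:
            mixed[name] = type_names
    return mixed

# Hypothetical data: "id" starts with integer 0s and then holds a UUID string.
df = pd.DataFrame({
    "id": [0, 0, "3f2b8c1e-0000-4000-8000-000000000001"],
    "count": [1, 2, 3],
})
result = mixed_type_columns(df)
print(result)
```

Any column reported here is a likely candidate for the ArrowInvalid error, since pyarrow infers a single type per column.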
I'm trying to pass a schema (not sure if this is the way to go):
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
where schema is df.dtypes.
Carlos, have you tried converting the column to one of the pandas types listed in the pyarrow pandas integration docs (https://arrow.apache.org/docs/python/pandas.html)?
Can you post the output of df.dtypes?
If changing the pandas column type doesn't help, you can define a pyarrow schema and pass it in:
import pyarrow as pa

# Example field names and types; match them to the columns of your DataFrame.
fields = [
    pa.field('id', pa.int64()),
    pa.field('secondaryid', pa.int64()),
    pa.field('date', pa.timestamp('ms')),
]
my_schema = pa.schema(fields)
table = pa.Table.from_pandas(sample_df, schema=my_schema, preserve_index=False)
More information here:
https://arrow.apache.org/docs/python/data.html
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
https://arrow.apache.org/docs/python/generated/pyarrow.schema.html