Error when trying to write DataFrame to feather. Does feather support list columns?

Ben G · Jan 24, 2019

I'm working with both R and Python and I want to write one of my pandas DataFrames as a feather so I can work with it more easily in R. However, when I try to write it as a feather, I get the following error:

ArrowInvalid: trying to convert NumPy type float64 but got float32

I double-checked my column types and the numeric columns are already float64:

In[1]
df.dtypes

Out[1]
id         object
cluster    int64
vector_x   float64
vector_y   float64

I get the same error regardless of using feather.write_dataframe(df, "path/df.feather") or df.to_feather("path/df.feather").

I saw this on GitHub but didn't understand if it was related or not: https://issues.apache.org/jira/browse/ARROW-1345 and https://github.com/apache/arrow/issues/1430

In the end, I can just save it as a csv and change the columns in R (or just do the whole analysis in Python), but I was hoping to use this.

Edit 1:

Still having the same issue despite the great advice below so updating what I've tried.

df[['vector_x', 'vector_y', 'cluster']] = df[['vector_x', 'vector_y', 'cluster']].astype(float)

df[['doc_id', 'text']] = df[['doc_id', 'text']].astype(str)

df[['doc_vector', 'doc_vectors_2d']] = df[['doc_vector', 'doc_vectors_2d']].astype(list)

df.dtypes

Out[1]:
doc_id           object
text             object
doc_vector       object
cluster          float64
doc_vectors_2d   object
vector_x         float64
vector_y         float64
dtype: object

Edit 2:

After much searching, it appears that the issue is that my cluster column is a list type made up of int64 integers. So I guess the real question is, does the feather format support lists?

Edit 3:

Just to tie this in a bow, feather does not support nested data types like lists, at least not yet.
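For anyone hitting the same limitation, one possible workaround is to expand the list column into ordinary scalar columns before writing. This is only a sketch: the tiny DataFrame below is made-up sample data, and it assumes every list in doc_vectors_2d has exactly two elements.

```python
import pandas as pd

# Made-up sample data standing in for the real DataFrame.
df = pd.DataFrame({
    "doc_id": ["a", "b"],
    "doc_vectors_2d": [[0.1, 0.2], [0.3, 0.4]],
})

# Expand the list column into one scalar column per position.
# Assumes every list has the same length (two elements here).
expanded = pd.DataFrame(
    df["doc_vectors_2d"].tolist(),
    index=df.index,
    columns=["vector_x", "vector_y"],
)

# Drop the nested column and keep only flat, scalar-typed columns.
flat = pd.concat([df.drop(columns=["doc_vectors_2d"]), expanded], axis=1)

# The flat frame can then be written as before (requires pyarrow):
# flat.to_feather("path/df.feather")
```

The nested column is dropped rather than converted, so anything that needs the original lists on the R side would have to rebuild them from the scalar columns.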

Answer

Uwe L. Korn · Jan 25, 2019

The problem in your case is the id column, whose dtype is object. Its values are Python objects, and they cannot be represented in a language-neutral format. Thus feather (actually the underlying Apache Arrow / pyarrow) tries to guess the DataType of the id column. The guess is based on the first objects it sees in the column. These are float64 NumPy scalars. Later, you have float32 scalars. Instead of coercing them to some common type, Arrow is stricter with types and fails.

You should be able to work around this problem by ensuring that all columns have a non-object dtype with df['id'] = df['id'].astype(float).
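To see which columns are affected before converting, it can help to list the element types hiding inside each object column. The following is a minimal sketch, not part of pandas or pyarrow: element_types is a hypothetical helper name, and the two-row DataFrame is made-up data that mimics a column mixing float64 and float32 scalars.

```python
import numpy as np
import pandas as pd

def element_types(df):
    """Map each object-dtype column to the sorted type names of its values.

    Hypothetical helper for illustration only.
    """
    return {
        col: sorted({type(v).__name__ for v in df[col]})
        for col in df.columns
        if df[col].dtype == object
    }

# Made-up data: an object column holding NumPy scalars of two widths.
df = pd.DataFrame({
    "id": pd.Series([np.float64(1.0), np.float32(2.0)], dtype=object),
    "cluster": [0, 1],
})

mixed = element_types(df)
print(mixed)

# Coerce the column to a single non-object dtype, as suggested above.
df["id"] = df["id"].astype(float)
print(df.dtypes)
```

A column that reports more than one type name, or a container type such as list, is a likely candidate for the kind of conversion error described in the question.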