Why does Pandas tell me that I have objects, although every item in the selected column is a string, even after explicit conversion?
This is my DataFrame:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id 56992 non-null values
attr1 56992 non-null values
attr2 56992 non-null values
attr3 56992 non-null values
attr4 56992 non-null values
attr5 56992 non-null values
attr6 56992 non-null values
dtypes: int64(2), object(5)
Five of them are dtype object. I explicitly convert those objects to strings:
for c in df.columns:
    if df[c].dtype == object:
        print("convert ", df[c].name, " to string")
        df[c] = df[c].astype(str)
Then, df["attr2"] still has dtype object, although type(df["attr2"].iloc[0]) reveals str, which is correct.
Pandas distinguishes between int64, float64 and object. What is the logic behind it when there is no dtype str? Why is a str covered by object?
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that size is 8 bytes. Strings, however, have no fixed length. So instead of storing the bytes of the strings in the ndarray directly, pandas uses an object ndarray, which stores pointers to the string objects. That is why the dtype of this kind of ndarray is object.
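To make the size argument concrete, here is a minimal sketch that uses NumPy directly. The array names are only for illustration, and the sizes mentioned in the comments assume NumPy's usual 4-bytes-per-character Unicode storage.

```python
import numpy as np

# Fixed-size numeric dtype: every element occupies exactly 8 bytes.
ints = np.array([1, 2, 3], dtype=np.int64)
print(ints.dtype, ints.itemsize)

# NumPy's native string dtype is fixed-width: each element is sized
# for the longest string in the array (3 characters here).
fixed = np.array(["a", "bcd"])
print(fixed.dtype, fixed.itemsize)

# Object dtype: the array stores references to ordinary Python objects,
# so the strings themselves can have any length.
objs = np.array(["a", "a much longer string"], dtype=object)
print(objs.dtype, type(objs[0]))
```

The elements of the object array are still plain Python str instances, which is consistent with type(...) reporting str while the column dtype reports object.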
Here is an example: