I am trying to understand pandas
MultiIndex
DataFrame
s and how to assign data to them. Specifically I'm interested in assigning entire blocks that match another smaller data frame.
ix = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b', 'c', 'd']])
df = pd.DataFrame(index=ix, columns=['1st', '2nd', '3rd'], dtype=np.float64)
df_ = pd.DataFrame(index=['a', 'b', 'c', 'd'], columns=['1st', '2nd', '3rd'], data=np.random.rand(4, 3))
df_
1st 2nd 3rd
a 0.730251 0.468134 0.876926
b 0.104990 0.082461 0.129083
c 0.993608 0.117799 0.341811
d 0.784950 0.840145 0.016777
df
is the same except that all the values are NaN
and there are two blocks A
and B
. Now if I want to assign the values from df_
to df
I would imagine I can do something like
df.loc['A',:] = df_ # Runs, does not work
df.loc[('A','a'):('A','d')] = df_ # AssertionError (??) 'Start slice bound is non-scalar'
df.loc[('A','a'):('A','d')] # No AssertionError (??)
idx = pd.IndexSlice
df.loc[idx['A', :]] = df_ # Runs, does not work
None of these work, they leave all the values in df
as NaN
, although df.loc[idx['A', :]]
gives me a slice of the data frame that exactly matches that of the sub frame (df_
). So is this a case of setting values on a view? Explicitly iterating over the index in df_
works
# this is fine
for v in df_.index:
df.loc[idx['A', v]] = df_.loc[v]
# this is also fine
for v in df_.index:
df.loc['A', v] = df_.loc[v]
Is it even possible to assign whole blocks like this (sort of like NumPy
)? If not, that's fine, I am simply trying to understand how the system works.
There's a related question about index slicers, but it's about assigning a single value to a masked portion of the DataFrame
, not about assigning blocks.
Pandas : Proper way to set values based on condition for subset of multiindex dataframe
When you use
df.loc['A', :] = df_
Pandas tries to align the index of df_
with the index of a sub-DataFrame of
df
. However, at the point in the code where alignment is performed, the
sub-DataFrame has a MultiIndex, not the single index you see as the result
of df.loc['A', :]
.
So the alignment fails because df_
has a single index, not the MultiIndex that
is needed. To see that the index of df_
is indeed the problem, note that
ix_ = pd.MultiIndex.from_product([['A'], ['a', 'b', 'c', 'd']])
df_.index = ix_
df.loc['A', :] = df_
print(df)
succeeds, yielding something like
A a 0.229970 0.730824 0.784356
b 0.584390 0.628337 0.318222
c 0.257192 0.624273 0.221279
d 0.787023 0.056342 0.240735
B a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
d NaN NaN NaN
Of course, you probably do not want to have to create a new MultiIndex every time you want to assign a block of values. So instead, to work around this alignment problem, you can use a NumPy array as the assignment value:
df.loc['A', :] = df_.values
Since df_.values
is a NumPy array and an array has no index, no alignment is
performed
and the assignment yields the same result as above. This trick of using a NumPy arrays when you don't want alignment of indexes
applies to many situations when using Pandas.
Note also that assignment-by-NumPy-array can also help you perform more complicated assignments such as to rows which are not contiguous:
idx = pd.IndexSlice
df.loc[idx[:,('a','b')], :] = df_.values
yields
In [85]: df
Out[85]:
1st 2nd 3rd
A a 0.229970 0.730824 0.784356
b 0.584390 0.628337 0.318222
c NaN NaN NaN
d NaN NaN NaN
B a 0.257192 0.624273 0.221279
b 0.787023 0.056342 0.240735
c NaN NaN NaN
d NaN NaN NaN
for example.