I have a dataframe df in pandas that was built using pandas.read_table from a CSV file. The dataframe has several columns and is indexed by one of them (that column is unique: each row has a distinct value in it).
How can I select rows of my dataframe based on a "complex" filter applied to multiple columns? I can easily select the slice of the dataframe where column colA is greater than 10, for example:
df_greater_than10 = df[df["colA"] > 10]
But what if I wanted a filter like: select the slice of df where any of the columns are greater than 10?
Or where the value for colA is greater than 10 but the value for colB is less than 5?
How are these implemented in pandas? Thanks.
I encourage you to pose these questions on the mailing list, but in any case, this is still very much a low-level affair that works with the underlying NumPy arrays. For example, to select rows where the value in any column exceeds, say, 1.5:
In [11]: df
Out[11]:
A B C D
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04 0.83935 0.15993 0.95911 -1.12959
2000-01-05 2.80215 -0.10858 -1.62114 -0.20170
2000-01-06 0.71670 -0.26707 1.36029 1.74254
2000-01-07 -0.45749 0.22750 0.46291 -0.58431
2000-01-10 -0.78702 0.44006 -0.36881 -0.13884
2000-01-11 0.79577 -0.09198 0.14119 0.02668
2000-01-12 -0.32297 0.62332 1.93595 0.78024
2000-01-13 1.74683 -1.57738 -0.02134 0.11596
2000-01-14 -0.55613 0.92145 -0.22832 1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18 0.73274 0.24387 0.88146 -0.94490
2000-01-19 0.56644 -0.49321 1.17584 -0.17585
2000-01-20 1.56441 0.62331 -0.26904 0.11952
2000-01-21 0.61834 0.17463 -1.62439 0.99103
2000-01-24 0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128 1.20401 1.08945
2000-01-26 -0.63115 0.76077 -0.92795 -2.17118
2000-01-27 1.37620 -1.10618 -0.37411 0.73780
2000-01-28 -1.40276 1.98372 1.47096 -1.38043
2000-01-31 0.54769 0.44100 -0.52775 0.84497
2000-02-01 0.12443 0.32880 -0.71361 1.31778
2000-02-02 -0.28986 -0.63931 0.88333 -2.58943
2000-02-03 0.54408 1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722 0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496 0.36352 1.11596 0.02293
2000-02-10 0.51054 0.97249 1.74501 0.20525
2000-02-11 0.10100 0.27722 0.65843 1.73591
In [12]: df[(df.values > 1.5).any(axis=1)]
Out[12]:
A B C D
2000-01-05 2.8021 -0.1086 -1.62114 -0.2017
2000-01-06 0.7167 -0.2671 1.36029 1.7425
2000-01-12 -0.3230 0.6233 1.93595 0.7802
2000-01-13 1.7468 -1.5774 -0.02134 0.1160
2000-01-14 -0.5561 0.9215 -0.22832 1.5663
2000-01-20 1.5644 0.6233 -0.26904 0.1195
2000-01-28 -1.4028 1.9837 1.47096 -1.3804
2000-02-10 0.5105 0.9725 1.74501 0.2052
2000-02-11 0.1010 0.2772 0.65843 1.7359
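The same "any column" filter can also be sketched without dropping down to the NumPy array, assuming a pandas version in which comparing a DataFrame to a scalar yields a boolean DataFrame that supports any(axis=1). The small frame below is made up for illustration and is not the data shown above.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"A": [0.5, 2.8, -0.3], "B": [0.1, -0.1, 0.6], "C": [0.9, -1.6, 1.9]},
    index=pd.to_datetime(["2000-01-04", "2000-01-05", "2000-01-12"]),
)

# Element-wise comparison gives a boolean DataFrame; any(axis=1) collapses
# it to one boolean per row (True if at least one column exceeds 1.5).
mask = (df > 1.5).any(axis=1)

# Indexing with the boolean Series keeps only the rows where mask is True.
selected = df[mask]
print(selected)
```

Because the mask is a Series aligned on the index, it can be stored and reused or combined with other masks.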
Multiple conditions have to be combined using & or | (and parentheses!):
In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]:
A B C D
2000-01-05 2.80215 -0.1086 -1.62114 -0.2017
2000-01-13 1.74683 -1.5774 -0.02134 0.1160
2000-01-20 1.56441 0.6233 -0.26904 0.1195
2000-01-27 1.37620 -1.1062 -0.37411 0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564
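Applied to the case in the question (colA greater than 10 but colB less than 5), a minimal sketch might look like the following. The data and index labels are invented for the example. The parentheses matter because in Python & and | bind more tightly than comparison operators such as > and <, so each comparison has to be grouped before the masks are combined.

```python
import pandas as pd

# Hypothetical frame using the column names from the question.
df = pd.DataFrame(
    {"colA": [12, 3, 15, 11], "colB": [4, 7, 9, 2]},
    index=["w", "x", "y", "z"],
)

# Rows where colA > 10 AND colB < 5; each comparison is parenthesized.
both = df[(df["colA"] > 10) & (df["colB"] < 5)]

# ~ negates a boolean mask: rows where NEITHER condition holds.
neither = df[~((df["colA"] > 10) | (df["colB"] < 5))]

print(both)
print(neither)
```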
I'd be very interested to have some kind of query API to make these kinds of things easier.
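For what it's worth, later pandas releases do include a DataFrame.query method that accepts a filter expression as a string. The sketch below assumes such a version is installed and reuses the column names from the question on made-up data; it is an illustration, not necessarily the API envisioned above.

```python
import pandas as pd

# Hypothetical frame using the column names from the question.
df = pd.DataFrame(
    {"colA": [12, 3, 15, 11], "colB": [4, 7, 9, 2]},
    index=["w", "x", "y", "z"],
)

# The expression is evaluated against the column names; `and` / `or`
# can be used in place of & / | inside the query string.
result = df.query("colA > 10 and colB < 5")
print(result)
```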