Best way to subset a pandas dataframe

Pierre-Eric Garcia picture Pierre-Eric Garcia · Jan 21, 2018 · Viewed 9.6k times · Source

Hey I'm new to Pandas and I just came across df.query().

Why people would use df.query() when you can directly filter your Dataframes using brackets notation ? The official pandas tutorial also seems to prefer the latter approach.

With brackets notation :

df[df['age'] <= 21]

With pandas query method :

df.query('age <= 21')

Besides some of the stylistic or flexibility differences that have been mentioned, is one canonically preferred - namely for performance of operations on large dataframes?

Answer

MaxU picture MaxU · Jan 21, 2018

Consider the following sample DF:

In [307]: df
Out[307]:
  sex  age     name
0   M   40      Max
1   F   35     Anna
2   M   29      Joe
3   F   18    Maria
4   F   23  Natalie

There are quite a few good reasons to prefer .query() method.

  • it might be much shorter and cleaner compared to boolean indexing:

    In [308]: df.query("20 <= age <= 30 and sex=='F'")
    Out[308]:
      sex  age     name
    4   F   23  Natalie
    
    In [309]: df[(df['age']>=20) & (df['age']<=30) & (df['sex']=='F')]
    Out[309]:
      sex  age     name
    4   F   23  Natalie
    
  • you can prepare conditions (queries) programmatically:

    In [315]: conditions = {'name':'Joe', 'sex':'M'}
    
    In [316]: q = ' and '.join(['{}=="{}"'.format(k,v) for k,v in conditions.items()])
    
    In [317]: q
    Out[317]: 'name=="Joe" and sex=="M"'
    
    In [318]: df.query(q)
    Out[318]:
      sex  age name
    2   M   29  Joe
    

PS there are also some disadvantages:

  • we can't use .query() method for columns containing spaces or columns that consist only from digits
  • not all functions can be applied or in some cases we have to use engine='python' instead of default engine='numexpr' (which is faster)

NOTE: Jeff (one of the main Pandas contributors and a member of Pandas core team) once said:

Note that in reality .query is just a nice-to-have interface, in fact it has very specific guarantees, meaning its meant to parse like a query language, and not a fully general interface.