How can we use pandas to generate min, max, mean, median, ...as new columns for the dataframe?

Haven Shi picture Haven Shi · Oct 20, 2017 · Viewed 8.7k times · Source

I just pick up pandas. I have a dataframe as follow:

     DEST  MONTH  PRICE   SOUR     TYPE  YEAR
0   DEST7      8    159  SOUR4  WEEKEND  2015
1   DEST2      9    391  SOUR1  WEEKEND  2010
2   DEST5      5    612  SOUR1  WEEKDAY  2013
3   DEST4     10    836  SOUR4  WEEKEND  2013
4   DEST4      4    689  SOUR3  WEEKEND  2013
5   DEST7      3    862  SOUR4  WEEKDAY  2014
6   DEST4      5    483  SOUR4  WEEKEND  2016
7   DEST2      2    489  SOUR3  WEEKEND  2017
8   DEST4      7    207  SOUR1  WEEKDAY  2012
9   DEST3     11    374  SOUR2  WEEKDAY  2015
10  DEST1      2    959  SOUR2  WEEKEND  2017
11  DEST5     10    969  SOUR3  WEEKDAY  2011
12  DEST8      3    645  SOUR4  WEEKEND  2013
13  DEST6      7    258  SOUR4  WEEKEND  2013
14  DEST8      5    955  SOUR4  WEEKDAY  2010
15  DEST1      3    568  SOUR4  WEEKEND  2013
16  DEST5      5    601  SOUR4  WEEKDAY  2016
17  DEST1      6    159  SOUR3  WEEKDAY  2011
18  DEST3     11    322  SOUR4  WEEKDAY  2013
19  DEST2     10    103  SOUR2  WEEKDAY  2012

I've put the code below, feel free to generate your own random dataframe:

import pandas as pd
import random
import numpy as np

df= pd.DataFrame({"YEAR": np.random.choice([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], 20, replace=True),
                  "MONTH": np.random.choice([_ for _ in range(1, 13)], 20, replace=True),
                  "TYPE": np.random.choice(['WEEKDAY', 'WEEKEND'], 20, replace=True),
                  "SOUR": np.random.choice(['SOUR1', 'SOUR2', 'SOUR3', 'SOUR4'], 20, replace=True),
                  "DEST": np.random.choice(['DEST1', 'DEST2', 'DEST3', 'DEST4','DEST5', 'DEST6', 'DEST7', 'DEST8'], 20, replace=True),
                  "PRICE": np.random.choice([_ for _ in range(100, 999)], 20, replace=True)})
print(df)

I want to generate min, max, mean, median, ...as new columns, add those columns to the dataframe. This is the aggregation code I tried:

aggregation={
         "PRICE":
    {
        "MIN": lambda x: x.min(skipna=True),
        "MAX":lambda x: x.max(skipna=True),
        "MEDIAN":lambda x: x.median(skipna=True),
        "MEAN":lambda x:x.mean(skipna=True)
    }
}

df1=df.groupby(["YEAR","MONTH","TYPE","SOUR","DEST"]).agg(aggregation).reset_index()
df1

But the output doesn't calculate any min, max, median, mean at all:

    YEAR MONTH     TYPE   SOUR   DEST PRICE                 
                                        MIN  MAX MEDIAN MEAN
0   2010     5  WEEKDAY  SOUR4  DEST8   955  955    955  955
1   2010     9  WEEKEND  SOUR1  DEST2   391  391    391  391
2   2011     6  WEEKDAY  SOUR3  DEST1   159  159    159  159
3   2011    10  WEEKDAY  SOUR3  DEST5   969  969    969  969
4   2012     7  WEEKDAY  SOUR1  DEST4   207  207    207  207
5   2012    10  WEEKDAY  SOUR2  DEST2   103  103    103  103
6   2013     3  WEEKEND  SOUR4  DEST1   568  568    568  568
7   2013     3  WEEKEND  SOUR4  DEST8   645  645    645  645
8   2013     4  WEEKEND  SOUR3  DEST4   689  689    689  689
9   2013     5  WEEKDAY  SOUR1  DEST5   612  612    612  612
10  2013     7  WEEKEND  SOUR4  DEST6   258  258    258  258
11  2013    10  WEEKEND  SOUR4  DEST4   836  836    836  836
12  2013    11  WEEKDAY  SOUR4  DEST3   322  322    322  322
13  2014     3  WEEKDAY  SOUR4  DEST7   862  862    862  862
14  2015     8  WEEKEND  SOUR4  DEST7   159  159    159  159
15  2015    11  WEEKDAY  SOUR2  DEST3   374  374    374  374
16  2016     5  WEEKDAY  SOUR4  DEST5   601  601    601  601
17  2016     5  WEEKEND  SOUR4  DEST4   483  483    483  483
18  2017     2  WEEKEND  SOUR2  DEST1   959  959    959  959
19  2017     2  WEEKEND  SOUR3  DEST2   489  489    489  489

How could I modify the python code to give correct output? Thanks.

And another question, if I want to add another column which calculate the average price group only by TYPE, SOUR, DEST, (not include MONTH OR YEAR), how to generate if I want to keep the group of TYPE, SOUR, DEST, MONTH, YEAR? My expected output:

    YEAR MONTH     TYPE   SOUR   DEST PRICE                 
                                        MIN  MAX MEDIAN MEAN AVG
0   2010     5  WEEKDAY  SOUR4  DEST8   ...  ... ...    ...  500
1   2010     9  WEEKEND  SOUR1  DEST2   ...  ... ...    ...  
2   2011     6  WEEKDAY  SOUR3  DEST5   ...  ... ...    ...  720
3   2011    10  WEEKDAY  SOUR3  DEST5   ...  ... ...    ...  720
4   2012     7  WEEKDAY  SOUR1  DEST4   ...  ... ...    ...  
5   2012    10  WEEKDAY  SOUR2  DEST2   ...  ... ...    ...  
6   2013     3  WEEKEND  SOUR4  DEST1   ...  ... ...    ...  
7   2013     3  WEEKDAY  SOUR4  DEST8   ...  ... ...    ...  500  
8   2013     4  WEEKEND  SOUR3  DEST4   ...  ... ...    ...  
9   2013     5  WEEKDAY  SOUR1  DEST5   ...  ... ...    ...  
10  2013     7  WEEKEND  SOUR4  DEST6   ...  ... ...    ...  
... 

Answer

Prikkel picture Prikkel · Oct 20, 2017

You're code actually does calculate the min, max, median and mean. However, since your using groupby on 5 columns. The chance of 2 rows containing the same values for all 5 columns with only 20 rows is very little.

Either increase the amount of data, so the groupby actually groups rows together, or groupby on less columns at a time.

To add a column with the AVG (mean) using only 3 columns for the groupby, do the groupby on the first DataFrame seperately and merge them on the three columns.

df1=df.groupby(["YEAR","MONTH","TYPE","SOUR","DEST"]).agg(aggregation).reset_index()
df2=df.groupby(["TYPE", "SOUR", "DEST"]).agg({"PRICE":{ "avg" : "mean"} } ).reset_index()
df3= pd.merge(df1, df2, on=["TYPE", "SOUR", "DEST"], how='left')