I just pick up pandas. I have a dataframe as follow:
DEST MONTH PRICE SOUR TYPE YEAR
0 DEST7 8 159 SOUR4 WEEKEND 2015
1 DEST2 9 391 SOUR1 WEEKEND 2010
2 DEST5 5 612 SOUR1 WEEKDAY 2013
3 DEST4 10 836 SOUR4 WEEKEND 2013
4 DEST4 4 689 SOUR3 WEEKEND 2013
5 DEST7 3 862 SOUR4 WEEKDAY 2014
6 DEST4 5 483 SOUR4 WEEKEND 2016
7 DEST2 2 489 SOUR3 WEEKEND 2017
8 DEST4 7 207 SOUR1 WEEKDAY 2012
9 DEST3 11 374 SOUR2 WEEKDAY 2015
10 DEST1 2 959 SOUR2 WEEKEND 2017
11 DEST5 10 969 SOUR3 WEEKDAY 2011
12 DEST8 3 645 SOUR4 WEEKEND 2013
13 DEST6 7 258 SOUR4 WEEKEND 2013
14 DEST8 5 955 SOUR4 WEEKDAY 2010
15 DEST1 3 568 SOUR4 WEEKEND 2013
16 DEST5 5 601 SOUR4 WEEKDAY 2016
17 DEST1 6 159 SOUR3 WEEKDAY 2011
18 DEST3 11 322 SOUR4 WEEKDAY 2013
19 DEST2 10 103 SOUR2 WEEKDAY 2012
I've put the code below, feel free to generate your own random dataframe:
import pandas as pd
import random
import numpy as np
df= pd.DataFrame({"YEAR": np.random.choice([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017], 20, replace=True),
"MONTH": np.random.choice([_ for _ in range(1, 13)], 20, replace=True),
"TYPE": np.random.choice(['WEEKDAY', 'WEEKEND'], 20, replace=True),
"SOUR": np.random.choice(['SOUR1', 'SOUR2', 'SOUR3', 'SOUR4'], 20, replace=True),
"DEST": np.random.choice(['DEST1', 'DEST2', 'DEST3', 'DEST4','DEST5', 'DEST6', 'DEST7', 'DEST8'], 20, replace=True),
"PRICE": np.random.choice([_ for _ in range(100, 999)], 20, replace=True)})
print(df)
I want to generate min, max, mean, median, ...as new columns, add those columns to the dataframe. This is the aggregation code I tried:
aggregation={
"PRICE":
{
"MIN": lambda x: x.min(skipna=True),
"MAX":lambda x: x.max(skipna=True),
"MEDIAN":lambda x: x.median(skipna=True),
"MEAN":lambda x:x.mean(skipna=True)
}
}
df1=df.groupby(["YEAR","MONTH","TYPE","SOUR","DEST"]).agg(aggregation).reset_index()
df1
But the output doesn't calculate any min, max, median, mean at all:
YEAR MONTH TYPE SOUR DEST PRICE
MIN MAX MEDIAN MEAN
0 2010 5 WEEKDAY SOUR4 DEST8 955 955 955 955
1 2010 9 WEEKEND SOUR1 DEST2 391 391 391 391
2 2011 6 WEEKDAY SOUR3 DEST1 159 159 159 159
3 2011 10 WEEKDAY SOUR3 DEST5 969 969 969 969
4 2012 7 WEEKDAY SOUR1 DEST4 207 207 207 207
5 2012 10 WEEKDAY SOUR2 DEST2 103 103 103 103
6 2013 3 WEEKEND SOUR4 DEST1 568 568 568 568
7 2013 3 WEEKEND SOUR4 DEST8 645 645 645 645
8 2013 4 WEEKEND SOUR3 DEST4 689 689 689 689
9 2013 5 WEEKDAY SOUR1 DEST5 612 612 612 612
10 2013 7 WEEKEND SOUR4 DEST6 258 258 258 258
11 2013 10 WEEKEND SOUR4 DEST4 836 836 836 836
12 2013 11 WEEKDAY SOUR4 DEST3 322 322 322 322
13 2014 3 WEEKDAY SOUR4 DEST7 862 862 862 862
14 2015 8 WEEKEND SOUR4 DEST7 159 159 159 159
15 2015 11 WEEKDAY SOUR2 DEST3 374 374 374 374
16 2016 5 WEEKDAY SOUR4 DEST5 601 601 601 601
17 2016 5 WEEKEND SOUR4 DEST4 483 483 483 483
18 2017 2 WEEKEND SOUR2 DEST1 959 959 959 959
19 2017 2 WEEKEND SOUR3 DEST2 489 489 489 489
How could I modify the python code to give correct output? Thanks.
And another question, if I want to add another column which calculate the average price group only by TYPE, SOUR, DEST, (not include MONTH OR YEAR), how to generate if I want to keep the group of TYPE, SOUR, DEST, MONTH, YEAR? My expected output:
YEAR MONTH TYPE SOUR DEST PRICE
MIN MAX MEDIAN MEAN AVG
0 2010 5 WEEKDAY SOUR4 DEST8 ... ... ... ... 500
1 2010 9 WEEKEND SOUR1 DEST2 ... ... ... ...
2 2011 6 WEEKDAY SOUR3 DEST5 ... ... ... ... 720
3 2011 10 WEEKDAY SOUR3 DEST5 ... ... ... ... 720
4 2012 7 WEEKDAY SOUR1 DEST4 ... ... ... ...
5 2012 10 WEEKDAY SOUR2 DEST2 ... ... ... ...
6 2013 3 WEEKEND SOUR4 DEST1 ... ... ... ...
7 2013 3 WEEKDAY SOUR4 DEST8 ... ... ... ... 500
8 2013 4 WEEKEND SOUR3 DEST4 ... ... ... ...
9 2013 5 WEEKDAY SOUR1 DEST5 ... ... ... ...
10 2013 7 WEEKEND SOUR4 DEST6 ... ... ... ...
...
You're code actually does calculate the min, max, median and mean. However, since your using groupby on 5 columns. The chance of 2 rows containing the same values for all 5 columns with only 20 rows is very little.
Either increase the amount of data, so the groupby actually groups rows together, or groupby on less columns at a time.
To add a column with the AVG (mean) using only 3 columns for the groupby, do the groupby on the first DataFrame seperately and merge them on the three columns.
df1=df.groupby(["YEAR","MONTH","TYPE","SOUR","DEST"]).agg(aggregation).reset_index()
df2=df.groupby(["TYPE", "SOUR", "DEST"]).agg({"PRICE":{ "avg" : "mean"} } ).reset_index()
df3= pd.merge(df1, df2, on=["TYPE", "SOUR", "DEST"], how='left')