I am trying to calculate time based aggregations in Pandas based on date values stored in a separate tables.
The top of the first table table_a looks like this:
COMPANY_ID DATE MEASURE
1 2010-01-01 00:00:00 10
1 2010-01-02 00:00:00 10
1 2010-01-03 00:00:00 10
1 2010-01-04 00:00:00 10
1 2010-01-05 00:00:00 10
Here is the code to create the table:
table_a = pd.concat(\
[pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
'COMPANY_ID': 1 , 'MEASURE': 10}),\
pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
'COMPANY_ID': 2 , 'MEASURE': 10})])
The second table, table_b looks like this:
COMPANY END_DATE
1 2010-03-01 00:00:00
1 2010-06-02 00:00:00
2 2010-03-01 00:00:00
2 2010-06-02 00:00:00
and the code to create it is:
table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),\
'COMPANY':(1,1,2,2)})
I want to be able to get the sum of the measure column for each COMPANY_ID for each 30 day period prior to the END_DATE in table_b.
This is (I think) the SQL equivalent:
select
b.COMPANY_ID,
b.DATE
sum(a.MEASURE) AS MEASURE_TO_END_DATE
from table_a a, table_b b
where a.COMPANY = b.COMPANY and
a.DATE < b.DATE and
a.DATE > b.DATE - 30
group by b.COMPANY;
Thanks for any help
Well, I can think of a few ways:
company
)... then filter on the 30-day windows after the merge.groupby()
.Option #1
Suppose your data looks like the following (I expanded your sample data):
print df
company date measure
0 0 2010-01-01 10
1 0 2010-01-15 10
2 0 2010-02-01 10
3 0 2010-02-15 10
4 0 2010-03-01 10
5 0 2010-03-15 10
6 0 2010-04-01 10
7 1 2010-03-01 5
8 1 2010-03-15 5
9 1 2010-04-01 5
10 1 2010-04-15 5
11 1 2010-05-01 5
12 1 2010-05-15 5
print windows
company end_date
0 0 2010-02-01
1 0 2010-03-15
2 1 2010-04-01
3 1 2010-05-15
Create a beginning date for the 30 day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print windows
company end_date beg_date
0 0 2010-02-01 2010-01-02
1 0 2010-03-15 2010-02-13
2 1 2010-04-01 2010-03-02
3 1 2010-05-15 2010-04-15
Now do a merge and then select based on if date
falls within beg_date
and end_date
:
df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
company date measure end_date beg_date
2 0 2010-01-15 10 2010-02-01 2010-01-02
4 0 2010-02-01 10 2010-02-01 2010-01-02
7 0 2010-02-15 10 2010-03-15 2010-02-13
9 0 2010-03-01 10 2010-03-15 2010-02-13
11 0 2010-03-15 10 2010-03-15 2010-02-13
16 1 2010-03-15 5 2010-04-01 2010-03-02
18 1 2010-04-01 5 2010-04-01 2010-03-02
21 1 2010-04-15 5 2010-05-15 2010-04-15
23 1 2010-05-01 5 2010-05-15 2010-04-15
25 1 2010-05-15 5 2010-05-15 2010-04-15
You can compute the 30 day window sums by grouping on company
and end_date
:
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Option #2 Move all merging into a groupby. This should be better on memory but I would think much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
def cond_merge(g,windows):
g = g.merge(windows,on='company',how='left')
g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
return g.groupby('end_date')['measure'].sum()
print df.groupby('company').apply(cond_merge,windows)
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another option Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up a dataframe but is pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print df
company date measure end_date
0 0 2010-01-01 10 NaT
1 0 2010-01-15 10 NaT
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 NaT
4 0 2010-03-01 10 NaT
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 NaT
8 1 2010-03-15 5 NaT
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 NaT
11 1 2010-05-01 5 NaT
12 1 2010-05-15 5 2010-05-15
This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
company date measure end_date
0 0 2010-01-01 10 2010-02-01
1 0 2010-01-15 10 2010-02-01
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 2010-03-15
4 0 2010-03-01 10 2010-03-15
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 2010-04-01
8 1 2010-03-15 5 2010-04-01
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 2010-05-15
11 1 2010-05-01 5 2010-05-15
12 1 2010-05-15 5 2010-05-15
df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print df
company date measure end_date beg_date
0 0 2010-01-01 10 2010-02-01 2010-01-02
1 0 2010-01-15 10 2010-02-01 2010-01-02
2 0 2010-02-01 10 2010-02-01 2010-01-02
3 0 2010-02-15 10 2010-03-15 2010-02-13
4 0 2010-03-01 10 2010-03-15 2010-02-13
5 0 2010-03-15 10 2010-03-15 2010-02-13
7 1 2010-03-01 5 2010-04-01 2010-03-02
8 1 2010-03-15 5 2010-04-01 2010-03-02
9 1 2010-04-01 5 2010-04-01 2010-03-02
10 1 2010-04-15 5 2010-05-15 2010-04-15
11 1 2010-05-01 5 2010-05-15 2010-04-15
12 1 2010-05-15 5 2010-05-15 2010-04-15
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.