Hello trying to merge two data frames and sum visit counts by date and upc.
I’ve tried this
df3 = pandas.merge(df1,df2, on = ['upc','date'], how = 'left')
Result: merge executes but it does not sum up by date or upc
I also tried
df3 = pandas.merge(df1,df2, left_on = ['date'], right_on ['upc'] how = 'left')
and that didn’t work.
df3 = pandas.merge(df1,df2, left_on = ['date','upc'], right_on ['date','upc'] how = 'left')
and that didn’t work.
I also tried
df3 = pandas.merge(df1,df2, on = ['date'], how = 'left')
and I kept returning an error message. Based on the error message it looked like I needed to convert one of the dates in the data frames to pandas dtype.
I made that change and returned the same results as my first try. The merge worked but it did not sum up the results. I tried converting both dates in both data frames to astype(str)
and that didn’t work. I learned if both dates have the same date format dtype or astype(str)
I return a memory error message. Merge would fail.
I was successful with merging using the upc only but this creates an issue in my data, I return duplicate visit numbers because a upc is repeated in the transaction due to the date column.
End of the day what I need is something similar to a sumif function in excel.
I need to combine the two data sets by summarizing the total visits by each upc for each day and keeping transaction data unchanged or left join in terms of sql
Sample data
df1
Date upc sales
0 09/01/2016 A01234 1000
1 09/02/2016 A01234 500
2 09/10/2016 A56789 1200
df2
Date upc visits
0 09/01/2016 A01234 10
1 09/02/2016 A01234 25
2 09/05/2016 A56789 26
3 09/10/2016 A56789 32
df3
Date upc sales visits
0 09/01/2016 A01234 1000 10
1 09/02/2016 A01234 500 25
2 09/10/2016 A56789 1200 32
Wondering if pandasql package is what I need to use. Any help is appreciated
The first merge statement you perform should get you halfway there, but unfortunately, it's the second half of a two-step process. It sounds like you want to merge the sales data onto the visits data after summing the visits by Date/upc. You actually have to do a sum first (the merge
command does not do this by itself). Try:
df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()
Then left-merge this onto the sales data:
df3 = pd.merge(df1, df2_sum, on=["Date", "upc"], how="left")