Pandas merge on two columns using date and another column

Hobes picture Hobes · Dec 30, 2017 · Viewed 12.8k times · Source

Hello trying to merge two data frames and sum visit counts by date and upc.

  1. Transaction data (date,upc,sales) 200k rows x 3 columns
  2. Visits counts(date, upc, visit count) 2 million+ rows x 3 columns

I’ve tried this

df3 = pandas.merge(df1,df2, on = ['upc','date'], how = 'left') 

Result: merge executes but it does not sum up by date or upc

I also tried

df3 = pandas.merge(df1,df2, left_on = ['date'], right_on ['upc'] how = 'left')

and that didn’t work.

df3 = pandas.merge(df1,df2, left_on = ['date','upc'], right_on ['date','upc'] how = 'left')

and that didn’t work.

I also tried

df3 = pandas.merge(df1,df2, on = ['date'], how = 'left')

and I kept returning an error message. Based on the error message it looked like I needed to convert one of the dates in the data frames to pandas dtype.

I made that change and returned the same results as my first try. The merge worked but it did not sum up the results. I tried converting both dates in both data frames to astype(str) and that didn’t work. I learned if both dates have the same date format dtype or astype(str) I return a memory error message. Merge would fail.

I was successful with merging using the upc only but this creates an issue in my data, I return duplicate visit numbers because a upc is repeated in the transaction due to the date column.

End of the day what I need is something similar to a sumif function in excel.

I need to combine the two data sets by summarizing the total visits by each upc for each day and keeping transaction data unchanged or left join in terms of sql

Sample data

df1
  Date         upc       sales
0 09/01/2016   A01234    1000
1 09/02/2016   A01234    500
2 09/10/2016   A56789    1200

df2
  Date         upc         visits
0 09/01/2016   A01234      10
1 09/02/2016   A01234      25
2 09/05/2016   A56789      26
3 09/10/2016   A56789      32


df3
  Date         upc       sales   visits
0 09/01/2016   A01234    1000    10
1 09/02/2016   A01234    500     25
2 09/10/2016   A56789    1200    32

Wondering if pandasql package is what I need to use. Any help is appreciated

Answer

PaSTE picture PaSTE · Dec 30, 2017

The first merge statement you perform should get you halfway there, but unfortunately, it's the second half of a two-step process. It sounds like you want to merge the sales data onto the visits data after summing the visits by Date/upc. You actually have to do a sum first (the merge command does not do this by itself). Try:

df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()

Then left-merge this onto the sales data:

df3 = pd.merge(df1, df2_sum, on=["Date", "upc"], how="left")