Can one perform a left join in pandas that selects only the first match on the right?

Quant picture Quant · Oct 8, 2014 · Viewed 10.7k times · Source

Can one perform a left join in pandas that selects only the first match on the right? Example:

left            = pd.DataFrame()
left['age']     = [11, 12]
right           = pd.DataFrame()
right['age']    = [10, 11, 11]
right['salary'] = [ 100, 150, 200 ]
left.merge( right, how='left', on='age' )

Returns

   age  salary
0   11     150
1   11     200
2   12     NaN

But what I would like is to preserve the number of rows of left, by merely taking the first match. That is:

   age  salary
0   11     150
2   12     NaN

So I've been using

left.merge( right.drop_duplicates(['age']), how='left', on='age')

but I believe this makes a full copy of right. And it smells funny.

Is there a more elegant way?

Answer

samus picture samus · Oct 9, 2017

Yes, you can use groupby to remove your duplicate lines. Do everything you've done to define left and right. Now, I define a new dataframe on your last line:

left2=left.merge( right, how='left', on='age' )
df= left2.groupby(['age'])['salary'].first().reset_index()
df

At first I used a .min(), which will give you the minimum salary at each age, as such:

df= left2.groupby(['age'])['salary'].min().reset_index()

But you were specifically asking about the first match. To do so you use the .first() option. Note: The .reset_index() at the end, just reformats the output of the groupby to be a dataframe again.