Difference(s) between merge() and concat() in pandas

WindChimes picture WindChimes · Jul 8, 2016 · Viewed 65.3k times · Source

What's the essential difference(s) between pd.DataFrame.merge() and pd.concat()?

So far, this is what I found, please comment on how complete and accurate my understanding is:

  • .merge() can only use columns (plus row-indices) and it is semantically suitable for database-style operations. .concat() can be used with either axis, using only indices, and gives the option for adding a hierarchical index.

  • Incidentally, this allows for the following redundancy: both can combine two dataframes using the rows indices.

  • pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()

(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )

Answer

Abhishek Sawant picture Abhishek Sawant · Mar 11, 2017

A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

For example:

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df1:
   Key  data1
0   b   0
1   b   1
2   a   2
3   c   3
4   a   4
5   a   5
6   b   6

df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

df2:
    Key data2
0   a   0
1   b   1
2   d   2

#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is 
# a common column in 2 dataframes

pd.merge(df1, df2)

   Key data1 data2
0   b    0    1
1   b    1    1
2   b    6    1
3   a    2    0
4   a    4    0
5   a    5    0

#Concat
# df2 dataframe is appended at the bottom of df1 

pd.concat([df1, df2])

   Key data1 data2
0   b   0     NaN
1   b   1     NaN
2   a   2     NaN
3   c   3     NaN
4   a   4     NaN
5   a   5     NaN
6   b   6     NaN
0   a   Nan   0
1   b   Nan   1
2   d   Nan   2