pandas Merge, join, and concatenate What is the difference between join and merge

Help us to keep this website almost Ad Free! It takes only 10 seconds of your time:
> Step 1: Go view our video on YouTube: EF Core Bulk Extensions
> Step 2: And Like the video. BONUS: You can also share it!

Example

Consider the dataframes left and right

left = pd.DataFrame([['a', 1], ['b', 2]], list('XY'), list('AB'))
left

   A  B
X  a  1
Y  b  2

right = pd.DataFrame([['a', 3], ['b', 4]], list('XY'), list('AC'))
right

   A  C
X  a  3
Y  b  4

join
Think of join as wanting to combine to dataframes based on their respective indexes. If there are overlapping columns, join will want you to add a suffix to the overlapping column name from left dataframe. Our two dataframes do have an overlapping column name A.

left.join(right, lsuffix='_')

  A_  B  A  C
X  a  1  a  3
Y  b  2  b  4

Notice the index is preserved and we have 4 columns. 2 columns from left and 2 from right.

If the indexes did not align

left.join(right.reset_index(), lsuffix='_', how='outer')

    A_    B index    A    C
0  NaN  NaN     X    a  3.0
1  NaN  NaN     Y    b  4.0
X    a  1.0   NaN  NaN  NaN
Y    b  2.0   NaN  NaN  NaN

I used an outer join to better illustrate the point. If the indexes do not align, the result will be the union of the indexes.

We can tell join to use a specific column in the left dataframe to use as the join key, but it will still use the index from the right.

left.reset_index().join(right, on='index', lsuffix='_')

  index A_  B  A  C
0     X  a  1  a  3
1     Y  b  2  b  4

merge
Think of merge as aligning on columns. By default merge will look for overlapping columns in which to merge on. merge gives better control over merge keys by allowing the user to specify a subset of the overlapping columns to use with parameter on, or to separately allow the specification of which columns on the left and which columns on the right to merge by.

merge will return a combined dataframe in which the index will be destroyed.

This simple example finds the overlapping column to be 'A' and combines based on it.

left.merge(right)

   A  B  C
0  a  1  3
1  b  2  4

Note the index is [0, 1] and no longer ['X', 'Y']

You can explicitly specify that you are merging on the index with the left_index or right_index paramter

left.merge(right, left_index=True, right_index=True, suffixes=['_', ''])

  A_  B  A  C
X  a  1  a  3
Y  b  2  b  4

And this looks exactly like the join example above.



Got any pandas Question?