Consider the dataframes left and right
left = pd.DataFrame([['a', 1], ['b', 2]], list('XY'), list('AB'))
left
A B
X a 1
Y b 2
right = pd.DataFrame([['a', 3], ['b', 4]], list('XY'), list('AC'))
right
A C
X a 3
Y b 4
join
Think of join as wanting to combine to dataframes based on their respective indexes. If there are overlapping columns, join will want you to add a suffix to the overlapping column name from left dataframe. Our two dataframes do have an overlapping column name A.
left.join(right, lsuffix='_')
A_ B A C
X a 1 a 3
Y b 2 b 4
Notice the index is preserved and we have 4 columns. 2 columns from left and 2 from right.
If the indexes did not align
left.join(right.reset_index(), lsuffix='_', how='outer')
A_ B index A C
0 NaN NaN X a 3.0
1 NaN NaN Y b 4.0
X a 1.0 NaN NaN NaN
Y b 2.0 NaN NaN NaN
I used an outer join to better illustrate the point. If the indexes do not align, the result will be the union of the indexes.
We can tell join to use a specific column in the left dataframe to use as the join key, but it will still use the index from the right.
left.reset_index().join(right, on='index', lsuffix='_')
index A_ B A C
0 X a 1 a 3
1 Y b 2 b 4
merge
Think of merge as aligning on columns. By default merge will look for overlapping columns in which to merge on. merge gives better control over merge keys by allowing the user to specify a subset of the overlapping columns to use with parameter on, or to separately allow the specification of which columns on the left and which columns on the right to merge by.
merge will return a combined dataframe in which the index will be destroyed.
This simple example finds the overlapping column to be 'A' and combines based on it.
left.merge(right)
A B C
0 a 1 3
1 b 2 4
Note the index is [0, 1] and no longer ['X', 'Y']
You can explicitly specify that you are merging on the index with the left_index or right_index paramter
left.merge(right, left_index=True, right_index=True, suffixes=['_', ''])
A_ B A C
X a 1 a 3
Y b 2 b 4
And this looks exactly like the join example above.