What’s the essential difference(s) between pd.DataFrame.merge() and pd.concat()?

So far, this is what I found, please comment on how complete and accurate my understanding is:

  • .merge() can only use columns (plus row-indices) and it is semantically suitable for database-style operations. .concat() can be used with either axis, using only indices, and gives the option for adding a hierarchical index.

  • Incidentally, this allows for the following redundancy: both can combine two dataframes using the rows indices.

  • pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()

(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )

A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

For example:

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df1:
   Key  data1
0   b   0
1   b   1
2   a   2
3   c   3
4   a   4
5   a   5
6   b   6

df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

df2:
    Key data2
0   a   0
1   b   1
2   d   2

#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is 
# a common column in 2 dataframes

pd.merge(df1, df2)

   Key data1 data2
0   b    0    1
1   b    1    1
2   b    6    1
3   a    2    0
4   a    4    0
5   a    5    0

#Concat
# df2 dataframe is appended at the bottom of df1 

pd.concat([df1, df2])

   Key data1 data2
0   b   0     NaN
1   b   1     NaN
2   a   2     NaN
3   c   3     NaN
4   a   4     NaN
5   a   5     NaN
6   b   6     NaN
0   a   Nan   0
1   b   Nan   1
2   d   Nan   2

At a high level:

  • .concat() simply stacks multiple DataFrame together either
    vertically, or stitches horizontally after aligning on index
  • .merge() first aligns two DataFrame‘ selected common column(s) or
    index, and then pick up the remaining columns from the aligned rows of each DataFrame.

More specifically, .concat():

  • Is a top-level pandas function
  • Combines two or more pandas DataFrame vertically or horizontally
  • Aligns only on the index when combining horizontally
  • Errors when any of the DataFrame contains a duplicate index.
  • Defaults to outer join with the option for inner join

And .merge():

  • Exists both as a top-level pandas function and a DataFrame method (as of pandas 1.0)
  • Combines exactly two DataFrame horizontally
  • Aligns the calling DataFrame‘s column(s) or index with the other
    DataFrame‘s column(s) or index
  • Handles duplicate values on the joining columns or index by
    performing a cartesian product
  • Defaults to inner join with options for left, outer, and right

Note that when performing pd.merge(left, right), if left has two rows containing the same values from the joining columns or index, each row will combine with right‘s corresponding row(s) resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.

Practically speaking:

  • Consider .concat() first when combining homogeneous DataFrame, while
    consider .merge() first when combining complementary DataFrame.
  • If need to merge vertically, go with .concat(). If need to merge
    horizontally via columns, go with .merge(), which by default merge on the columns in common.

Reference: Pandas 1.x Cookbook

pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly as its argument. Also Dimensions of the DataFrame should match along axis while concatenating.

pd.merge can take DataFrames as its argument, and is used to combine two DataFrames with same columns or index, which can’t be done with pd.concat since it will show the repeated column in the DataFrame.

Whereas join can be used to join two DataFrames with different indices.

I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().

Nice question. The main difference:

pd.concat works on both axes.

The other difference, is pd.concat has innerdefault and outer joins only, while pd.DataFrame.merge() has left, right, outer, innerdefault joins.

Third notable other difference is: pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.


With pd.concat by default you are able to stack rows of multiple dataframes (axis=0) and when you set the axis=1 then you mimic the pd.DataFrame.merge() function.

Some useful examples of pd.concat:

df2=pd.concat([df]*2, ignore_index=True) #double the rows of a dataframe

df2=pd.concat([df, df.iloc[[0]]]) # add first row to the end

df3=pd.concat([df1,df2], join='inner', ignore_index=True) # concat two df's

The main difference between merge & concat is that merge allow you to perform more structured “join” of tables where use of concat is more broad and less structured.

Merge

Referring the documentation, pd.DataFrame.merge takes right as a required argument, which you can think it as joining left table and right table according to some pre-defined structured join operation. Note the definition for parameter right.

Required Parameters

  • right: DataFrame or named Series

Optional Parameters

  • how: {‘left’, ‘right’, ‘outer’, ‘inner’} default ‘inner’
  • on: label or list
  • left_on: label or list, or array-like
  • right_on: label or list, or array-like
  • left_index: bool, default False
  • right_index: bool, default False
  • sort: bool, default False
  • suffixes: tuple of (str, str), default (‘_x’, ‘_y’)
  • copy: bool, default True
  • indicator: bool or str, default False
  • validate: str, optional

Important: pd.DataFrame.merge requires right to be a pd.DataFrame or named pd.Series object.

Output

  • Returns: DataFrame

Furthermore, if we check the docstring for Merge Operation on pandas is below:

Perform a database (SQL) merge operation between two DataFrame or Series
objects using either columns as keys or their row indexes

Concat

Refer to documentation of pd.concat, first note that the parameter is not named any of table, data_frame, series, matrix, etc., but objs instead. That is, you can pass many “data containers”, which are defined as:

Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]

Required Parameters

  • objs: a sequence or mapping of Series or DataFrame objects

Optional Parameters

  • axis: {0/’index’, 1/’columns’}, default 0
  • join: {‘inner’, ‘outer’}, default ‘outer’
  • ignore_index: bool, default False
  • keys: sequence, default None
  • levels: list of sequences, default None
  • names: list, default None
  • verify_integrity: bool, default False
  • sort: bool, default False
  • copy: bool, default True

Output

  • Returns: object, type of objs

Example

Code

import pandas as pd

v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])

df_left = pd.DataFrame({
    "v1": v1,
    "v2": v2,
    "v3": v3
    })
df_right = pd.DataFrame({
    "v4": [5, 5, 5, 5],
    "v5": [3, 2, 1, 0]
    })


df_concat = pd.concat([v1, v2, v3])

# Performing operations on default

merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('='*20)
print(concat_result)

Code Output

   v1     v2  v3  v4  v5
0   1     10   0   5   3
1   5    100   1   5   2
2   9   1000   2   5   1
3  13  10000   3   5   0
====================
     v1       v2   v3   v4   v5
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0

You can achieve, however, the first output (merge) with concat by changing the axis parameter

concat_result = pd.concat([df_left, df_right], sort=False, axis=1)

Observe the following behavior,

concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)

outputs;

     v1       v2   v3   v4   v5
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0

, which you cannot perform a similar operation with merge, since it only allows a single DataFrame or named Series.

merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)

outputs;

TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed

Conclusion

As you may have notice already that input and outputs may be different between “merge” and “concat”.

As I mentioned at the beginning, the very first (main) difference is that “merge” performs a more structured join with a set of restricted set of objects and parameters where as “concat” performs a less strict/broader join with a broader set of objects and parameters.

All in all, merge is less tolerant to changes/(the input) and “concat” is looser/less sensitive to changes/(the input). You can achieve “merge” by using “concat”, but the reverse is not always true.

“Merge” operation uses Data Frame columns (or name of pd.Series object) or row indices, and since it uses those entities only it performs horizontal merge of Data Frames or Series, and does not apply vertical operation as a result.

If you want to see more, you can deep dive in the source code a bit;

Only concat function has axis parameter. Merge is used to combine dataframes side-by-side based on values in shared columns so there is no need for axis parameter.

by default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join

pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis

Join and pd.merge:
can take DataFrame arguments

Click to see picture for understanding why code below does the same thing

df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)