Pandas Merging 101

  • A+
Category:Languages
  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?
  • How do I add NaNs for missing rows after merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • Cross join with pandas?
  • How do I merge multiple DataFrames?
  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This QnA is meant to be the next instalment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.

 


This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
  • Merging under different conditions - one-to-one/one-to-many/many-to-many
  • Cartesian Product (equivalent to SQL's CROSS JOIN)
  • Notable alternatives to merge and join

What this post will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to FULL INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Enough Talk, just show me how to use merge!

Setup

np.random.seed(0) left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})     right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})  left    key     value 0   A  1.494079 1   B -0.205158 2   C  0.313068 3   D -0.854096  right    key     value 0   B -2.552990 1   D  0.653619 2   E  0.864436 3   F -0.742165 

For the sake of simplicity, the key column has the same name (for now).

To perform an FULL INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.

pd.merge(left, right, on='key')    key   value_x   value_y 0   B  0.400157  1.867558 1   D  2.240893 -0.977278 

In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.

left.merge(right, on='key') # Or, if you want to be explicit # left.merge(right, on='key', how='inner')    key   value_x   value_y 0   B  0.400157  1.867558 1   D  2.240893 -0.977278 

A LEFT OUTER JOIN would be done by specifying how='left'.

left.merge(right, on='key', how='left')    key   value_x   value_y 0   A  1.764052       NaN 1   B  0.400157  1.867558 2   C  0.978738       NaN 3   D  2.240893 -0.977278 

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, specify how='right'...

left.merge(right, on='key', how='right')    key   value_x   value_y 0   B  0.400157  1.867558 1   D  2.240893 -0.977278 2   E       NaN  0.950088 3   F       NaN -0.151357 

...for a RIGHT OUTER JOIN. Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, specify how='outer'...

left.merge(right, on='key', how='outer')    key   value_x   value_y 0   A  1.764052       NaN 1   B  0.400157  1.867558 2   C  0.978738       NaN 3   D  2.240893 -0.977278 4   E       NaN  0.950088 5   F       NaN -0.151357 

...for a FULL OUTER JOIN. This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

Pandas Merging 101

Here's a few more specifics to keep in mind:

  1. If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

    left.merge(right, left_on='keyLeft', right_on='keyRight', ...) 
  2. To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

    left.merge(right, on=['key1', 'key2'] ...) 
  3. Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.

  4. pd.merge_ordered is a useful function for ordered JOINs.

  5. pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.


Index-based *-JOIN

left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])     right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F']) left.index.name = right.index.name = 'idxkey'  left            value idxkey           A       2.269755 B      -1.454366 C       0.045759 D      -0.187184  right            value idxkey           B       1.532779 D       1.469359 E       0.154947 F       0.378163 

Typically, a merge on index would look like this:

left.merge(right, left_index=True, right_index=True)           value_x   value_y idxkey                     B       0.410599  0.761038 D       1.454274  0.121675 

Support for index names
If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')           value_x   value_y idxkey                     B       0.410599  0.761038 D       1.454274  0.121675 

DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')           value_x   value_y idxkey                     B       0.410599  0.761038 D       1.454274  0.121675 

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right) ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object') 

Since the column names are the same. This would not be a problem if they were differently named.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')          leftvalue     value idxkey                      B       -1.454366  1.532779 D       -0.187184  1.469359 

pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')             value     value idxkey                     B      -1.980796  1.230291 D       0.156349  1.202380 

Omit join='inner' if you need a FULL OUTER JOIN (the default):

pd.concat([left, right], axis=1, sort=False)        value     value A -0.887786       NaN B -1.980796  1.230291 C -0.347912       NaN D  0.156349  1.202380 E       NaN -0.387327 F       NaN -0.302303 

For more information, see this canonical post on pd.concat by @piRSquared.


CROSS JOIN with merge (+ honourable mentions)

While not recommended, merge can also perform a cross product

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]}) right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})  left    col1  col2 0    A     1 1    B     2 2    C     3  right    col1  col2 0    X    20 1    Y    30 2    Z    50  left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1)    col1_x  col2_x col1_y  col2_y 0      A       1      X      20 1      A       1      Y      30 2      A       1      Z      50 3      B       2      X      20 4      B       2      Y      30 5      B       2      Z      50 6      C       3      X      20 7      C       3      Y      30 8      C       3      Z      50 

How this works, is that both DataFrames are assigned a temporary "key" column with the same value. merge then performs a many-to-many JOIN on the key.

Towards Better Performance
While the many-to-many JOIN trick works for reasonably sized DataFrames, you shouldn't expect much in terms of performance for large data. For faster alternatives, you may want to look at a few NumPy implementations of the cartesian product. My favourite is @senderle's first implementation.

def cartesian_product(*arrays):     la = len(arrays)     dtype = np.result_type(*arrays)     arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)     for i, a in enumerate(np.ix_(*arrays)):         arr[...,i] = a     return arr.reshape(-1, la)   

idx = cartesian_product(np.ogrid[:len(left)], np.ogrid[:len(right)]) pd.DataFrame(     np.column_stack([left.iloc[idx[:,0]], right.iloc[idx[:,1]]]))     0  1  2   3 0  A  1  X  20 1  A  1  Y  30 2  A  1  Z  50 3  B  2  X  20 4  B  2  Y  30 5  B  2  Z  50 6  C  3  X  20 7  C  3  Y  30 8  C  3  Z  50 

The idea here is to take the cross product of the numeric indices, reindex the frames and concat the result (this is done with np.column_stack to avoid index alignment issues).

This is already 5x faster on my machine for just these tiny frames. This isn't even the fastest implementation of cross product either.


Generalizing: mergeing multiple DataFrames

Setup

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})     B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)}) C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)}) dfs = [A, B, C]    

Oftentimes, the situation arises when multiple DataFrames are to be merged together. One simple trick often used is the functools.reduce function, and you can achieve a FULL INNER JOIN like so:

from functools import reduce reduce(pd.merge, dfs)    key    valueA    valueB  valueC 0   D  0.465662  1.488252     1.0 

Note that every column besides the "key" column should be differently named for this to work out-of-box. Otherwise, you may need to use a lambda.

For a FULL OUTER JOIN, you can curry pd.merge using functools.partial:

from functools import partial outer_merge = partial(pd.merge, how='outer') reduce(outer_merge, dfs)    key    valueA    valueB  valueC 0   A  0.056165       NaN     NaN 1   B -1.165150 -1.536244     NaN 2   C  0.900826       NaN     1.0 3   D  0.465662  1.488252     1.0 4   E       NaN  1.895889     1.0 5   F       NaN  1.178780     NaN 6   J       NaN       NaN     1.0 

Sometimes it also makes sense to use pd.concat, if you're not comfortable with a two-at-a-time merge.

pd.concat([     df.set_index('key') for df in dfs], axis=1, join='inner' ).reset_index()    key    valueA    valueB  valueC 0   D  0.465662  1.488252     1.0 

As always, omit join='inner' for a FULL OUTER JOIN.

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: