VOOZH about

URL: https://towardsdatascience.com/pandas-join-vs-merge-c365fd4fbf49/

⇱ Pandas Join vs. Merge | Towards Data Science


Pandas Join vs. Merge

What Do They Do And When Should We Use Each One?

8 min read
👁 Photo by Roman Kraft on Unsplash
Photo by Roman Kraft on Unsplash

I write a lot about statistics and algorithms, but getting your data ready for modeling is a huge part of data science as well. In fact, it’s highly likely that you will spend significantly more time staring at your data, checking it, and fixing its holes than on training and tweaking your models.

So the better we get at collecting, cleaning, and performing quick "sanity check" analyses on data, the more time we can spend on modeling (which most folks find more entertaining). To that end, let’s go over how we can quickly combine data from different dataframes and get it ready for analysis.


The Data

Let’s pretend that we’re analysts for a company that manufactures and sells paper clips. We need to run some reports on our firm’s sales department to see how they are doing and are given the data in the following dictionaries:

import numpy as np
import pandas as pd
# Dataframe of number of sales made by an employee
sales = {'Tony': 103,
 'Sally': 202,
 'Randy': 380,
 'Ellen': 101,
 'Fred': 82
 }
# Dataframe of all employees and the region they work in
region = {'Tony': 'West',
 'Sally': 'South',
 'Carl': 'West',
 'Archie': 'North',
 'Randy': 'East',
 'Ellen': 'South',
 'Fred': np.nan,
 'Mo': 'East',
 'HanWei': np.nan,
 }

We can create two separate dataframes from the dictionaries like so:

# Make dataframes
sales_df = pd.DataFrame.from_dict(sales, orient='index', 
 columns=['sales'])
region_df = pd.DataFrame.from_dict(region, orient='index', 
 columns=['region'])

The dataframe, sales_df, now looks like this:

sales
Tony 103
Sally 202
Randy 380
Ellen 101
Fred 82

And region_df looks like this:

 region
Tony West
Sally South
Carl West
Archie North
Randy East
Ellen South
Fred NaN
Mo East
HanWei NaN

Should I Merge, Join, Or Concatenate?

Now let’s combine all of our data into a single dataframe. But how do we do that?

Pandas dataframes have a lot of SQL like functionality. In fact I much prefer them to SQL tables (data analysts around the world are staring daggers at me). But when I first started doing a lot of SQL-like stuff with Pandas, I found myself perpetually unsure whether to use join or merge, and often I just used them interchangeably (picking whichever came to mind first). So when should we be using each of these methods, and how exactly are they different from each other? Well, it’s time to be confused no more!


Join

(If you are unfamiliar with what it means to join tables, I wrote this post about it, and I highly recommend that you read it first)

Let’s start with join because it’s the simplest one. Dataframes have this thing called an index. It’s the key to your table and if we know the index, then we can easily grab the row that holds our data using .loc. If you print your dataframe, you can see what the index is by looking at the leftmost column, or we can be more direct and just use .index:

In: sales_df.index
Out: Index(['Tony', 'Sally', 'Randy', 'Ellen', 'Fred'], 
 dtype='object')

So the index of sales_df is the name of our salespeople. By the way, unlike the primary key of a SQL table, a dataframe’s index does not have to be unique. But a unique index makes our lives easier and the time it takes to search our dataframe shorter, so it’s definitely a nice to have. Given an index, we can find the row data like so:

In: sales_df.loc['Tony']
Out: sales 103
 Name: Tony, dtype: int64

OK, back to join. The join method takes two dataframes and joins them on their indexes (technically, you can pick the column to join on for the left dataframe). Let’s see what happens when we combine our two dataframes together via the join method:

In: joined_df = region_df.join(sales_df, how='left')
 print(joined_df)
Out: 
 region sales
Tony West 103.0
Sally South 202.0
Carl West NaN
Archie North NaN
Randy East 380.0
Ellen South 101.0
Fred NaN 82.0
Mo East NaN
HanWei NaN NaN

The result looks like the output of a SQL join, which it more or less is. The join method uses the index or a specified column from the dataframe that it’s called on, a.k.a. the left dataframe, as the join key. So the column that we match on for the left dataframe doesn’t have to be its index. But for the right dataframe, the join key must be its index. I personally find it easier to think of the join method as joining based on the index, and to use merge (coming up) if I don’t want to join on the indexes.

In the combined dataframe there were some NaNs. That’s because not all of the employees had sales. The ones that did not have sales are not present in sales_df, but we still display them because we executed a left join (by specifying "how=left"), which returns all the rows from the left dataframe, region_df, regardless of whether there is a match. If we do not want to display any NaNs in our join result, we would do an inner join instead (by specifying "how=inner").


Merge

At a basic level, merge more or less does the same thing as join. Both methods are used to combine two dataframes together, but merge is more versatile at the cost of requiring more detailed inputs. Let’s take a look at how we can create the same combined dataframe with merge as we did with join:

In: joined_df_merge = region_df.merge(sales_df, how='left', 
 left_index=True,
 right_index=True)
 print(joined_df_merge)
Out:
 region sales
Tony West 103.0
Sally South 202.0
Carl West NaN
Archie North NaN
Randy East 380.0
Ellen South 101.0
Fred NaN 82.0
Mo East NaN
HanWei NaN NaN

Not that different from when we used join. But merge allows us to specify what columns to join on for both the left and right dataframes. Here by setting "left_index" and "right_index" equal to True, we let merge know that we want to join on the indexes. And we get the same combined dataframe as we obtained before when we used join.

Merge is useful when we don’t want to join on the index. For example, let’s say we want to know, in percentage terms, how much each employee contributed to their region. We can use groupby to sum up all the sales within each unique region. In the code below, the reset_index is used to shift region from being the dataframe’s (grouped_df’s) index to being just a normal column – and yes, we could just keep it as the index and join on it, but I want to demonstrate how to use merge on columns.

In: grouped_df = joined_df_merge.groupby(by='region').sum()
 grouped_df.reset_index(inplace=True)
 print(grouped_df)
Out:
 region sales
0 East 380.0
1 North 0.0
2 South 303.0
3 West 103.0

Now let’s merge joined_df_merge with grouped_df using the region column. We have to specify a suffix because both of our dataframes (that we are merging) contain a column called sales. The suffixes input appends the specified strings to the labels of columns that have identical names in both dataframes. In our case, since the second dataframe’s sales column is actually sales for the entire region, we can append "_region" to its label to make clear.

In:
employee_contrib = joined_df_merge.merge(grouped_df, how='left', 
 left_on='region', 
 right_on='region',
 suffixes=('','_region'))
print(employee_contrib)

Out:
 region sales sales_region
0 West 103.0 103.0
1 South 202.0 303.0
2 West NaN 103.0
3 North NaN 0.0
4 East 380.0 380.0
5 South 101.0 303.0
6 NaN 82.0 NaN
7 East NaN 380.0
8 NaN NaN NaN

Oh no, our index disappeared! But we can use set_index to get it back (otherwise we won’t know which employee each row corresponds to):

In:
employee_contrib = employee_contrib.set_index(joined_df_merge.index)
print(employee_contrib)

Out:
 region sales sales_region
Tony West 103.0 103.0
Sally South 202.0 303.0
Carl West NaN 103.0
Archie North NaN 0.0
Randy East 380.0 380.0
Ellen South 101.0 303.0
Fred NaN 82.0 NaN
Mo East NaN 380.0
HanWei NaN NaN NaN

We now have our original sales column and a new column sales_region that tells us the total sales made in a region. Let’s calculate each employees percentage of sales and then clean up our dataframe by dropping observations that have no region (Fred and HanWei) and filling the NaNs in the sales column with zeros:n

In:
# Drop NAs in region column
employee_contrib = employee_contrib.dropna(subset=['region'])
# Fill NAs in sales column with 0
employee_contrib = employee_contrib.fillna({'sales': 0})
employee_contrib['%_of_sales'] = employee_contrib['sales']/employee_contrib['sales_region']
print(employee_contrib[['region','sales','%_of_sales']]
 .sort_values(by=['region','%_of_sales']))

Out:
 region sales %_of_sales
Mo East 0.0 0.000000
Randy East 380.0 1.000000
Archie North 0.0 NaN
Ellen South 101.0 0.333333
Sally South 202.0 0.666667
Carl West 0.0 0.000000
Tony West 103.0 1.000000

All done! Notice that the North region has no sales hence the NaN (can’t divide by zero).


Conclusion

Let’s do a quick review:

  • We can use join and merge to combine 2 dataframes.
  • The join method works best when we are joining dataframes on their indexes (though you can specify another column to join on for the left dataframe).
  • The merge method is more versatile and allows us to specify columns besides the index to join on for both dataframes. If the index gets reset to a counter post merge, we can use set_index to change it back.

Next time, we will check out how to add new data rows via Pandas’ concatenate function (and much more). Cheers!


If you liked this article and my writing in general, please consider supporting my writing by signing up for Medium via my referral link here. Thanks!


Written By

Tony Yiu

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles