The 4 Types of DataFrame Joins Offered by Python’s pandas Library

As a data analyst, you will often face the need to combine multiple datasets. You’ll need to do this to complete your analysis and arrive at a conclusion for your business/stakeholders.
It’s often challenging to represent data when it’s stored in different tables. In such circumstances, joins prove their worth, irrespective of the programming language you are working on.
Python joins are like SQL joins: they combine data sets by matching their rows on a common index.
Create Two DataFrames for Reference
To follow the examples in this guide, you can create two sample DataFrames. Use the following code to create the first DataFrame, which contains an ID, first name, and last name.
import pandas as pda = pd.DataFrame({"ID": ["001", "002", "003", "004", "005"],
"Fname": ["Ron", "John", "Helen", "Jenny", "Kenny"],
"Lname": ["Keith", "Harley", "Smith", "Kerr-Hislop", "Barber"]})
print(a)
For the first step, import the pandas library. You can then use a variable, a, to store the result from the DataFrame constructor. Pass the constructor a dictionary containing your required values.
Finally, display the contents of the DataFrame value with the print function, to check everything looks as you’d expect.
Similarly, you can create another DataFrame, b, which contains an ID and salary values.
b = pd.DataFrame({"ID": ["001", "002", "003", "004", "005"],
"Salary": [100000, 700000, 80000, 904750, 604772]})print(b)
You can check the output in a console or an IDE. It should confirm the contents of your DataFrames:
How Are Joins Different From the Merge Function in Python?
The pandas library is one of the main libraries you can use to manipulate DataFrames. Since DataFrames contain multiple data sets, various functions are available in Python to join them.
Python offers the join and merge functions, amongst many others, which you can use to combine DataFrames. There is a stark difference between these two functions, which you must keep in mind before using either.
The join function joins two DataFrames based on their index values. The merge function combines DataFrames based on the index values and the columns.
What Do You Need to Know About Joins in Python?
Before discussing the types of joins available, here are some important things to note:
- SQL joins are one of the most basic functions and are quite similar to Python’s joins.
- To join DataFrames, you can use the pandas.DataFrame.join() method.
- The default join performs a left join, whereas the merge function performs an inner join.
The default syntax for a Python join is as follows:
DataFrame.join(other, on=None, how='left/right/inner/outer', lsuffix='', rsuffix='',
sort=False)
Invoke the join method on the first DataFrame and pass the second DataFrame as its first parameter, other. The remaining arguments are:
- on, which names an index to join on, if there is more than one.
- how, which defines the join type, including inner, outer, left, and right.
- lsuffix, which defines the left suffix string of your column name.
- rsuffix, which defines the right suffix string of your column name.
- sort, which is a boolean indicating whether to sort the resulting DataFrame.
Learn to Use the Various Types of Joins in Python
Python has a few join options, which you can exercise, depending on the need of the hour. Here are the join types:
1. Left Join
The left join keeps the first DataFrame’s values intact while bringing in the matching values from the second one. For example, if you want to bring in the matching values from b, you can define it as follows:
c = a.join(b, how="left", lsuffix = "_left", rsuffix = "_right", sort = True)
print(c)
When the query executes, the output contains the following column references:
- ID_left
- Fname
- Lname
- ID_right
- Salary
This join pulls the first three columns from the first DataFrame, and the last two columns from the second DataFrame. It has used the lsuffix and rsuffix values to rename the ID columns from both datasets, ensuring the resulting field names are unique.
The output is as follows:
2. Right Join
The right join keeps the second DataFrame’s values intact, while bringing in the matching values from the first table. For example, if you want to bring in the matching values from a, you can define it as follows:
c = b.join(a, how="right", lsuffix = "_right", rsuffix = "_left", sort = True)
print(c)
The output is as follows:
If you review the code, there are a few evident changes. For example, the result includes the second DataFrame’s columns before those from the first DataFrame.
You should use a value of right for the how argument to specify a right join. Also, note how you can switch the lsuffix and rsuffix values to reflect the nature of the right join.
In your regular joins, you might find yourself using left, inner and outer joins more frequently, as compared to the right join. However, the usage depends entirely on your data requirements.
3. Inner Join
An inner join delivers the matching entries from both the DataFrames. Since joins use the index numbers to match rows, an inner join only returns rows that match. For this illustration, let’s use the following two DataFrames:
a = pd.DataFrame({"ID": ["001", "002", "003", "004", "005", "006", "007"],
"Fname": ["Ron", "John", "Helen", "Jenny", "Kenny", "Daryl", "Cathy"],
"Lname": ["Keith", "Harley", "Smith", "Kerr-Hislop", "Barber", "Hooper", "Hope"]})
b = pd.DataFrame({"ID": ["001", "002", "003", "004", "005"],
"Salary": [100000, 700000, 80000, 904750, 604772]})print(a)
print(b)
The output is as follows:
You can use an inner join, as follows:
c = a.join(b, lsuffix="_left", rsuffix="_right", how='inner')
print(c)
The resulting output contains only rows that exist in both input DataFrames:
4. Outer Join
An outer join returns all the values from both the DataFrames. For rows with no matching values, it produces a null value on the individual cells.
Using the same DataFrame as above, here’s the code for outer join:
c = a.join(b, lsuffix="_left", rsuffix="_right", how='outer')
print(c)
Using Joins in Python
Joins, like their counterpart functions, merge and concat, offer way more than a simple join functionality. Given its series of options and functions, you can choose the options that meet your requirements.
You can sort the resulting datasets relatively easily, with or without the join function, with the flexible options that Python offers.