URL: https://www.geeksforgeeks.org/python/how-to-join-on-multiple-columns-in-pyspark/

How to join on multiple columns in PySpark?

Last Updated : 19 Dec, 2021

In this article, we will discuss how to join two PySpark DataFrames on multiple columns using Python.

Let's create the first dataframe:


Let's create the second dataframe:


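We can join two dataframes on multiple columns by passing a join condition to the join() function, combining the individual column comparisons with a conditional operator such as & (and) or | (or). Each comparison must be wrapped in parentheses.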

Syntax: dataframe.join(dataframe1, (dataframe.column1 == dataframe1.column1) & (dataframe.column2 == dataframe1.column2))

where:

  • dataframe is the first dataframe
  • dataframe1 is the second dataframe
  • column1 is the first matching column in both the dataframes
  • column2 is the second matching column in both the dataframes

Example 1: PySpark code to join the two dataframes on multiple columns (id and name)


Example 2: Join using the or operator (|)
