Spark specify multiple column conditions for dataframe join

user568109 picture user568109 · Jul 6, 2015 · Viewed 106.9k times · Source

How to give more column conditions when joining two dataframes. For example I want to run the following :

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I want to join only when these columns match. But above syntax is not valid as cols only takes one string. So how do I get what I want.

Answer

rchukh picture rchukh · Jul 6, 2015

There is a Spark column/expression API join for such case:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)

The <=> operator in the example means "Equality test that is safe for null values".

The main difference with simple Equality test (===) is that the first one is safe to use in case one of the columns may have null values.