spark.sql.crossJoin.enabled for Spark 2.x

Stijn · Aug 17, 2016 · Viewed 16k times

I am using the 'preview' Google DataProc Image 1.1 with Spark 2.0.0. One of my operations requires a Cartesian product. Since version 2.0.0 there is a configuration parameter (spark.sql.crossJoin.enabled) which, when false, prohibits Cartesian products and causes an Exception to be thrown. How can I set spark.sql.crossJoin.enabled=true, preferably by using an initialization action?

Answer

zero323 · Aug 17, 2016

Spark >= 3.0

spark.sql.crossJoin.enabled is true by default (SPARK-28621).

Spark >= 2.1

You can use crossJoin:

df1.crossJoin(df2)
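As a minimal sketch of this (object name, app name, and the two tiny example DataFrames are assumptions for illustration, not from the question):

```scala
import org.apache.spark.sql.SparkSession

object CrossJoinSketch {
  def main(args: Array[String]): Unit = {
    // Local session just for demonstration purposes.
    val spark = SparkSession.builder()
      .appName("crossJoin-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq("a", "b").toDF("letter")
    val df2 = Seq(1, 2).toDF("number")

    // crossJoin (available since Spark 2.1) needs no configuration
    // change: 2 rows x 2 rows = 4 rows.
    df1.crossJoin(df2).show()

    spark.stop()
  }
}
```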

It makes your intention explicit and keeps the more conservative configuration in place, protecting you from unintended cross joins.

Spark 2.0

SQL properties can be set dynamically at runtime with the RuntimeConfig.set method, so you should be able to call

spark.conf.set("spark.sql.crossJoin.enabled", true)

whenever you want to explicitly allow a Cartesian product.
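Put together, a Spark 2.0 sketch might look like this (object name and example DataFrames are assumptions; without the flag set, the unconditioned join below throws an AnalysisException):

```scala
import org.apache.spark.sql.SparkSession

object CrossJoinViaConf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("crossJoin-conf-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Allow Cartesian products at runtime (Spark 2.0).
    spark.conf.set("spark.sql.crossJoin.enabled", true)

    val df1 = Seq("a", "b").toDF("letter")
    val df2 = Seq(1, 2).toDF("number")

    // With the flag enabled, a join without a condition is allowed
    // and produces the Cartesian product (4 rows here).
    df1.join(df2).show()

    spark.stop()
  }
}
```

As for the initialization-action part of the question: on Dataproc this kind of setting is typically baked in at cluster creation time with cluster properties rather than an initialization action, e.g. `gcloud dataproc clusters create ... --properties 'spark:spark.sql.crossJoin.enabled=true'`, where the `spark:` prefix routes the value into the cluster's spark-defaults.conf.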