Partitioning by multiple columns in PySpark with columns in a list

Asked by prk · Mar 12, 2018 · Viewed 19.4k times

My question is similar to this thread: Partitioning by multiple columns in Spark SQL

but I'm working in PySpark rather than Scala, and I want to pass my list of columns in as a list. I want to do something like this:

from pyspark.sql.window import Window

column_list = ["col1", "col2"]
win_spec = Window.partitionBy(column_list)

I can get the following to work:

from pyspark.sql.functions import col

win_spec = Window.partitionBy(col("col1"))

This also works:

col_name = "col1"
win_spec = Window.partitionBy(col(col_name))

And this also works:

win_spec = Window.partitionBy([col("col1"), col("col2")])

Answer

Answered by Psidom · Mar 12, 2018

Convert the column names to column expressions with a list comprehension, [col(x) for x in column_list]:

from pyspark.sql.functions import col
from pyspark.sql.window import Window

column_list = ["col1", "col2"]
win_spec = Window.partitionBy([col(x) for x in column_list])
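
For context, here is a minimal sketch of the resulting window spec in use; the DataFrame df, its columns, and the row_number ordering are illustrative assumptions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical data, just to exercise the window spec.
df = spark.createDataFrame(
    [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)],
    ["col1", "col2", "val"],
)

column_list = ["col1", "col2"]

# Build the window spec from the list of column names.
win_spec = Window.partitionBy([col(x) for x in column_list]).orderBy("val")

# Number the rows within each (col1, col2) partition.
df.withColumn("rn", row_number().over(win_spec)).show()

Since partitionBy also accepts plain column-name strings, unpacking the list with Window.partitionBy(*column_list) should behave the same way.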