How to subtract a column of days from a column of dates in Pyspark?

kjmij · Mar 17, 2016

Given the following PySpark DataFrame

df = sqlContext.createDataFrame([('2015-01-15', 10),
                                 ('2015-02-15', 5)],
                                 ('date_col', 'days_col'))
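(Side note: on Spark 2.0 and later, the SparkSession replaces sqlContext as the entry point; an equivalent setup, assuming a SparkSession named spark, would be:)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('2015-01-15', 10),
                            ('2015-02-15', 5)],
                           ('date_col', 'days_col'))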

How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].

I looked into pyspark.sql.functions.date_sub(), but it takes a date column plus a single integer for the number of days, e.g. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']).
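(Worth noting: as far as I can tell, Spark 3.0 and later lift this restriction and allow the days argument of date_sub to be a column, so on those versions the direct call works; a minimal sketch, assuming date_col holds dates:)

from pyspark.sql.functions import date_sub

# Spark 3.0+ only: the days argument may be a column rather than an int
df.withColumn('subtracted_dates', date_sub(df['date_col'], df['days_col']))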

I also tried creating a UDF:

from datetime import timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

# note: date_col must be a DateType column for the subtraction to work;
# cast it first if it is stored as a string
def subtract_date(start_date, days_to_subtract):
    return start_date - timedelta(days=days_to_subtract)

subtract_date_udf = udf(subtract_date, DateType())
df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))

This technically works, but I've read that stepping between Spark and Python can cause performance issues for large datasets. I can stick with this solution for now (no need to prematurely optimize), but my gut says there's just got to be a way to do this simple thing without using a Python UDF.

Answer

Shu · Jun 8, 2018

Use the expr function (if the number of days to subtract comes dynamically from another column):

>>> from pyspark.sql.functions import expr
>>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)"))
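Run against the sample DataFrame from the question, this produces the expected result (output sketched by hand; exact formatting may vary):

>>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)")).show()
+----------+--------+----------------+
|  date_col|days_col|subtracted_dates|
+----------+--------+----------------+
|2015-01-15|      10|      2015-01-05|
|2015-02-15|       5|      2015-02-10|
+----------+--------+----------------+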

Use date_sub directly inside withColumn (if the number of days to subtract is a literal integer):

>>> df.withColumn('subtracted_dates', date_sub('date_col', <int_literal_value>))
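For example, with 10 standing in for <int_literal_value>:

>>> from pyspark.sql.functions import date_sub
>>> df.withColumn('subtracted_dates', date_sub('date_col', 10))

This shifts every row back by the same 10 days (2015-01-15 becomes 2015-01-05, and 2015-02-15 becomes 2015-02-05).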