I am using pySpark
, and have set up my dataframe with two columns representing a daily asset price as follows:
ind = sc.parallelize(range(1,5))
prices = sc.parallelize([33.3,31.1,51.2,21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data,["day","price"])
I get upon applying df.show()
:
+---+-----+
|day|price|
+---+-----+
| 1| 33.3|
| 2| 31.1|
| 3| 51.2|
| 4| 21.3|
+---+-----+
Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e., something like
(price(day2)-price(day1))/(price(day1))
After much research, I am told that this is most efficiently accomplished through applying the pyspark.sql.window
functions, but I am unable to see how.
You can bring the previous day column by using lag function, and add additional column that does actual day-to-day return from the two columns, but you may have to tell spark how to partition your data and/or order it to do lag, something like this:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
dfu = df.withColumn('user', lit('tmoore'))
df_lag = dfu.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Window.partitionBy("user")))
result = df_lag.withColumn('daily_return',
(df_lag['price'] - df_lag['prev_day_price']) / df_lag['price'] )
>>> result.show()
+---+-----+-------+--------------+--------------------+
|day|price| user|prev_day_price| daily_return|
+---+-----+-------+--------------+--------------------+
| 1| 33.3| tmoore| null| null|
| 2| 31.1| tmoore| 33.3|-0.07073954983922816|
| 3| 51.2| tmoore| 31.1| 0.392578125|
| 4| 21.3| tmoore| 51.2| -1.403755868544601|
+---+-----+-------+--------------+--------------------+
Here is longer introduction into Window functions in Spark.