pyspark's "between" function: range search on timestamps is not inclusive

Vinay Kolar picture Vinay Kolar · Apr 14, 2017 · Viewed 37.4k times · Source

pyspark's 'between' function is not inclusive for timestamp input.

For example, if we want all rows between two dates, say, '2017-04-13' and '2017-04-14', then it performs an "exclusive" search when the dates are passed as strings. i.e., it omits the '2017-04-14 00:00:00' fields

However, the document seem to hint that it is inclusive (no reference on timestamp though)

Of course, one way is to add a microsecond from the upper bound and pass it to the function. However, not a great fix. Any clean way of doing inclusive search?

Example:

import pandas as pd
from pyspark.sql import functions as F
... sql_context creation ...
test_pd=pd.DataFrame([{"start":'2017-04-13 12:00:00', "value":1.0},{"start":'2017-04-14 00:00:00', "value":1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
+--------------------+-----+

Answer

Vinay Kolar picture Vinay Kolar · Apr 14, 2017

Found out the answer. pyspark's "between" function is inconsistent in handling timestamp inputs.

  1. If you provide the the input in string format without time, it performs an exclusive search (Not what we expect from the documentation linked above).
  2. If you provide the input as datetime object or with exact time (e.g., '2017-04-14 00:00:00', then it performs an inclusive search.

For the above example, here is the output for exclusive search (use pd.to_datetime):

test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

Similarly, if we provide in the date AND time in string format, it seems to perform an inclusive search:

test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+