pyspark's 'between' function is not inclusive for timestamp input.
For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e. it omits the rows with a '2017-04-14 00:00:00' timestamp.
However, the documentation seems to hint that it is inclusive (though it says nothing specific about timestamps).
Of course, one way is to add a microsecond to the upper bound and pass that to the function (sketched after the example below), but that's not a great fix. Is there a clean way of doing an inclusive search?
Example:
import pandas as pd
from pyspark.sql import functions as F
... sql_context creation ...
test_pd = pd.DataFrame([
    {"start": '2017-04-13 12:00:00', "value": 1.0},
    {"start": '2017-04-14 00:00:00', "value": 1.1},
])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
+--------------------+-----+
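For reference, the microsecond bump I mean looks roughly like this, continuing with test_df from the example above (just a sketch; the hard-coded bound string is my own illustration, and that magic offset is exactly why I don't like it):
# bump the upper bound by one microsecond so the midnight row is no
# longer cut off by the comparison
test_df.filter(
    F.col("start").between('2017-04-13', '2017-04-14 00:00:00.000001')
).show()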
I found out the answer: pyspark's "between" function is inconsistent in how it handles timestamp inputs.
For the above example, here is the output for an inclusive search (using pd.to_datetime for the bounds):
test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
Similarly, if we provide the date AND time in string format, it seems to perform an inclusive search:
test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
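Given that, a clean-ish way to keep string inputs but still get the inclusive, timestamp-typed comparison is to cast the bound strings explicitly (a sketch based on the behaviour above; I've only run the pd.to_datetime and full-string variants shown, so treat this one as an assumption):
# cast the bound strings to timestamps so both sides of the comparison
# are timestamps rather than strings
test_df.filter(
    F.col("start").between(
        F.lit('2017-04-13').cast('timestamp'),
        F.lit('2017-04-14').cast('timestamp'),
    )
).show()
Column.between accepts Column arguments, so the casted literals can be passed in directly.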