Spark - SELECT WHERE or filtering?

lte__ · Aug 10, 2016

What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other one?

When do I use

DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10));

and when is

DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10");

more appropriate?

Answer

Yaron · Aug 10, 2016

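According to the Spark documentation, "where() is an alias for filter()", so the two snippets in the question return the same rows. Both methods accept either a Column expression or a SQL expression string; use whichever reads better to you.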

filter(condition) Filters rows using the given condition. where() is an alias for filter().

Parameters: condition – a Column of types.BooleanType or a string of SQL expression.

>>> df.filter(df.age > 3).collect()
[Row(age=5, name=u'Bob')]
>>> df.where(df.age == 2).collect()
[Row(age=2, name=u'Alice')]

>>> df.filter("age > 3").collect()
[Row(age=5, name=u'Bob')]
>>> df.where("age = 2").collect()
[Row(age=2, name=u'Alice')]
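
To make "alias" concrete without needing a Spark installation, here is a minimal plain-Python sketch. TinyFrame and its methods are hypothetical names made up for illustration; this is not Spark's actual implementation, only a toy showing one method forwarding to the other so that both names give the same result.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame; not Spark code."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # Keep only the rows for which the predicate returns True.
        return TinyFrame(r for r in self.rows if predicate(r))

    def where(self, predicate):
        # Alias: a second name that simply forwards to filter().
        return self.filter(predicate)

    def collect(self):
        # Return the rows as a plain list.
        return list(self.rows)


people = TinyFrame([{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}])

via_filter = people.filter(lambda r: r["age"] > 3).collect()
via_where = people.where(lambda r: r["age"] > 3).collect()

print(via_filter == via_where)
```

Since where() only delegates to filter() in this sketch, the two calls cannot produce different results, which is the same relationship the documentation describes for the real methods.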