I am running this query in the Spark shell but it gives me an error:
sqlContext.sql(
"select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)
error:
java.lang.RuntimeException: [1.47] failure: ``)'' expected but identifier MAX found
select sal from samplecsv where sal < (select MAX(sal) from samplecsv) ^ at scala.sys.package$.error(package.scala:27)
Can anybody explain this to me? Thanks.
Spark 2.0+
Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite
for details. Some examples include:
select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)
select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
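For illustration, here is a minimal sketch of running two of the example queries above from the Scala shell on Spark 2.0+. The SparkSession setup, the toy data, and the temporary view names l and r are assumptions made here, not part of the original answer.

import org.apache.spark.sql.SparkSession

// Hypothetical toy tables l(a, b) and r(c, d), matching the names in the examples above.
val spark = SparkSession.builder().appName("subquery-demo").master("local[*]").getOrCreate()
import spark.implicits._

Seq((1, 10), (2, 20), (3, 30)).toDF("a", "b").createOrReplaceTempView("l")
Seq((1, 100), (3, 300)).toDF("c", "d").createOrReplaceTempView("r")

// Correlated EXISTS subquery: rows of l that have a match in r
spark.sql("select * from l where exists (select * from r where l.a = r.c)").show()

// Uncorrelated IN subquery: rows of l whose a appears in r.c
spark.sql("select * from l where l.a in (select c from r)").show()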
Unfortunately, as of now (Spark 2.0), it is not possible to express the same logic directly using the DataFrame DSL.
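That said, the same filtering logic can often be reproduced in the DSL with joins instead of subqueries. The sketch below uses left semi and left anti joins as substitutes for IN and NOT IN; this rewrite and the toy DataFrames are illustrative assumptions on my part, not something stated by the answer.

val spark = org.apache.spark.sql.SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val lDF = Seq((1, 10), (2, 20), (3, 30)).toDF("a", "b")
val rDF = Seq((1, 100), (3, 300)).toDF("c", "d")

// l.a IN (select c from r)  ~  left semi join
lDF.join(rDF, lDF("a") === rDF("c"), "leftsemi").show()

// l.a NOT IN (select c from r)  ~  left anti join (added in Spark 2.0;
// note this ignores NOT IN's special NULL handling)
lDF.join(rDF, lDF("a") === rDF("c"), "leftanti").show()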
Spark < 2.0
Spark supports subqueries in the FROM
clause (same as Hive <= 0.12).
SELECT col FROM (SELECT * FROM t1 WHERE bar) t2
It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed with Spark without promoting them to a Cartesian join.
Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of functionality here.
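Applied to the question's query on Spark < 2.0, that means pulling the scalar MAX(sal) subquery out of the WHERE clause, either by evaluating it first or by joining against the one-row aggregate. A rough sketch, assuming samplecsv is registered as a temporary table, sal is a numeric column, and sqlContext is the shell's SQLContext:

import org.apache.spark.sql.functions.max

val df = sqlContext.table("samplecsv")

// Option 1: evaluate the scalar subquery separately, then substitute the value.
val maxSal = df.agg(max("sal")).first().get(0)
sqlContext.sql(s"select sal from samplecsv where sal < $maxSal").collect().foreach(println)

// Option 2: express the subquery as a join against the single-row aggregate.
val maxDF = df.agg(max("sal").as("m"))
df.join(maxDF).where(df("sal") < maxDF("m")).select("sal").collect().foreach(println)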