compute string length in Spark SQL DSL

Wilmerton · Feb 16, 2015 · Viewed 24.7k times

Edit: this is an old question concerning Spark 1.2

I've been trying to compute, on the fly, the length of a string column in a SchemaRDD for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, or about knowing their limitations.

My first attempt has been to use the integrated relational queries, for instance

notes.select('note).orderBy(length('note))

with no luck at compile time:

error: not found: value length

(Which makes me wonder where to find out which "Expression"s this DSL can actually resolve. For instance, it does resolve "+" for column addition.)

Then I tried

sql("SELECT note, length(note) as len FROM notes")

This fails with

java.util.NoSuchElementException: key not found: length

(I then reread the supported Hive features section of the programming guide for 1.2.0, which I'm running, http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features, and wondered in what sense Spark SQL supports the listed Hive features.)
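For the record, the only workaround I can think of with the plain SQLContext is registering my own Scala function as a UDF and calling that instead of length. A sketch, assuming registerFunction behaves in 1.2 as documented and using a made-up name strLen:

// assuming `sqlContext` is the SQLContext the `notes` SchemaRDD came from
sqlContext.registerFunction("strLen", (s: String) => s.length)
// expose the SchemaRDD as a temp table for the SQL interface
notes.registerTempTable("notes")
// the registered name should be found in the same function registry that misses "length"
sqlContext.sql("SELECT note, strLen(note) AS len FROM notes").foreach(println)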

Questions: is the length operator really supported in Expressions and/or in SQL statements? If yes, what is the syntax? (bonus: is there a specific documentation about what is resolved in Spark SQL Expressions, and what would be the syntax in general?)

Thanks!

Answer

G Quintana · Feb 25, 2015

Try this in Spark Shell:

case class Note(id: Int, text: String)
val notes = List(Note(1, "One"), Note(2, "Two"), Note(3, "Three"))
val notesRdd = sc.parallelize(notes)

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
// implicit conversion from RDD[Note] to SchemaRDD
import hc.createSchemaRDD
// register the RDD as a temporary table so it can be queried with SQL
notesRdd.registerTempTable("note")
// length() is resolved by the HiveQL parser as a Hive built-in UDF
hc.sql("select id, text, length(text) from note").foreach(println)

It works on my setup (out-of-the-box Spark 1.2.1 with Hadoop 2.4):

[2,Two,3]
[1,One,3]
[3,Three,5]
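The function is resolved here because HiveContext routes the query through the HiveQL parser, which knows Hive's built-in UDFs such as length; the plain SQLContext parser in 1.2 only finds functions in its own registry, hence the "key not found: length" error. Since the original goal was ordering by length, the same setup should also accept the function inside an ORDER BY, along these lines (a sketch, not re-verified):

hc.sql("select id, text from note order by length(text)").foreach(println)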