Scala Spark: How to create an RDD from a list of strings and convert it to a DataFrame

NehaM · Apr 21, 2016 · Viewed 44.2k times

I want to create a DataFrame from a list of strings that matches an existing schema. Here is my code.

    val rowValues = List("ann", "f", "90", "world", "23456") // fails
    val rowValueTuple = ("ann", "f", "90", "world", "23456") // works

    val newRow = sqlContext.sparkContext.parallelize(Seq(rowValueTuple)).toDF(df.columns: _*)

    val newdf = df.unionAll(newRow).show()

The same code fails if I use the List of String. I see that the difference is that a Tuple is created from rowValueTuple. Since the size of the rowValues list changes dynamically, I cannot manually create a Tuple* object. How can I do this? What am I missing? How can I flatten this list to meet the requirement?

I'd appreciate your help, please.

Answer

Vitalii Kotliarenko · Apr 21, 2016

A DataFrame has a schema with a fixed number of columns, so it does not seem natural to make a row per variable-length list. Anyway, you can create your DataFrame from an RDD[Row] using the existing schema, like this:

    import org.apache.spark.sql.Row

    val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
    val rowRdd = rdd.map(v => Row(v: _*))                       // turn each list into a Row
    val newRow = sqlContext.createDataFrame(rowRdd, df.schema)  // note: use rowRdd, not rdd
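
Putting it together with the question's union step, a minimal end-to-end sketch (assuming Spark 1.x, where `sqlContext` and the original `df` from the question already exist and `df` has five string columns) might look like this:

    import org.apache.spark.sql.Row

    // Variable-length list of values, one per column of df's schema.
    val rowValues = List("ann", "f", "90", "world", "23456")

    // Wrap the list in a Row and parallelize it into a one-element RDD[Row].
    val rowRdd = sqlContext.sparkContext.parallelize(Seq(Row(rowValues: _*)))

    // Reuse df's schema so the columns line up by position.
    val newRow = sqlContext.createDataFrame(rowRdd, df.schema)

    // Append the new row to the original DataFrame (unionAll in Spark 1.x).
    df.unionAll(newRow).show()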