I have a set of Avro-based Hive tables and I need to read data from them. Since Spark SQL uses Hive SerDes to read the data from HDFS, it is much slower than reading HDFS directly, so I have used the Databricks spark-avro jar to read the Avro files from the underlying HDFS directory.
Everything works fine except when the table is empty. I have managed to get the schema from the Hive table's .avsc file using the following code, but the load fails with the error "No Avro files found":
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.Schema

// Parse the table's Avro schema from its .avsc file on HDFS
val schemaFile = FileSystem.get(sc.hadoopConfiguration).open(new Path("hdfs://myfile.avsc"))
val schema = new Schema.Parser().parse(schemaFile)

spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/tmp/myoutput.avro").show()
Workarounds:
I have placed an empty Avro file in that directory, and then the same code works fine.
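For reference, this is roughly how such a placeholder can be created: a minimal sketch that reuses the schema parsed above and writes a zero-record Avro container file (header plus schema, no data) into the table directory; the file name placeholder.avro is arbitrary.

import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.hadoop.fs.{FileSystem, Path}

// Create an Avro file that contains only the header with the schema,
// so that a subsequent load() finds at least one Avro file to read.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/tmp/myoutput.avro/placeholder.avro"))
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, out)  // writes the header; no records are appended
writer.close()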
Are there any other ways to achieve the same result, such as a configuration setting?
You don't need to use emptyRDD. Here is what worked for me with PySpark 2.4:
empty_df = spark.createDataFrame([], schema)  # spark is the SparkSession
If you already have a schema from another DataFrame, you can just do this:
schema = some_other_df.schema
If you don't, then create the schema of the empty DataFrame manually, for example:
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

schema = StructType([StructField("col_1", StringType(), True),
                     StructField("col_2", DateType(), True),
                     StructField("col_3", StringType(), True),
                     StructField("col_4", IntegerType(), False)])
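To sanity-check the result, here is a short, self-contained sketch; the SparkSession setup and the two column names are just illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Illustrative two-column schema; substitute your table's real columns
schema = StructType([StructField("col_1", StringType(), True),
                     StructField("col_4", IntegerType(), False)])

empty_df = spark.createDataFrame([], schema)
empty_df.printSchema()   # all columns are present
print(empty_df.count())  # 0 -- no rows, but a fully usable DataFrame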
I hope this helps.