I have a set of Avro-based Hive tables and I need to read data from them. Since Spark SQL uses Hive SerDes to read the data from HDFS, it is much slower than reading HDFS directly, so I have used the Databricks spark-avro jar to read the Avro files from the underlying HDFS directory.
Everything works fine except when the table is empty. I have managed to get the schema from the Hive table's .avsc file using the following code, but the load fails with the error "No Avro files found":
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.Schema

// Parse the table's Avro schema from its .avsc file on HDFS
val schemaFile = FileSystem.get(sc.hadoopConfiguration).open(new Path("hdfs://myfile.avsc"))
val schema = new Schema.Parser().parse(schemaFile)

spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/tmp/myoutput.avro").show()
Workarounds:
I have placed an empty Avro file in that directory, and then the same code works fine.
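For reference, this is roughly how such a placeholder can be created: a minimal sketch that reuses the schema parsed above and writes a zero-record Avro container file (header plus schema, no data) into the table directory; the file name placeholder.avro is arbitrary.

import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.hadoop.fs.{FileSystem, Path}

// Create an Avro file that contains only the header with the schema,
// so that a subsequent load() finds at least one Avro file to read.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/tmp/myoutput.avro/placeholder.avro"))
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, out)  // writes the header; no records are appended
writer.close()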
Are there any other ways to achieve the same result, such as a configuration setting?
You don't need to use emptyRDD. Here is what worked for me with PySpark 2.4:
empty_df = spark.createDataFrame([], schema)  # spark is the SparkSession
If you already have a schema from another DataFrame, you can just do this:
schema = some_other_df.schema
If you don't, then create the schema of the empty DataFrame manually, for example:
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

schema = StructType([StructField("col_1", StringType(), True),
                     StructField("col_2", DateType(), True),
                     StructField("col_3", StringType(), True),
                     StructField("col_4", IntegerType(), False)])
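To sanity-check the result, here is a short, self-contained sketch; the SparkSession setup and the two column names are just illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Illustrative two-column schema; substitute your table's real columns
schema = StructType([StructField("col_1", StringType(), True),
                     StructField("col_4", IntegerType(), False)])

empty_df = spark.createDataFrame([], schema)
empty_df.printSchema()   # all columns are present
print(empty_df.count())  # 0 -- no rows, but a fully usable DataFrame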
I hope this helps.