I have hit a bit of an edge case when saving a partitioned Parquet table in Spark SQL:
// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("time", DataTypes.StringType, true),
        DataTypes.createStructField("accountId", DataTypes.StringType, true),
        ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);

df.coalesce(1)
    .write()
    .mode(SaveMode.Append)
    .format("parquet")
    .partitionBy("year")
    .saveAsTable("tblclick8partitioned");
Spark warns:
Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive
In Hive:
hive> describe tblclick8partitioned;
OK
col array<string> from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)
Obviously the schema is not correct. However, if I use saveAsTable in Spark SQL without partitioning, the table can be queried without a problem.
The question is: how can I make a partitioned Parquet table in Spark SQL that is compatible with Hive?
That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions. The workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:
hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")