Unable to infer schema when loading Parquet file

Question 1

apache-spark pyspark parquet

user48956 · Jul 6, 2017 · Viewed 79.9k times · Source

Answer

This error usually occurs when you try to read an empty directory as Parquet. Your outcome DataFrame is probably empty.
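To confirm that the directory being read really contains no data, a quick check along these lines may help. This is only a sketch: `has_parquet_data` is a hypothetical helper name, it inspects the local filesystem only (not HDFS or S3), and it assumes the data files end in `.parquet`, which is how Spark names them by default.

```python
import os


def has_parquet_data(path):
    """Return True if `path` is a local directory holding at least one *.parquet file.

    Marker files such as _SUCCESS and checksum files such as *.crc do not count,
    because they carry no schema information.
    """
    if not os.path.isdir(path):
        return False
    # Walk subdirectories too, since partitioned output nests the data files.
    for _root, _dirs, files in os.walk(path):
        if any(name.endswith(".parquet") for name in files):
            return True
    return False
```

If this returns False for the path passed to read.parquet, there is nothing in that directory, as seen from this machine, for Spark to infer a schema from.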

You could check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
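A minimal sketch of that check, wrapped in a function. The name `write_parquet_if_nonempty` is made up for illustration; `df` is assumed to be a PySpark DataFrame, and the write call mirrors the one in the question.

```python
def write_parquet_if_nonempty(df, path):
    """Write `df` to `path` as Parquet unless it has no rows.

    Returns True if the write happened, False if it was skipped.
    """
    if df.rdd.isEmpty():
        # Nothing to write: skip, so no data-less directory is left behind.
        return False
    df.write.parquet(path, mode="overwrite")
    return True
```

With the names from the question, this would be called as write_parquet_if_nonempty(outcome, response). A False return means the query produced no rows, so there would be nothing to read back.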

Question 2

response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

in

/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Using Spark 2.1.1. Also fails in 2.2.0.

Found this bug report, but it was fixed in 2.0.1 and 2.1.0.

UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
