spark "basePath" option setting

jldupont · Nov 15, 2016 · Viewed 14k times

When I do:

allf = spark.read.parquet("gs://bucket/folder/*")

I get:

java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:

... And the following message after the list of paths:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

I am new to Spark. I believe my data source is really a collection of "folders" (something like base/top_folder/year=x/month=y/*.parquet) and I would like to load all the files and transform them.

Thanks for your help!

  • UPDATE 1: I've looked at the Dataproc console and there is no way to set "options" when creating a cluster.
  • UPDATE 2: I've checked the cluster's "cluster.properties" file and there is no such option. Could it be that I must add one and reset the cluster?

Answer

Angus Davis · Nov 15, 2016

Per the Spark documentation on Parquet partition discovery, I believe that changing your load statement from

allf = spark.read.parquet("gs://bucket/folder/*")

to

allf = spark.read.parquet("gs://bucket/folder")

should discover and load all Parquet partitions. This assumes that the data was written with "folder" as its base directory.
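For intuition: partition discovery walks the directory tree under the base path and turns Hive-style `key=value` directory names into partition columns, which is why pointing the reader at the base directory (or setting "basePath") matters. Below is a plain-Python sketch of that interpretation, not Spark's actual implementation; the bucket and layout are the hypothetical ones from the question:

```python
from pathlib import PurePosixPath

def discover_partitions(paths, base):
    """Rough sketch of Hive-style partition discovery: for each file under
    `base`, parse key=value directory segments into partition columns."""
    base = PurePosixPath(base)
    rows = []
    for p in paths:
        rel = PurePosixPath(p).relative_to(base)
        cols = {}
        for seg in rel.parts[:-1]:          # directories only, not the file
            if "=" in seg:
                key, _, val = seg.partition("=")
                cols[key] = val
        rows.append((p, cols))
    return rows

# Hypothetical files matching the base/folder/year=x/month=y/*.parquet layout
files = [
    "gs://bucket/folder/year=2016/month=11/part-0.parquet",
    "gs://bucket/folder/year=2016/month=12/part-0.parquet",
]
for path, cols in discover_partitions(files, "gs://bucket/folder"):
    print(path, cols)
```

With the base set to gs://bucket/folder, each file picks up `year` and `month` columns; globbing with a trailing `/*` instead makes each `year=x` directory look like a separate table root, which is what triggers the error above.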

If the directory base/folder actually contains multiple datasets, you will want to load each dataset independently and then union them together.
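The "load separately, then union" advice exists because discovery asserts that all files under one base agree on a single partition scheme. A stdlib sketch of that consistency check, under the assumption (hypothetical dataset names `sales` and `logs`) that the folder holds two differently partitioned datasets:

```python
from pathlib import PurePosixPath

def partition_keys(path, base):
    """Partition column names found between `base` and the file."""
    rel = PurePosixPath(path).relative_to(PurePosixPath(base))
    return tuple(seg.partition("=")[0] for seg in rel.parts[:-1] if "=" in seg)

def check_consistent(paths, base):
    """Rough stand-in for Spark's assertion: every file under one base
    must agree on the partition columns, otherwise discovery fails."""
    schemes = {partition_keys(p, base) for p in paths}
    if len(schemes) > 1:
        raise AssertionError(f"Conflicting directory structures: {schemes}")
    return schemes.pop()

# Two hypothetical datasets with different partition columns:
ds1 = ["gs://bucket/folder/sales/year=2016/month=11/part-0.parquet"]
ds2 = ["gs://bucket/folder/logs/day=2016-11-15/part-0.parquet"]

# Reading both under one base trips the check...
try:
    check_consistent(ds1 + ds2, "gs://bucket/folder")
except AssertionError as e:
    print(e)

# ...while each base on its own is fine; in Spark you would then
# read each base separately and union the resulting DataFrames.
print(check_consistent(ds1, "gs://bucket/folder/sales"))
print(check_consistent(ds2, "gs://bucket/folder/logs"))
```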