When I do:
allf = spark.read.parquet("gs://bucket/folder/*")
I get:
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
... and, after the list of paths, the following message:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
I am new to Spark. I believe my data source is really a collection of "folders" (something like base/top_folder/year=x/month=y/*.parquet) and I would like to load all the files and transform them.
Thanks for your help!
Per the Spark documentation on Parquet partition discovery, I believe that changing your load statement from
allf = spark.read.parquet("gs://bucket/folder/*")
to
allf = spark.read.parquet("gs://bucket/folder")
should discover and load all parquet partitions. This is assuming that the data was written with "folder" as its base directory.
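Alternatively, if you want to keep the glob pattern from your original statement, the error message itself points at the fix: set the "basePath" option so Spark knows the root of the partitioned table. A minimal sketch, assuming gs://bucket/folder is the table root:

# Tell Spark where the table root is, so the year=/month= directories
# are treated as partitions rather than conflicting structures.
allf = (spark.read
        .option("basePath", "gs://bucket/folder")
        .parquet("gs://bucket/folder/*"))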
If the directory gs://bucket/folder actually contains multiple datasets, you will want to load each dataset independently and then union them together, as sketched below.
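For example, here is a minimal sketch of the union approach, assuming two hypothetical datasets named dataset_a and dataset_b under the same bucket:

# Load each dataset separately...
df_a = spark.read.parquet("gs://bucket/folder/dataset_a")
df_b = spark.read.parquet("gs://bucket/folder/dataset_b")
# ...then union them; unionByName matches columns by name
# rather than by position, which is safer if column order differs.
allf = df_a.unionByName(df_b)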