I need to read parquet files from multiple paths that are not parent or child directories.
for example,
dir1 ---
|
------- dir1_1
|
------- dir1_2
dir2 ---
|
------- dir2_1
|
------- dir2_2
sqlContext.read.parquet(dir1)
reads parquet files from dir1_1 and dir1_2
Right now I'm reading each dir and merging dataframes using "unionAll".
Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll
or is there any fancy way using unionAll
Thanks
A little late but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list to spark.read.parquet()
paths=['foo','bar']
df=spark.read.parquet(*paths)
This is convenient if you want to pass a few blobs into the path argument:
basePath='s3://bucket/'
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
's3://bucket/partition_value1=*/partition_value2=2017-05-*'
]
df=spark.read.option("basePath",basePath).parquet(*paths)
This is cool cause you don't need to list all the files in the basePath, and you still get partition inference.