Top "Parquet" questions

Apache Parquet is a columnar storage format for the Hadoop ecosystem.

JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any …

java json hadoop parquet
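
One commonly suggested route is the example Group API in parquet-mr (the parquet-hadoop and parquet-column modules), which writes records against a MessageType schema with no Avro intermediate. A minimal sketch, assuming Jackson for JSON parsing and a hypothetical two-field schema:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToParquet {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-field schema; a real converter would derive this
        // from the JSON structure or an external schema definition.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message event { required binary name (UTF8); required int32 age; }");

        String json = "{\"name\": \"alice\", \"age\": 30}";
        JsonNode node = new ObjectMapper().readTree(json);

        // Map JSON fields onto a Group record matching the schema.
        SimpleGroupFactory factory = new SimpleGroupFactory(schema);
        Group group = factory.newGroup()
            .append("name", node.get("name").asText())
            .append("age", node.get("age").asInt());

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("events.parquet"))
                .withType(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(group);
        }
    }
}
```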
Using Spark to write a Parquet file to S3 over s3a is very slow

I'm trying to write a Parquet file out to Amazon S3 using Spark 1.6.1. The small Parquet file that I'm generating is ~2…

scala amazon-s3 apache-spark apache-spark-sql parquet
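
The slowness here is usually the rename-based output commit against S3 plus Parquet summary-file writing, not the data volume. A hedged sketch of the commonly suggested mitigations for Spark 1.6, written with the Java API (bucket name and paths are placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class WriteToS3a {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("parquet-to-s3a"));

        // Commit algorithm v2 avoids the slow two-phase rename on object stores,
        // and skipping Parquet summary metadata removes extra S3 round trips.
        sc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2");
        sc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false");

        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.read().json("hdfs:///input/events.json"); // placeholder input
        df.write().mode(SaveMode.Overwrite).parquet("s3a://my-bucket/output/"); // placeholder bucket
    }
}
```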
Spark SQL saveAsTable is not compatible with Hive when a partition is specified

Kind of an edge case: when saving a Parquet table in Spark SQL with a partition, #schema definition final StructType schema = DataTypes.createStructType(…

hive apache-spark-sql partitioning parquet
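
The root cause is that a partitioned saveAsTable stores Spark-specific table metadata that Hive may not understand. The usual workaround is to write plain partitioned Parquet files and declare a Hive-native external table over them. A sketch using Spark 1.x's Java API (table name, columns, and warehouse path are hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

public class HiveCompatibleSave {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hive-compatible"));
        HiveContext hive = new HiveContext(sc.sc());

        DataFrame df = hive.read().json("hdfs:///input/events.json"); // placeholder source

        // Write plain partitioned Parquet files instead of calling saveAsTable.
        df.write().mode(SaveMode.Overwrite).partitionBy("dt").parquet("/warehouse/events");

        // Declare a Hive-native table over the same location, then discover partitions.
        hive.sql("CREATE EXTERNAL TABLE IF NOT EXISTS events (name STRING, age INT) "
               + "PARTITIONED BY (dt STRING) STORED AS PARQUET LOCATION '/warehouse/events'");
        hive.sql("MSCK REPAIR TABLE events");
    }
}
```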
How to read a nested collection in Spark

I have a Parquet table with one of the columns being array<struct<col1,col2,..colN>> …

apache-spark apache-spark-sql nested parquet lateral-join
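
For an array<struct<...>> column, the usual approach is explode (the DataFrame counterpart of Hive's LATERAL VIEW explode), followed by dotted field access into the struct. A sketch, assuming a hypothetical column named items holding the array:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ReadNested {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("read-nested"));
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame df = sqlContext.read().parquet("/tables/nested"); // placeholder path

        // explode() emits one output row per array element; "items" is hypothetical.
        DataFrame flat = df
            .select(explode(col("items")).alias("item"))
            .select(col("item.col1"), col("item.col2"));
        flat.show();
    }
}
```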
How to identify Pandas' backend for Parquet

I understand that Pandas can read and write to and from Parquet files using different backends: pyarrow and fastparquet. I …

python pandas parquet
Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save …

scala apache-spark append parquet
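
With the DataFrame writer, a partitioned layout plus append mode adds new files under the matching partition directories without touching existing data. A sketch (partition columns and paths are hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class AppendPartitioned {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hourly-etl"));
        SQLContext sqlContext = new SQLContext(sc);

        // One hourly batch; assumes the logs already carry year/month/day/hour columns.
        DataFrame hourly = sqlContext.read().json("/logs/incoming/batch.json"); // placeholder

        hourly.write()
            .mode(SaveMode.Append)                       // keep existing partition data
            .partitionBy("year", "month", "day", "hour") // layout: year=…/month=…/day=…/hour=…
            .parquet("/warehouse/logs");                 // placeholder target
    }
}
```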
Efficient way to read specific columns from a Parquet file in Spark

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that …

apache-spark parquet
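
Because Parquet is columnar, selecting columns right after the read lets Spark push the projection down so only those column chunks are scanned. A sketch, with hypothetical column names:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ColumnPruning {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("column-pruning"));
        SQLContext sqlContext = new SQLContext(sc);

        // Only the "id" and "price" column chunks are read from disk;
        // the projection is pushed down into the Parquet reader.
        DataFrame subset = sqlContext.read()
            .parquet("/warehouse/trades")  // placeholder path
            .select("id", "price");        // hypothetical columns
        subset.explain();                  // plan should show just the selected columns
    }
}
```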
How does Hive create a table using Parquet and Snappy?

I know the syntax for creating a table using Parquet, but I want to know what it means to …

hive parquet snappy
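
The two moving parts are STORED AS PARQUET, which selects the file format, and a table property naming the codec, which makes Snappy apply to data written into the table. A sketch issuing the DDL through Spark's HiveContext (table and column names are made up; the DDL itself is the point):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class CreateParquetSnappyTable {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("parquet-snappy-ddl"));
        HiveContext hive = new HiveContext(sc.sc());

        // STORED AS PARQUET picks the file format; the table property asks the
        // Parquet writer to compress each column chunk with Snappy.
        hive.sql("CREATE TABLE IF NOT EXISTS sales (id BIGINT, amount DOUBLE) "
               + "STORED AS PARQUET "
               + "TBLPROPERTIES ('parquet.compression'='SNAPPY')");
    }
}
```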
How to deal with tasks running too long (compared to others in the job) in yarn-client?

We use a Spark cluster in yarn-client mode to run several business jobs, but sometimes a task runs too long …

apache-spark yarn parquet
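
For stragglers, the usual first lever is speculative execution, which re-launches suspiciously slow task attempts on other executors and keeps whichever finishes first. A sketch of the relevant settings (the threshold values here are illustrative, not recommendations):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SpeculationConfig {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("straggler-mitigation")
            .set("spark.speculation", "true")            // re-run slow task attempts
            .set("spark.speculation.quantile", "0.75")   // check only after 75% of tasks finish
            .set("spark.speculation.multiplier", "1.5"); // "slow" = 1.5x the median task time
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code as before; the scheduler handles speculation transparently.
    }
}
```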
Spark save (write) Parquet as only one file

If I write dataFrame.write.format("parquet").mode("append").save("temp.parquet"), then in the temp.parquet folder I get the same …

scala apache-spark parquet
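
Each write produces one part file per partition of the DataFrame, so collapsing to a single partition first yields a single part file (at the cost of funneling all data through one task). A sketch:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class SinglePartFile {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("single-file"));
        SQLContext sqlContext = new SQLContext(sc);

        DataFrame df = sqlContext.read().json("/input/events.json"); // placeholder

        // coalesce(1) collapses to one partition, so exactly one part-XXXXX file
        // lands under temp.parquet (which remains a directory; Spark always writes one).
        df.coalesce(1)
          .write()
          .mode(SaveMode.Append)
          .parquet("temp.parquet");
    }
}
```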