create parquet files in java

Imbar M. picture Imbar M. · Sep 27, 2016 · Viewed 32.9k times · Source

Is there a way to create parquet files from java?

I have data in memory (java classes) and I want to write it into a parquet file, to later read it from apache-drill.

Is there an simple way to do this, like inserting data into a sql table?

GOT IT

Thanks for the help.

Combining the answers and this link, I was able to create a parquet file and read it back with drill.

Answer

MaxNevermind picture MaxNevermind · Sep 27, 2016

ParquetWriter's constructors are deprecated(1.8.1) but not ParquetWriter itself, you can still create ParquetWriter by extending abstract Builder subclass inside of it.

Here an example from parquet creators themselves ExampleParquetWriter:

  public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
      super(file);
    }

    public Builder withType(MessageType type) {
      this.type = type;
      return this;
    }

    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
      this.extraMetaData = extraMetaData;
      return this;
    }

    @Override
    protected Builder self() {
      return this;
    }

    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
      return new GroupWriteSupport(type, extraMetaData);
    }

  }

If you don't want to use Group and GroupWriteSupport(bundled in Parquet but purposed just as an example of data-model implementation) you can go with Avro, Protocol Buffers, or Thrift in-memory data models. Here is an example using writing Parquet using Avro:

try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}   

You will need these dependencies:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>

Full example here.