Convert file of JSON objects to Parquet file

danieltahara · Feb 11, 2014

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.

Is there any way to do this without first loading the data into Hive, etc., and then using one of the Parquet connectors to generate an output file?

Answer

blue · May 28, 2015

Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.

First, you would infer the schema of your JSON:

kite-dataset json-schema sample-file.json -o schema.avsc

Then you can use that file to create a Parquet Hive table:

kite-dataset create mytable --schema schema.avsc --format parquet

Finally, you can load your JSON into the dataset:

kite-dataset json-import sample-file.json mytable
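
If you run these steps often, they can be chained in a small shell function. The following is only a sketch: the names json_to_parquet, run, and DRY_RUN are hypothetical helpers and not part of Kite, and the kite-dataset flags are the ones shown in the commands above. It assumes kite-dataset is on your PATH unless DRY_RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch: chain the three kite-dataset steps for a single JSON file.
# json_to_parquet, run, and DRY_RUN are hypothetical names, not part of Kite.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: json_to_parquet <json-file> <dataset-name> [schema-file]
json_to_parquet() {
  local json_file="$1"
  local dataset="$2"
  local schema="${3:-schema.avsc}"

  # Each step runs only if the previous one succeeded.
  run kite-dataset json-schema "$json_file" -o "$schema" &&
  run kite-dataset create "$dataset" --schema "$schema" --format parquet &&
  run kite-dataset json-import "$json_file" "$dataset"
}
```

With DRY_RUN=1 set, calling json_to_parquet sample-file.json mytable prints the three commands instead of running them, which is a cheap way to check the arguments before touching Hive.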

You can also import an entire directory stored in HDFS. In that case, Kite will use a MapReduce job to do the import.