I'm receiving a set of small (~1 MB) CSV/JSON files on S3 that I would like to convert to Parquet. I was expecting to be able to convert these files easily to Parquet using a Lambda function.
After searching Google, I couldn't find a solution for this that doesn't involve some sort of Hadoop.
Since this is just a file conversion, I can't believe there is no easy way to do it. Does someone have Java/Scala sample code for this conversion?
If your input JSON files are not large (< 64 MB, beyond which Lambda is likely to hit memory caps) and either have simple data types or you are willing to flatten the structs, you might consider using pyarrow, even though the route is slightly convoluted.
It involves using pandas to read the JSON into a DataFrame:
import pandas as pd
df = pd.read_json("file.json")
followed by converting the DataFrame to an Arrow table and writing it out as a Parquet file:
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, "file.parquet")
The above example auto-infers data types. You can override that by passing the dtype argument to read_json. Its only major shortcoming is that pyarrow only supports string, bool, float, int, date, time, decimal, list, and array types.
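Putting the two steps together with the S3 part of the question, a minimal Lambda-style sketch might look like the one below (the bucket names, keys and handler name are placeholders, and I'm assuming the files are small enough to stage in Lambda's writable /tmp directory):

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholder locations -- adjust to your own S3 layout.
    src_bucket, src_key = "my-input-bucket", "incoming/file.json"
    dst_bucket, dst_key = "my-output-bucket", "converted/file.parquet"

    # Stage the small JSON file in Lambda's writable /tmp space.
    s3.download_file(src_bucket, src_key, "/tmp/input.json")

    # Let pandas infer the schema; pass dtype=... to override it.
    df = pd.read_json("/tmp/input.json")

    # Convert to an Arrow table, write Parquet locally, then upload.
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "/tmp/output.parquet")
    s3.upload_file("/tmp/output.parquet", dst_bucket, dst_key)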
Update (a more generic solution):
Consider using json2parquet.
However, if the input data has nested dictionaries, it first needs to be flattened, i.e. you need to convert:
{a: {b: {c: d}}} to {a.b.c: d}
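A small recursive helper is enough for that step; here is a sketch (the function name and the dot separator are my own choices):

def flatten(record, parent_key="", sep="."):
    # Collapse nested dicts into dotted keys, e.g.
    # {"a": {"b": {"c": 1}}} -> {"a.b.c": 1}
    flat = {}
    for key, value in record.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat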
Then, this data needs to be ingested as a PyArrow record batch with json2parquet:
import json2parquet as j2p
pa_batch = j2p.ingest_data(data)
and now the batch can be loaded as a PyArrow Table:
table = pa.Table.from_batches([pa_batch])
and written out as a Parquet file:
pa.parquet.write_table(table, "file.parquet")
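For completeness, here is the whole json2parquet route with its imports, sketched with placeholder data and an output path of my own choosing:

import json2parquet as j2p
import pyarrow as pa
import pyarrow.parquet as pq

# A couple of already-flattened records (placeholder data).
data = [{"a.b.c": 1}, {"a.b.c": 2}]

# json2parquet turns the list of dicts into an Arrow RecordBatch.
pa_batch = j2p.ingest_data(data)

# Wrap the batch in a Table and write it out as Parquet.
table = pa.Table.from_batches([pa_batch])
pq.write_table(table, "file.parquet")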