I have a table that has partitions and I use avro files or text files to create and insert into a table.
Once the table is done, is there a way to convert into parquet.
I mean I know we could have done say CREATE TABLE default.test( name_id STRING)
PARTITIONED BY ( year INT, month INT, day INT ) STORED AS PARQUET
initially while creating the table itself.
In my use case I 'll have to use textfiles initially. This is because I want to avoid creating multiple files inside of partition folders everytime I insert or update. My table has a very high number of inserts and updates and this is creating a drop in performance.
Is there a way I could convert into parquet after the table is created and data inserted?
You can create a table on your data in hdfs which can be stored as text, avro, or whatever format.
Then you can create another table using:
CREATE TABLE x_parquet LIKE x_non_parquet STORED AS PARQUET;
You can then set compression to something like snappy or gzip:
SET PARQUET_COMPRESSION_CODEC=snappy;
Then you can get data from the non parquet table and insert it into the new parquet backed table:
INSERT INTO x_parquet select * from x_non_parquet;
Now if you want to save space and avoid confusion, I'd automate this for any data ingestion and then delete the original non parquet format. This will help your queries run faster and cause your data to take up less space.