AWS Athena create table and partition

moin moin picture moin moin · Dec 1, 2018 · Viewed 9.5k times · Source

I stored my sensor data in S3 (write data every 5 minutes):

farm_iot/sensor_data/farm/farm0001/sensor01/1541252701443

1541252701443 is a json file containing measurements:

{  "temperature": 14.78,  "pressure": 961.70,  "humidity": 68.32}

I am definitely missing some hive skill. Unfortunately I did not find an example that extracts timeseries json data that gets me started. I am also not sure wheather Hive / Athena does support this kind of data wresting.

I am struggeling with creating an Athena table for this data...

CREATE EXTERNAL TABLE IF NOT EXISTS farm.sensor_data (
  device string,
  sensor string,
  data_point string,
  value double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://farm-iot/sensor_data/farm/farm0001/sensor01/'
PARTITIONED BY (timestamp string)
TBLPROPERTIES ('has_encrypted_data'='false')

Another road I am thinking is to store the data in a structure that is easier to process / maybe I have not partitioned the data enough??!

so maybe I should add dt to the structure like this:

farm_iot/sensor_data/2018-11-03-02-45-02/farm/farm0001/sensor01/1541252701443

still does not get me where I want to be:

+---------------+----------+----------+-------------+--------+
| timestamp     | device   | sensor   | data_point  | value  |
+---------------+----------+----------+-------------+--------+
| 1541252701443 | farm0001 | sensor01 | temperature |  14.78 |
+---------------+----------+----------+-------------+--------+
| 1541252701443 | farm0001 | sensor01 | humidity    |  68.32 |
+---------------+----------+----------+-------------+--------+
| 1541252701443 | farm0001 | sensor01 | pressure    | 961.70 |
+---------------+----------+----------+-------------+--------+

Any pointer towards this goal would be much appreciated. Thank you!

please note: I do not want to use glue and like to understand how to do it manually. besides glue already created ~16.000 tables yesterday :)

Answer

hlagos picture hlagos · Dec 4, 2018

Let me try to explain a few problems that I see on front.

  • It looks like your desired output expect some data which is part of the path file location, device and sensor, however it is not defined as part of your table definition, only columns in the table definition or virtual columns will be available.
  • Several small files could impact the performance of your queries (but this will not affect your desired result)
  • Hive partitions are used to improve queries' performance avoiding scan the all the data. Partitions point to folders, in this case you are trying to access to specific files
  • Your desired output is basically exploding 1 record in several records, this shouldn't be handled at table definition, can be done through your select statement
  • Hive partitions have the naming convention of partitionname=partitionvalue, this is not mandatory but useful if you want to take advance of commands to automatically add partitions based on your folder structures.

This is how I would solve your problem if you will mainly query by sensor or device

Change structure of your data

Your folder sturcture ideally should go from

farm_iot/sensor_data/farm/farm0001/sensor01/1541252701443

to farm_iot/sensor_data/farm/device=farm0001/sensor=sensor01/1541252701443

Change your table definition

Your table definition should contain your partition locations to be able to select it without regex and take advantage of the performance improvement of it (I am guessing a common query will filter by device or sensor. Additional to that you need to add all your json columns that are part of your file

CREATE EXTERNAL TABLE IF NOT EXISTS farm.sensor_data (
  temperature double,
  preassure double,
  humidity double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://farm-iot/sensor_data/farm/'
PARTITIONED BY (device string, sensor string)
TBLPROPERTIES ('has_encrypted_data'='false')

Querying your data

We are missing the timestamp which is essentially part of your filename with the json input. We can include the file name during the select statement using virtual column INPUT__FILE__NAME as follow

select device, sensor, temperature, preassure, humidity, INPUT__FILE__NAME as mytimestamp from farm.sensor_data

If you want preassure, temperature and humidity and different rows, I would recommend create an array with those three and explode it, it should be much efficient that run 3 queries using UNION ALL to append the results

Adding new partition

If you follow Hive convention, you can take advantage of the command msck repair table to automatically add new partitions once new devices/sensors are included. In the worst case if you want to keep your folder structure, you can add partitions as follow

ALTER TABLE test ADD PARTITION (device='farm0001', sensor='sensor01') location 's3://farm_iot/sensor_data/farm/farm0001/sensor01'

NOTE: new partitions will not be automatically added, you always need to add them

I tried to add as much detail as possible. If something is not clear let me know.

EDIT: If your queries will be mostly based on timeseries (date range for example) I would recommend add a partition at day level (not smaller than this) to improve the performance of your queries. So your table definition would looks like

CREATE EXTERNAL TABLE IF NOT EXISTS farm.sensor_data (
  temperature double,
  preassure double,
  humidity double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://farm-iot/sensor_data/farm/'
PARTITIONED BY (dt=long, device string, sensor string)
TBLPROPERTIES ('has_encrypted_data'='false')

And your folder structure would looks like

farm_iot/sensor_data/farm/dt=20191204/device=farm0001/sensor=sensor01/1541252701443

As clarification, you do not need to modify the table for each new partition, only add this partitions to the table, this is essentially how Hive will know that a new partition was created. If you decide to use partitions, this is the only way, if you don't (this will impact the performance), there are some other alternatives to make it work

EDIT2:

If you want to keep your data structure as is and do not use partitions, it is possible to get the expected results as follow

CREATE EXTERNAL TABLE IF NOT EXISTS yourdb.sensordata (
  temperature double,
  pressure double,
  humidity double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) 
LOCATION 's3://farm-iot/sensor_data/farm/'
TBLPROPERTIES ('has_encrypted_data'='false');

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
select * from yourdb.sensordata;

select 
split(input__file__name, "/")[size(split(input__file__name, "/")) - 1] as ts,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 3] as device,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 2] as sensor,
'temperature' as data_point,
temperature as value
from yourdb.sensordata
union all
select 
split(input__file__name, "/")[size(split(input__file__name, "/")) - 1] as ts,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 3] as device,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 2] as sensor,
'pressure' as data_point,
pressure as value
from yourdb.sensordata
union all
select 
split(input__file__name, "/")[size(split(input__file__name, "/")) - 1] as ts,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 3] as device,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 2] as sensor,
'humidity' as data_point,
humidity as value
from yourdb.sensordata;

As you see, I am getting the most of the information from the file path, however it is required to set some flags to tell Hive read folders recursively

ts,device,sensor,_data_point,value
1541252701443,farm0001,sensor01,temperature,14.78
1541252701443,farm0001,sensor01,pressure,961.7
1541252701443,farm0001,sensor01,humidity,68.32