Athena and S3 Inventory. HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema

duzvik · Jan 16, 2018 · Viewed 8.4k times

I'm trying to understand how to work with S3 Inventory, and I'm following this tutorial.

After loading the inventory lists into my table, I tried querying it and found two issues.

1) SELECT key, size FROM table; — the size column shows the same magic number (4923069104295859283) for every record.

2) SELECT * FROM table; (Query Id: cf07c309-c685-4bf4-9705-8bca69b00b3c)

This query fails with the following error:

HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema

Here is my table schema:

CREATE EXTERNAL TABLE `table`(
`bucket` string, 
`key` string, 
`version_id` string, 
`is_latest` boolean, 
`is_delete_marker` boolean, 
`size` bigint, 
`last_modified_date` timestamp, 
`e_tag` string, 
`storage_class` string)
PARTITIONED BY ( 
`dt` string)
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://......../hive'
TBLPROPERTIES (
'transient_lastDdlTime'='1516093603')

Answer

herve · Jan 28, 2018

Running the following command on any of the ORC files that S3 Inventory generated for you will show the actual structure of your inventory:

$> hive --orcfiledump ~/Downloads/017c2014-1205-4431-a30d-2d9ae15492d6.orc
...
Processing data file /tmp/017c2014-1205-4431-a30d-2d9ae15492d6.orc [length: 4741786]
Structure for /tmp/017c2014-1205-4431-a30d-2d9ae15492d6.orc
File Version: 0.12 with ORC_135
Rows: 223473
Compression: ZLIB
Compression size: 262144
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string>
...

It appears that the example provided by AWS here expects your inventory to cover not just the current versions but all versions of the objects in your bucket, so its column list does not necessarily match what your ORC files actually contain.
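Hive's ORC reader (and, as far as I can tell, Athena's by default) maps table columns to ORC fields by position rather than by name, so any column that is missing from or extra in the DDL shifts every field after it. A minimal sketch — plain Python, with the column lists copied from the question's DDL and the orcfiledump output above — makes the misalignment visible:

```python
# Columns as declared in the question's Athena table, in order.
table_cols = [
    ("bucket", "string"), ("key", "string"), ("version_id", "string"),
    ("is_latest", "boolean"), ("is_delete_marker", "boolean"),
    ("size", "bigint"), ("last_modified_date", "timestamp"),
    ("e_tag", "string"), ("storage_class", "string"),
]

# Fields actually present in the ORC file, per the orcfiledump output above.
orc_fields = [
    ("bucket", "string"), ("key", "string"), ("size", "bigint"),
    ("last_modified_date", "timestamp"), ("e_tag", "string"),
    ("storage_class", "string"), ("is_multipart_uploaded", "boolean"),
    ("replication_status", "string"), ("encryption_status", "string"),
]

# With positional mapping, table column i is read from ORC field i.
for (tname, ttype), (oname, otype) in zip(table_cols, orc_fields):
    flag = "" if ttype == otype else "  <-- type mismatch"
    print(f"table {tname} ({ttype}) <- orc {oname} ({otype}){flag}")
```

Position 2 pairs the table's version_id (varchar) with the file's size field (LONG) — exactly the HIVE_BAD_DATA complaint — and the table's size column no longer points at the file's size field at all, which is consistent with the garbage values the first query returned.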

The right table structure for Athena (here, for an all-versions inventory of an encrypted bucket) is then:

CREATE EXTERNAL TABLE inventory(
  bucket string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://............/hive'
  TBLPROPERTIES ('has_encrypted_data'='true');
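After recreating the table you still need to register the dt partitions before queries return any rows. A typical follow-up looks like this (the dt value below is an illustrative placeholder; S3 Inventory names its partitions after the delivery timestamp):

```sql
-- Discover and register every dt=... partition under the table location
MSCK REPAIR TABLE inventory;

-- Or register a single inventory delivery explicitly
ALTER TABLE inventory ADD PARTITION (dt = '2018-01-21-08-00')
  LOCATION 's3://............/hive/dt=2018-01-21-08-00';

-- Sanity check: size should now be a plausible byte count
SELECT key, size FROM inventory LIMIT 10;
```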