How to read quoted CSV with NULL values into Amazon Athena

Mikolaj picture Mikolaj · Jun 6, 2018 · Viewed 11k times · Source

I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:

CSV:

id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"

CREATE TABLE DEFINITION:

CREATE EXTERNAL TABLE schema.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT 
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'

CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.

I tried making the CSV look like this (quote empty string):

"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"

But it's not working.

Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.

Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.

It works, when you specify all columns as STRING but that's not what I need.

Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?

Answer

Zerodf picture Zerodf · Jun 7, 2018

Quick and dirty way to handle these data:

CSV:

id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"

DDL:

CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;

The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.

I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.

I will work on the regex later.

Edit:

Regex as promised:

CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;

enter image description here

Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.

Further Reading:

Stackoverflow - remove surrounding quotes from fields while loading data into hive

Athena - OpenCSVSerDe for Processing CSV