Hive External Table Skip First Row

Rick Gittins picture Rick Gittins · Apr 1, 2013 · Viewed 111.7k times · Source

I am using Cloudera's version of Hive and trying to create an external table over a csv file that contains the column names in the first column. Here is the code that I am using to do that.

CREATE EXTERNAL TABLE Test ( 
  RecordId int, 
  FirstName string, 
  LastName string 
) 
ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' 
WITH SerDeProperties (  
  "separatorChar" = ","
) 
STORED AS TEXTFILE 
LOCATION '/user/File.csv'

Sample Data

RecordId,FirstName,LastName
1,"John","Doe"
2,"Jane","Doe"

Can anyone help me with how to skip the first row or do I need to add an intermediate step?

Answer

Daniel Koverman picture Daniel Koverman · Apr 2, 2013

Header rows in data are a perpetual headache in Hive. Short of modifying the Hive source, I believe you can't get away without an intermediate step. (Edit: This is no longer true, see update below)

Unfortunately, that answers you question. I'll throw in some ideas for the intermediate step for completeness.

You can get away without an extra step in your data load if you are willing to filter out the header row on every query that touches the table. Unfortunately this adds an extra set just about everywhere else. And you will have to get clever/messy when the header row violates your schema. If you go with this approach, you might consider writing a custom SerDe that makes this row easier to filter. Unfortunately, SerDe's cannot remove the row entirely (or that might form a possible solution), they must return something like null. I've never seen this approach taken in practice to deal with header rows since it makes reading a pain, and reading tends to be much more common than writing. It might have a place if you are dealing with one-of tables or if the header row is just one row among many malformed rows.

You could do this filtering once with variations on deleting that first row in data load. A WHERE clause in an INSERT statement would do it. You could use utilities like sed to get rid of it. I've seen both approaches taken. There are trade-offs between which approach you take and neither is the one true way to deal with header rows. Unfortunately, both these approaches take time and require temporary duplication of the data. If you absolutely need the header row for another application, the duplication would be permanent.

Update:

From Hive v0.13.0, you can use skip.header.line.count. You could also specify the same while creating the table. For example:

create external table testtable (name string, message string)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
location '/testtable'
tblproperties ("skip.header.line.count"="1");