Following is the gist of my problem.
Env: Hadoop 2 (CDH5.1) database: oracle 11g
Scenarios: I'm sqooping fact and dimension tables from the database into hdfs. Initially, I had challenges in handling nulls (which was handled using --null-string and --non-null-string) which was set to \N as per the recommendation. Everything was fine when the hive table that was built had string fields even for date and numerics.
Solution so far Based on a recommendation, I move to importing using the Avro format. I've built the hive table on the avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required type like dates to be dates/timestamps, numerics to be int/bigint etc. After the sqooping the avro schema created had converted all date fields to long and the hive table show bigint for those columns.
I'm confused around how sqoop is handling nulls and how those are to be handled in hive/hdfs MR etc.
Could you anybody suggest any practice that has been adopted that could be leveraged?
Thanks Venkatesh
It was a problem for me too. When I improted schema from parquet tables.. as Parquet stores timestamp as bigint. So I guess the underlying problem is parquet that does not have a separate datatype to store timestamp. Don't use AVRO very often, but I think it is true for AVRO too. So if you sqoop from Oracle date/timestamp into a set of parquet/avro files, then storage type (bigint) is how it is stored, not how you want to access it as (timestamp/date).
That time is stored as number of milliseconds from UNIX epoch time (Jan 1st 1970). There are Hive/Spark/Impala functions from_unixtime() that take number of seconds so the solution is to convert those ms values to s resolution:
SELECT ..
, from_unixtime(cast(bigint_column/1000 as bigint))
So you will see timestamps like:
1999-04-14 06:00:00
1999-04-15 06:00:00
Notice 6 hours shift. In my case original Oracle's data type was DATE without any time part (00:00:00), but I got time shifted by 06 hours because of my timezone (MST). So to get exact dates:
SELECT ..
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))
which resulted in:
1999-04-14 00:00:00
1999-04-15 00:00:00
ps. "Data Type Considerations for Parquet Tables" http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1 :
INT96 -> TIMESTAMP