Difference between Avro data file and SequenceFile with respect to Apache Sqoop

SparkOn · Jun 16, 2014

From Sqoop's perspective, what is the difference between importing a relational table as a sequence file, like this:

sqoop import --connect connectionString \
--username userName -P --table tableName \
--as-sequencefile

and importing it as an Avro data file, like this:

sqoop import --connect connectionString \
--username userName -P --table tableName \
--as-avrodatafile

What is the actual difference between a sequence file and an Avro data file?

Answer

dpsdce · Jun 16, 2014

SequenceFiles are a binary format that stores individual records in custom record-specific data types. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns) or data that will be principally manipulated by custom MapReduce programs. Reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed.

Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.
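To make the versioning point concrete, here is a minimal sketch of the idea behind Avro's schema resolution, written as a plain-Python toy. It does not use the Avro library; the `resolve` helper, the `READER_FIELDS` list, and the sample records are all hypothetical names invented for illustration. The assumption is a table that gained an `email` column and lost a `fax` column between two imports.

```python
# Toy model of Avro-style schema resolution. This is NOT the Avro library;
# resolve(), READER_FIELDS, and the sample records are hypothetical.

def resolve(record, reader_fields):
    """Project a record written with one schema onto the reader's schema.

    Fields are matched by name. A field missing from the record takes the
    reader's default; a field the reader does not declare is dropped.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Reader schema after the table changed: "email" was added, "fax" was removed.
READER_FIELDS = [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": None},
]

# A record from the earlier import (old columns) and one from the new import.
old_record = {"id": 1, "name": "Ada", "fax": "555-0100"}
new_record = {"id": 2, "name": "Grace", "email": "grace@example.com"}

rows = [resolve(r, READER_FIELDS) for r in (old_record, new_record)]
for row in rows:
    print(row)
```

Both records come out with the same three fields, which is why old and new Avro data files can be processed together. A SequenceFile written by Sqoop has no comparable mechanism: its records are tied to the record class generated at import time.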

Here's a comparison by Doug Cutting himself:

http://www.quora.com/What-are-the-advantages-of-Avros-object-container-file-format-over-the-SequenceFile-container-format