I have a Hive table that will hold billions of records. It's time-series data, so the partitioning is per minute, and each minute holds around 1 million records.
I have a few fields in my table: VIN number (17 chars), Status (2 chars) ... etc.
So my question is: during table creation, if I choose Varchar(X) vs String, is there any storage or performance problem?
A few limitations of varchar are listed at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string:
If we provide more than x characters, it will silently truncate, so keeping it string will be future-proof (see the truncation sketch after this list).
Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.
There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.
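To make the first limitation concrete, here is a minimal sketch of the truncation behavior using Hive's own HiveVarchar runtime type from org.apache.hadoop.hive.common.type (the class name TruncationDemo and the VIN value are made up for illustration):

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class TruncationDemo {
    public static void main(String[] args) {
        // A VIN is 17 characters; suppose the column was declared varchar(10).
        HiveVarchar v = new HiveVarchar("1HGCM82633A004352", 10);
        // Prints "1HGCM82633" -- the trailing characters are silently dropped,
        // just as they would be when stored in a varchar(10) column.
        System.out.println(v.getValue());
    }
}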
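And for the second limitation quoted above: a non-generic UDF would receive the varchar already converted to string, whereas a GenericUDF can consume the varchar type directly. A minimal sketch, assuming a single varchar argument (the class VarcharLength is hypothetical, not part of Hive; a production UDF would also validate the argument's type before casting):

import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.HiveVarcharObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class VarcharLength extends GenericUDF {
    private HiveVarcharObjectInspector inputOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("varchar_length expects exactly one argument");
        }
        // Assumes the caller passes a varchar column.
        inputOI = (HiveVarcharObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object arg = arguments[0].get();
        if (arg == null) {
            return null;
        }
        // The varchar arrives as a HiveVarchar, not as a plain String.
        HiveVarchar v = inputOI.getPrimitiveJavaObject(arg);
        return v.getValue().length();
    }

    @Override
    public String getDisplayString(String[] children) {
        return "varchar_length(" + children[0] + ")";
    }
}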
What cost do I have to pay for using string instead of varchar, in terms of storage and performance?
Let's try to understand this from how it is implemented in the API, starting with org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter. Here is where the magic begins:
private DataWriter createWriter(ObjectInspector inspector, Type type) {
    // ... switch over the column's primitive category; other cases elided ...
    case STRING:
        return new StringDataWriter((StringObjectInspector) inspector);
    case VARCHAR:
        return new VarcharDataWriter((HiveVarcharObjectInspector) inspector);
}
The createWriter method of the DataWritableWriter class checks the datatype of the column, i.e. either varchar or string, and accordingly creates the writer class for that type.
Now let's move on to the VarcharDataWriter class.
private class VarcharDataWriter implements DataWriter {
    private HiveVarcharObjectInspector inspector;

    public VarcharDataWriter(HiveVarcharObjectInspector inspector) {
        this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
        // Unwrap the HiveVarchar to its underlying String, then write it as binary.
        String v = inspector.getPrimitiveJavaObject(value).getValue();
        recordConsumer.addBinary(Binary.fromString(v));
    }
}
or to the StringDataWriter class:
private class StringDataWriter implements DataWriter {
    private StringObjectInspector inspector;

    public StringDataWriter(StringObjectInspector inspector) {
        this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
        // The value is already a plain String; write it as binary directly.
        String v = inspector.getPrimitiveJavaObject(value);
        recordConsumer.addBinary(Binary.fromString(v));
    }
}
The addBinary method in both classes writes the value as a UTF-8 encoded binary (Binary.fromString). Note that the two writers differ only in how they extract the String: the varchar writer first unwraps a HiveVarchar via getValue(), while the string writer gets the String directly. The bytes handed to Parquet are encoded the same way in both cases.
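We can sanity-check this with a few lines against the Parquet Binary API (a minimal sketch; the class name EncodingCheck is made up, it assumes parquet-column and hive-common on the classpath, and note that older Hive versions bundle these classes under the pre-org.apache parquet package names):

import java.util.Arrays;
import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.parquet.io.api.Binary;

public class EncodingCheck {
    public static void main(String[] args) {
        String vin = "1HGCM82633A004352";
        // StringDataWriter path: the String goes straight to Binary.fromString.
        Binary fromString = Binary.fromString(vin);
        // VarcharDataWriter path: unwrap the HiveVarchar first, then encode.
        Binary fromVarchar = Binary.fromString(new HiveVarchar(vin, 17).getValue());
        // Identical UTF-8 bytes either way.
        System.out.println(Arrays.equals(fromString.getBytes(), fromVarchar.getBytes())); // true
    }
}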
Short answer to the question: string and varchar end up UTF-8 encoded the same way, so storage-wise they may vary only slightly (a varchar value is capped at x characters before it is written). Performance-wise, as per my understanding, Hive is a schema-on-read tool. The ParquetRecordReader knows how to read a record; it just reads bytes. So there won't be any performance difference due to the varchar or string datatype.