Hive - Varchar vs String: is there any advantage if the storage format is Parquet?

Dave · Jul 19, 2017 · Viewed 13k times

I have a Hive table which will hold billions of records. It's time-series data, so the table is partitioned per minute, and each minute will hold around 1 million records.

I have a few fields in my table: VIN number (17 chars), Status (2 chars), etc.

So my question is: during table creation, if I choose Varchar(X) over String, is there any storage or performance problem?

A few limitations of varchar, per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string, are:

  1. If we provide more than "x" characters, it will silently truncate, so keeping it string will be future-proof (see the sketch after this list).

  2. Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.

  3. There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.
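
To see the silent truncation concretely, here is a minimal sketch (my own illustration, not from the Hive docs: it assumes hive-common on the classpath and uses a made-up 20-character input) using the same HiveVarchar class the Parquet writer unwraps below:

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class VarcharTruncationDemo {
    public static void main(String[] args) {
        // Hypothetical 20-character input forced into varchar(17):
        // the extra characters are dropped with no error or warning.
        HiveVarchar vin = new HiveVarchar("1HGCM82633A0042511XX", 17);
        System.out.println(vin.getValue());           // 1HGCM82633A004251
        System.out.println(vin.getCharacterLength()); // 17
    }
}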

What is the cost I would pay for using string instead of varchar, in terms of storage and performance?

Answer

sumitya · Jul 21, 2017

Let's try to understand this from how it is implemented in the API:

org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter 

Here is where the magic begins:

private DataWriter createWriter(ObjectInspector inspector, Type type) {
    // switch on the column's primitive category (other cases elided)
    switch (...) {
    ........
    case STRING:
        return new StringDataWriter((StringObjectInspector)inspector);
    case VARCHAR:
        return new VarcharDataWriter((HiveVarcharObjectInspector)inspector);
    ........
    }
}

The createWriter method of the DataWritableWriter class checks the datatype of the column, i.e. either varchar or string, and accordingly creates the writer class for that type.

Now let's move on to the VarcharDataWriter class:

private class VarcharDataWriter implements DataWriter {
    private HiveVarcharObjectInspector inspector;

    public VarcharDataWriter(HiveVarcharObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
      // Unwrap the HiveVarchar to a plain java.lang.String,
      // then write it to Parquet as a UTF-8 encoded binary.
      String v = inspector.getPrimitiveJavaObject(value).getValue();
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

or to the StringDataWriter class:

private class StringDataWriter implements DataWriter {
    private StringObjectInspector inspector;

    public StringDataWriter(StringObjectInspector inspector) {
      this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
      // Already a plain java.lang.String; written to Parquet as a
      // UTF-8 encoded binary, exactly as in the varchar case above.
      String v = inspector.getPrimitiveJavaObject(value);
      recordConsumer.addBinary(Binary.fromString(v));
    }
  }

The addBinary method in both classes writes the value to Parquet as a binary, and Binary.fromString performs the same UTF-8 encoding in both cases. So the only difference between the two writers is that the varchar path first unwraps the HiveVarchar via getValue(); the bytes handed to Parquet are identical.
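
A quick sketch to confirm this (my own example, assuming parquet-column on the classpath; the VIN value is hypothetical): the bytes produced by Binary.fromString are exactly the UTF-8 encoding of the Java string, regardless of whether the column was declared string or varchar.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.parquet.io.api.Binary;

public class EncodingDemo {
    public static void main(String[] args) {
        String vin = "1HGCM82633A004251"; // hypothetical VIN value

        // Binary.fromString is the call both writers end up making.
        byte[] viaParquet = Binary.fromString(vin).getBytes();
        byte[] viaUtf8 = vin.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(viaParquet, viaUtf8)); // true
    }
}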

Short answer to the question: string and varchar end up stored the same way in Parquet, as UTF-8 encoded binary, so storage-wise there is no real difference. Performance-wise, Hive is a schema-on-read tool; ParquetRecordReader knows how to read a record and just reads bytes. So there won't be any performance difference due to the varchar or string datatype.