I have a Hive table that will hold billions of records. It's time-series data, so the partitioning is per minute, and each minute holds around 1 million records.
I have a few fields in my table: VIN number (17 chars), Status (2 chars) ... etc.
So my question is: during table creation, if I choose Varchar(X) vs String, is there any storage or performance problem?
A few limitations of varchar are listed at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string:
If we provide more than x characters, it will silently truncate, so keeping it string will be future-proof (see the truncation sketch after this list).
Non-generic UDFs cannot directly use varchar type as input arguments or return values. String UDFs can be created instead, and the varchar values will be converted to strings and passed to the UDF. To use varchar arguments directly or to return varchar values, create a GenericUDF.
There may be other contexts which do not support varchar, if they rely on reflection-based methods for retrieving type information. This includes some SerDe implementations.
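To make the first limitation concrete, here is a minimal sketch of the truncation behavior using Hive's own HiveVarchar runtime type from org.apache.hadoop.hive.common.type (the class name TruncationDemo and the VIN value are made up for illustration):

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class TruncationDemo {
    public static void main(String[] args) {
        // A VIN is 17 characters; suppose the column was declared varchar(10).
        HiveVarchar v = new HiveVarchar("1HGCM82633A004352", 10);
        // Prints "1HGCM82633" -- the trailing characters are silently dropped,
        // just as they would be when stored in a varchar(10) column.
        System.out.println(v.getValue());
    }
}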
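And for the second limitation quoted above: a non-generic UDF would receive the varchar already converted to string, whereas a GenericUDF can consume the varchar type directly. A minimal sketch, assuming a single varchar argument (the class VarcharLength is hypothetical, not part of Hive; a production UDF would also validate the argument's type before casting):

import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.HiveVarcharObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class VarcharLength extends GenericUDF {
    private HiveVarcharObjectInspector inputOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("varchar_length expects exactly one argument");
        }
        // Assumes the caller passes a varchar column.
        inputOI = (HiveVarcharObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object arg = arguments[0].get();
        if (arg == null) {
            return null;
        }
        // The varchar arrives as a HiveVarchar, not as a plain String.
        HiveVarchar v = inputOI.getPrimitiveJavaObject(arg);
        return v.getValue().length();
    }

    @Override
    public String getDisplayString(String[] children) {
        return "varchar_length(" + children[0] + ")";
    }
}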
What cost do I have to pay for using string instead of varchar, in terms of storage and performance?
Let's try to understand this from how it is implemented in the API, starting with org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter. Here is where the magic begins:
private DataWriter createWriter(ObjectInspector inspector, Type type) {
    // ... switch over the column's primitive category; other cases elided ...
    case STRING:
        return new StringDataWriter((StringObjectInspector) inspector);
    case VARCHAR:
        return new VarcharDataWriter((HiveVarcharObjectInspector) inspector);
}
The createWriter method of the DataWritableWriter class checks the datatype of the column, i.e. either varchar or string, and accordingly creates the writer class for that type.
Now let's move on to the VarcharDataWriter class.
private class VarcharDataWriter implements DataWriter {
    private HiveVarcharObjectInspector inspector;

    public VarcharDataWriter(HiveVarcharObjectInspector inspector) {
        this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
        // Unwrap the HiveVarchar to its underlying String, then write it as binary.
        String v = inspector.getPrimitiveJavaObject(value).getValue();
        recordConsumer.addBinary(Binary.fromString(v));
    }
}
or to the StringDataWriter class:
private class StringDataWriter implements DataWriter {
    private StringObjectInspector inspector;

    public StringDataWriter(StringObjectInspector inspector) {
        this.inspector = inspector;
    }

    @Override
    public void write(Object value) {
        // The value is already a plain String; write it as binary directly.
        String v = inspector.getPrimitiveJavaObject(value);
        recordConsumer.addBinary(Binary.fromString(v));
    }
}
The addBinary method in both classes writes the value as a UTF-8 encoded binary (Binary.fromString). Note that the two writers differ only in how they extract the String: the varchar writer first unwraps a HiveVarchar via getValue(), while the string writer gets the String directly. The bytes handed to Parquet are encoded the same way in both cases.
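We can sanity-check this with a few lines against the Parquet Binary API (a minimal sketch; the class name EncodingCheck is made up, it assumes parquet-column and hive-common on the classpath, and note that older Hive versions bundle these classes under the pre-org.apache parquet package names):

import java.util.Arrays;
import org.apache.hadoop.hive.common.type.HiveVarchar;
import org.apache.parquet.io.api.Binary;

public class EncodingCheck {
    public static void main(String[] args) {
        String vin = "1HGCM82633A004352";
        // StringDataWriter path: the String goes straight to Binary.fromString.
        Binary fromString = Binary.fromString(vin);
        // VarcharDataWriter path: unwrap the HiveVarchar first, then encode.
        Binary fromVarchar = Binary.fromString(new HiveVarchar(vin, 17).getValue());
        // Identical UTF-8 bytes either way.
        System.out.println(Arrays.equals(fromString.getBytes(), fromVarchar.getBytes())); // true
    }
}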
Short answer to the question: string and varchar end up UTF-8 encoded the same way, so storage-wise they may vary only slightly (a varchar value is capped at x characters before it is written). Performance-wise, as per my understanding, Hive is a schema-on-read tool. The ParquetRecordReader knows how to read a record; it just reads bytes. So there won't be any performance difference due to the varchar or string datatype.