What is the advantage of storing the schema in Avro?

user2250246 · Dec 13, 2013

We need to serialize some data to put into Solr as well as Hadoop.

I am evaluating serialization tools for this purpose.

The top two in my list are Gson and Avro.

As far as I understand, Avro = Gson + Schema-In-JSON

If that is correct, I do not see why Avro is so popular for Solr/Hadoop.

I have searched a lot on the Internet, but cannot find a single correct answer for this.

Everywhere it says Avro is good because it stores the schema. My question is: what do I do with that schema?

It may be good for very large objects in Hadoop, where a single object is stored in multiple file blocks, so that storing the schema with each part helps to analyze it better. But even in that case, the schema can be stored separately, and just a reference to it is sufficient to describe the data. I see no reason why the schema should be part of each and every piece.

If someone can give me a good use case where Avro helped them and Gson/Jackson were insufficient for the purpose, it would be really helpful.

Also, the official documentation at the Avro site says that we need to give a schema to Avro to help it produce Schema+Data. My question is: if the schema is an input and the same schema is sent to the output along with a JSON representation of the data, then what extra is being achieved by Avro? Can I not do that myself by serializing an object using JSON, adding my input schema, and calling it Avro?

I am really confused by this!

Answer

Vishal John · Dec 13, 2013
  1. Evolving schemas

Suppose you initially designed a schema like this for your Employee class:

{"type": "record", "name": "Employee", "fields": [
  {"name": "emp_name", "type": "string"},
  {"name": "dob", "type": "string"},
  {"name": "age", "type": "int"}
]}

Later you realized that age is redundant and removed it from the schema:

{"type": "record", "name": "Employee", "fields": [
  {"name": "emp_name", "type": "string"},
  {"name": "dob", "type": "string"}
]}

What about the records that were serialized and stored before this schema change? How will you read those records back?

That's why the Avro reader/deserializer asks for both the reader schema and the writer schema. Internally it performs schema resolution, i.e., it adapts data written with the old (writer) schema to the new (reader) schema.

See the section "Resolution using action symbols" at http://avro.apache.org/docs/1.7.2/api/java/org/apache/avro/io/parsing/doc-files/parsing.html

In this case it performs a skip action, i.e., it reads past "age" and discards the value. It can also handle cases such as a field changing from int to long.

This is a very nice article explaining schema evolution - http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
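To make the skip action concrete, here is a minimal sketch in plain Python. It is a toy model, not the Avro library: the function name `resolve`, the sample values, and the choice to store a record as a bare list in writer field order are all assumptions made for illustration. The point it tries to show is that the stored data carries no field names, so the writer schema is needed to interpret it, and the reader schema decides what is kept.

```python
# Toy model of Avro-style schema resolution (NOT the Avro library).
# A stored record is a bare list of values in writer-schema field order,
# so the writer schema is required to know what each position means.

WRITER_SCHEMA = {"type": "record", "name": "Employee", "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"},
]}

READER_SCHEMA = {"type": "record", "name": "Employee", "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
]}


def resolve(stored_values, writer_schema, reader_schema):
    """Return a dict shaped like reader_schema from data written with writer_schema."""
    reader_fields = {f["name"]: f for f in reader_schema["fields"]}
    record = {}
    # Walk the stored values in writer order and keep only what the reader wants.
    for field, value in zip(writer_schema["fields"], stored_values):
        if field["name"] in reader_fields:
            record[field["name"]] = value
        # Otherwise: the "skip" case, the value is read past and discarded.
    # Fields the reader has but the writer lacked must come from a default.
    for name, field in reader_fields.items():
        if name not in record:
            if "default" not in field:
                raise ValueError(f"no value or default for field {name!r}")
            record[name] = field["default"]
    return record


# Written with the old schema; the sample values are made up.
old_record = ["Alice", "1980-01-01", 33]
new_view = resolve(old_record, WRITER_SCHEMA, READER_SCHEMA)
print(new_view)
```

Swapping the two schemas (old reader, new writer) raises an error in this sketch because "age" has no default, which mirrors why fields added to a schema are normally given default values.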

  2. The schema is stored only once for multiple records in a single file.

  3. Size: each record is encoded in very few bytes, because the binary encoding carries no field names.
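The last two points can be illustrated with a rough size comparison. The sketch below hand-encodes one Employee record the way the Avro specification describes its binary encoding (zig-zag variable-length integers, length-prefixed UTF-8 strings, field values back to back in schema order) and compares it with the same record as JSON. The helper names, sample values, and record count are assumptions, and the "file" is only a size estimate: it ignores the magic bytes, metadata, and sync markers of a real Avro container file.

```python
import json

SCHEMA = {"type": "record", "name": "Employee", "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"},
]}


def encode_long(n):
    """Zig-zag plus variable-length encoding, as used for Avro int/long."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, high bit set means "more bytes follow"
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_employee(emp_name, dob, age):
    # Field values in schema order; no field names appear in the payload.
    return encode_string(emp_name) + encode_string(dob) + encode_long(age)


record = {"emp_name": "Alice", "dob": "1980-01-01", "age": 33}  # made-up sample
binary = encode_employee(**record)
as_json = json.dumps(record).encode("utf-8")

# Toy "file": the schema written once up front, then n binary records,
# versus n JSON records that each repeat every field name.
n = 1000
schema_bytes = json.dumps(SCHEMA).encode("utf-8")
binary_file_size = len(schema_bytes) + n * len(binary)
json_file_size = n * len(as_json)

print(len(binary), len(as_json))
print(binary_file_size, json_file_size)
```

With identical records the per-record saving comes entirely from dropping field names and punctuation; the schema is a fixed one-time cost, so it matters less the more records the file holds.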