Nesting Avro schemas

Tianxiang Xiong · Nov 28, 2016 · Viewed 12.3k times

According to this question on nesting Avro schemas, the right way to nest a record schema is as follows:

{
    "name": "person",
    "type": "record",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {
            "name": "address",
            "type": {
                        "type" : "record",
                        "name" : "AddressUSRecord",
                        "fields" : [
                            {"name": "streetaddress", "type": "string"},
                            {"name": "city", "type": "string"}
                        ]
                    }
        }
    ]
}

I don't like giving the field the name address and having to give a different name (AddressUSRecord) to the field's schema. Can I give the field and schema the same name, address?

What if I want to use the AddressUSRecord schema in multiple other schemas, not just person? If I want to use AddressUSRecord in another schema, let's say business, do I have to name it something else?

Ideally, I'd like to define AddressUSRecord in a separate schema, then let the type of address reference AddressUSRecord. However, it's not clear that Avro 1.8.1 supports this out-of-the-box. This 2014 article shows that sub-schemas need to be handled with custom code. What is the best way to define reusable schemas in Avro 1.8.1?

Note: I'd like a solution that works with Confluent Inc.'s Schema Registry. There's a Google Groups thread that seems to suggest that Schema Registry does not play nice with schema references.

Answer

Niel Drummond · Nov 29, 2016

Can I give the field and schema the same name, address?

Yes, you can name the record with the same name as the field name.

What if I want to use the AddressUSRecord schema in multiple other schemas, not just person?

There are a couple of techniques for reusing a schema. The Avro schema parsers (JVM and others) can parse several schemas against a shared set of known names, usually through a names parameter. In Java, a single Schema.Parser instance remembers every named type it has parsed, so later parse calls on the same instance can refer to those types by name.

You can then define the dependent schema as a named type:

{
  "type": "record",
  "name": "Address",
  "fields": [
    {
      "name": "streetaddress",
      "type": "string"
    },
    {
      "name": "city",
      "type": "string"
    }
  ]
}

And run this through the parser before the parent schema:

{
  "name": "person",
  "type": "record",
  "fields": [
    {
      "name": "firstname",
      "type": "string"
    },
    {
      "name": "lastname",
      "type": "string"
    },
    {
      "name": "address",
      "type": "Address"
    }
  ]
}

Incidentally, this allows you to parse from separate files.
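To make the name lookup concrete, here is a minimal sketch in plain Python. It does not use the Avro library; resolve and known are hypothetical names, and the sketch assumes simple names with no namespace handling. It keeps a table of definitions parsed earlier and swaps the first reference to each name for its full definition, leaving later references as plain names, since Avro lets a name be defined only once per schema.

```python
import json

ADDRESS = json.loads("""
{"type": "record", "name": "Address", "fields": [
  {"name": "streetaddress", "type": "string"},
  {"name": "city", "type": "string"}]}
""")

PERSON = json.loads("""
{"type": "record", "name": "person", "fields": [
  {"name": "firstname", "type": "string"},
  {"name": "lastname", "type": "string"},
  {"name": "address", "type": "Address"}]}
""")


def resolve(schema, known, defined=None):
    """Return a copy of `schema` with references to `known` types inlined.

    Hypothetical helper, not part of any Avro API. Only the first use of a
    name is inlined; later uses stay as name references.
    """
    defined = set() if defined is None else defined
    if isinstance(schema, str):
        if schema in known and schema not in defined:
            return resolve(known[schema], known, defined)
        return schema  # primitive type, or a name that is already defined
    if isinstance(schema, list):  # union: resolve every branch
        return [resolve(branch, known, defined) for branch in schema]
    out = dict(schema)  # shallow copy so the input is left untouched
    kind = out.get("type")
    if kind in ("record", "enum", "fixed"):
        defined.add(out["name"])
    if kind == "record":
        out["fields"] = [
            {**field, "type": resolve(field["type"], known, defined)}
            for field in out["fields"]
        ]
    elif kind == "array":
        out["items"] = resolve(out["items"], known, defined)
    elif kind == "map":
        out["values"] = resolve(out["values"], known, defined)
    return out


# "Parse" Address first, then resolve person against the known names.
known = {ADDRESS["name"]: ADDRESS}
resolved_person = resolve(PERSON, known)
print(json.dumps(resolved_person, indent=2))
```

The output is a self-contained person schema with Address nested inside it, which is the same shape as the nested example in the question. With a real Avro parser you would not need this step; the sketch only shows why the dependency has to be parsed first.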

Alternatively, you can parse a single union schema that references schemas in the same way:

[
  {
    "type": "record",
    "name": "Address",
    "fields": [
      {
        "name": "streetaddress",
        "type": "string"
      },
      {
        "name": "city",
        "type": "string"
      }
    ]
  },
  {
    "type": "record",
    "name": "person",
    "fields": [
      {
        "name": "firstname",
        "type": "string"
      },
      {
        "name": "lastname",
        "type": "string"
      },
      {
        "name": "address",
        "type": "Address"
      }
    ]
  }
]

I'd like a solution that works with Confluent Inc.'s Schema Registry.

The schema registry does not support parsing schemas separately, but it does support the latter example of parsing into a union type.
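If the definitions live in separate places, the union can be assembled just before it is submitted. The following is a rough sketch using only the Python standard library; build_union is a hypothetical helper, and how the resulting string is sent to the Schema Registry is left out because it depends on your client.

```python
import json

address = {
    "type": "record",
    "name": "Address",
    "fields": [
        {"name": "streetaddress", "type": "string"},
        {"name": "city", "type": "string"},
    ],
}

person = {
    "type": "record",
    "name": "person",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "address", "type": "Address"},
    ],
}


def build_union(*schemas):
    """Combine named schemas into one union schema, as a JSON string.

    Hypothetical helper. Pass dependencies first (Address before person),
    because a name must be defined before it is referenced.
    """
    seen = set()
    for schema in schemas:
        if schema["name"] in seen:
            raise ValueError("duplicate definition of " + schema["name"])
        seen.add(schema["name"])
    return json.dumps(list(schemas), indent=2)


union_json = build_union(address, person)
print(union_json)
```

The order of the arguments matters: person refers to Address by name, so Address goes first, matching the union example above.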