Require kryo serialization in Spark (Scala)

pheaver picture pheaver · Jul 13, 2015 · Viewed 23.1k times · Source

I have kryo serialization turned on with this:

conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )

I want to ensure that a custom class is serialized using kryo when shuffled between nodes. I can register the class with kryo this way:

conf.registerKryoClasses(Array(classOf[Foo]))

As I understand it, this does not actually guarantee that kyro serialization is used; if a serializer is not available, kryo will fall back to Java serialization.

To guarantee that kryo serialization happens, I followed this recommendation from the Spark documentation:

conf.set("spark.kryo.registrationRequired", "true")

But this causes IllegalArugmentException to be thrown ("Class is not registered") for a bunch of different classes which I assume Spark uses internally, for example the following:

org.apache.spark.util.collection.CompactBuffer
scala.Tuple3

Surely I do not have to manually register each of these individual classes with kryo? These serializers are all defined in kryo, so is there a way to automatically register all of them?

Answer

Daniel Darabos picture Daniel Darabos · Jul 14, 2015

As I understand it, this does not actually guarantee that kyro serialization is used; if a serializer is not available, kryo will fall back to Java serialization.

No. If you set spark.serializer to org.apache.spark.serializer. KryoSerializer then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback.

So what is this Kryo registration then?

When Kryo serializes an instance of an unregistered class it has to output the fully qualified class name. That's a lot of characters. Instead, if a class has been pre-registered, Kryo can just output a numeric reference to this class, which is just 1-2 bytes.

This is especially crucial when each row of an RDD is serialized with Kryo. You don't want to include the same class name for each of a billion rows. So you pre-register these classes. But it's easy to forget to register a new class and then you're wasting bytes again. The solution is to require every class to be registered:

conf.set("spark.kryo.registrationRequired", "true")

Now Kryo will never output full class names. If it encounters an unregistered class, that's a runtime error.

Unfortunately it's hard to enumerate all the classes that you are going to be serializing in advance. The idea is that Spark registers the Spark-specific classes, and you register everything else. You have an RDD[(X, Y, Z)]? You have to register classOf[scala.Tuple3[_, _, _]].

The list of classes that Spark registers actually includes CompactBuffer, so if you see an error for that, you're doing something wrong. You are bypassing the Spark registration procedure. You have to use either spark.kryo.classesToRegister or spark.kryo.registrator to register your classes. (See the config options. If you use GraphX, your registrator should call GraphXUtils. registerKryoClasses.)