here is my code:
val bg = imageBundleRDD.first() //bg:[Text, BundleWritable]
val res= imageBundleRDD.map(data => {
val desBundle = colorToGray(bg._2) //lineA:NotSerializableException: org.apache.hadoop.io.Text
//val desBundle = colorToGray(data._2) //lineB:everything is ok
(data._1, desBundle)
})
println(res.count)
lineB goes well but lineA shows that:org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I try to use use Kryo to solve my problem but it seems nothing has been changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[Text])
kryo.register(classOf[BundleWritable])
}
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys. I found this post helpful:
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text,VideoRecording>,String,VideoRecording>() {
@Override
public Tuple2<String, VideoRecording> call(
Tuple2<Text, VideoRecording> kv) throws Exception {
// Necessary to copy value as Hadoop chooses to reuse objects
VideoRecording vr = new VideoRecording(kv._2);
return new Tuple2(kv._1.toString(), vr);
}
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.