How to set and get static variables from Spark?

diplomaticguru · Apr 16, 2015 · Viewed 14.1k times

I have a class like this:

public class Test {
    private static String name;

    public static String getName() {
        return name;
    }

    public static void setName(String name) {
        Test.name = name;
    }

    public static void print() {
        System.out.println(name);
    }

}

Inside my Spark driver, I'm setting the name like this and then calling the print() method:

public final class TestDriver {

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp");
        // ...
        // ...
        Test.setName("TestName");
        Test.print();
        // ...
    }
}

However, I'm getting a NullPointerException. How do I pass a value to the global variable and use it?

Answer

Daniel Langdon · Apr 17, 2015

OK, there are basically two ways to ship a value known to the master over to the executors:

  1. Put the value inside a closure that is serialized to the executors to perform a task. This is the most common approach and is very simple/elegant (see the first sketch after this list).
  2. Create a broadcast variable with the data. This is good for large immutable data, when you want to guarantee it is sent to each executor only once. It is also good if the same data is used over and over (see the second sketch after this list).
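
To make option 1 concrete, here is a minimal sketch in Java to match the question's code (the RDD contents and class name are made up for illustration). A local variable captured by the lambda is serialized with the closure and shipped to each executor:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class ClosureExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        final String name = "TestName"; // known only on the driver
        sc.parallelize(Arrays.asList("a", "b", "c"))
          .map(x -> name + ": " + x)    // 'name' travels inside the serialized closure
          .collect()                    // results come back to the driver
          .forEach(System.out::println);

        sc.stop();
    }
}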
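
Option 2 looks almost the same, except the value travels as a broadcast variable, which is shipped to each executor at most once and cached there (again a sketch with made-up data):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public final class BroadcastExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        Broadcast<String> nameBc = sc.broadcast("TestName"); // sent to each executor once

        sc.parallelize(Arrays.asList("a", "b", "c"))
          .map(x -> nameBc.value() + ": " + x) // read the broadcast value on the executor
          .collect()
          .forEach(System.out::println);

        sc.stop();
    }
}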

There is no need to use static variables in either case. But if you DO want to have static values available on your executor VMs, you need to do one of these:

  1. If the values are fixed, or the configuration is available on the executor nodes (it lives inside the jar, etc.), then you can have a lazy val, guaranteeing initialization happens only once (see the edit at the bottom of this answer).
  2. You can call mapPartitions() with code that uses one of the two options above, then store the values in your static variable/object. mapPartitions is guaranteed to run only once per partition (much better than once per element) and is good for this kind of thing, such as initializing DB connections (see the sketch after this list).
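
Here is a sketch of that second option in Java, reusing the Test class from the question (the data and partition count are made up). The setup runs once per partition on the executor, and the static state it sets stays visible to that JVM:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class MapPartitionsExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        List<String> out = sc.parallelize(Arrays.asList("a", "b", "c"), 2)
            .mapPartitions(iter -> {
                Test.setName("TestName"); // runs once per partition, on the executor JVM
                List<String> result = new ArrayList<>();
                while (iter.hasNext()) {
                    result.add(Test.getName() + ": " + iter.next());
                }
                return result; // Spark 1.x expects an Iterable here;
                               // on Spark 2.x+ return result.iterator() instead
            })
            .collect();

        out.forEach(System.out::println);
        sc.stop();
    }
}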

Hope this helps!

P.S.: As for your exception: I just don't see it in that code sample; my bet is that it is occurring elsewhere.


Edit for extra clarification: The lazy val solution is simply Scala, no Spark involved...

object MyStaticObject {
  lazy val MyStaticValue = {
    // Call a database, read a file included in the jar, do expensive initialization, etc.
    4
  }
}

Since each executor corresponds to a JVM, MyStaticObject will be initialized on each executor once its class is loaded. The lazy keyword guarantees that MyStaticValue will be initialized only the first time it is actually requested, and will hold its value from then on.
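
Since the question's code is Java, which has no lazy val, a rough equivalent sketch is the initialization-on-demand holder idiom (the expensiveInit name is made up; put whatever initialization you need there):

public class Test {
    // Holder is loaded, and NAME computed, only on the first call to getName(),
    // and exactly once per JVM (i.e., once per executor).
    private static class Holder {
        static final String NAME = expensiveInit();
    }

    private static String expensiveInit() {
        // Call a database, read a file included in the jar, etc.
        return "TestName";
    }

    public static String getName() {
        return Holder.NAME;
    }
}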