How to traverse/iterate a Dataset in Spark Java?

Abhishek Vk · Mar 13, 2017 · Viewed 25.8k times

I am trying to traverse a Dataset to do some string-similarity calculations such as Jaro-Winkler or cosine similarity. I currently convert my Dataset to a list of rows and then traverse it with a for loop, which is not an efficient Spark way to do it. So I am looking for a better approach in Spark.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Sample {

    public static void main(String[] args) {
        // SparkSession replaces the older JavaSparkContext/SQLContext pair
        SparkSession spark = SparkSession.builder()
                .appName("JavaTokenizerExample")
                .master("local[*]")
                .getOrCreate();

        List<Row> data = Arrays.asList(
                RowFactory.create("Mysore", "Mysuru"),
                RowFactory.create("Name", "FirstName"));
        StructType schema = new StructType(new StructField[] {
                new StructField("Word1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("Word2", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> oldDF = spark.createDataFrame(data, schema);
        oldDF.show();
        // Collecting to the driver and looping over the list is what I want to avoid
        List<Row> rowslist = oldDF.collectAsList();
    }
}

I have found many JavaRDD examples, but they are not clear to me. An example for Dataset would help me a lot.
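For reference, the Jaro-Winkler similarity mentioned above can be sketched in plain Java, independent of Spark. This is a minimal illustration of the metric itself (in practice a library implementation such as Apache Commons Text's JaroWinklerSimilarity may be preferable):

```java
public class JaroWinkler {

    // Jaro similarity: counts characters that match within a sliding
    // window and penalizes transposed (out-of-order) matches.
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(len1, len2) / 2 - 1;
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = true;
                    m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count transposed matches: matched characters appearing out of order
        int transposed = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (m1[i]) {
                while (!m2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) transposed++;
                k++;
            }
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - transposed / 2.0) / m) / 3.0;
    }

    // Winkler adjustment: boosts the score for strings sharing a
    // common prefix (up to 4 characters, standard scaling factor 0.1).
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        for (int i = 0; i < max; i++) {
            if (s1.charAt(i) == s2.charAt(i)) prefix++;
            else break;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("Mysore", "Mysuru"));
    }
}
```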

Answer

abaghel · Mar 13, 2017

You can use org.apache.spark.api.java.function.ForeachFunction as shown below.

oldDF.foreach((ForeachFunction<Row>) row -> System.out.println(row));
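If the goal is to compute a similarity score per row rather than just print, a map over the Dataset keeps the computation on the executors instead of collecting to the driver. A hedged sketch, assuming `oldDF` from the question and a hypothetical `similarity(String, String)` helper standing in for whatever metric you use (Jaro-Winkler, cosine, etc.):

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Transform each row into a score without collectAsList();
// `similarity` is a placeholder for your chosen metric.
Dataset<Double> scores = oldDF.map(
        (MapFunction<Row, Double>) row -> similarity(row.getString(0), row.getString(1)),
        Encoders.DOUBLE());
scores.show();
```

The explicit `Encoders.DOUBLE()` argument is required so Spark knows how to serialize the result type of the typed `map`.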