Top "RDD" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

Modify collection inside a Spark RDD foreach

I'm trying to add elements to a map while iterating the elements of an RDD. I'm not getting any errors, …

scala apache-spark rdd
When to use Kryo serialization in Spark?

I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using …

scala apache-spark rdd kryo
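For the Kryo question above, the usual starting point is switching `spark.serializer` to Kryo alongside the settings the asker already uses. A hedged configuration sketch (PySpark shown; the Scala `SparkConf` API is analogous, and this fragment assumes a `pyspark` installation):

```python
from pyspark import SparkConf

# Config fragment: enable Kryo serialization in addition to RDD compression.
# Kryo generally serializes faster and more compactly than Java serialization,
# which matters most for *_SER storage levels and shuffle-heavy jobs.
conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
)
```

Whether Kryo helps on top of `MEMORY_AND_DISK_SER` depends on the workload; the two settings are complementary, since compression operates on the bytes that the serializer produces.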
Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context …

java scala apache-spark rdd apache-spark-dataset
Spark Error: Not enough space to cache partition rdd_8_2 in memory! Free memory is 58905314 bytes

When I run a Spark job using its example code BinaryClassification.scala with my own data, it always shows the …

scala out-of-memory apache-spark rdd
Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am relatively new to Spark and Scala. I am starting with the following dataframe (single column made out of …

scala apache-spark rdd spark-dataframe apache-spark-mllib
How to transpose an RDD in Spark

I have an RDD like this: 1 2 3 4 5 6 7 8 9 It is a matrix. Now I want to transpose the RDD like this: 1 4 7 2 5 8 3 6 9 How …

scala apache-spark rdd
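A common answer pattern for the transpose question is to key every value by its (column, row) position and regroup by column. A minimal local sketch of that keying logic, using plain Python lists instead of an RDD (illustrative only; on a real RDD the three steps would be `zipWithIndex`/`flatMap`, `groupByKey`, and a per-key sort):

```python
from collections import defaultdict

# A small matrix stored as rows, standing in for the RDD in the question.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Step 1 (the flatMap step): tag each value with its (column, (row, value)).
indexed = [(c, (r, v))
           for r, row in enumerate(rows)
           for c, v in enumerate(row)]

# Step 2 (the groupByKey step): collect values by column index.
by_col = defaultdict(list)
for col, (row, val) in indexed:
    by_col[col].append((row, val))

# Step 3: within each column, order by the original row index to
# rebuild the transposed rows.
transposed = [[v for _, v in sorted(by_col[c])] for c in sorted(by_col)]
print(transposed)  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

The per-key sort matters because grouping on a distributed collection does not preserve the original row order.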
Is groupByKey ever preferred over reduceByKey?

I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before …

apache-spark rdd
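The map-side combine this question alludes to can be illustrated without a cluster. A sketch simulating two partitions with plain dicts (toy data and summation as the reduce function are assumptions for illustration):

```python
from collections import defaultdict

# Two "partitions" of (key, value) records, standing in for an RDD.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# reduceByKey-style: combine locally within each partition first
# (the map-side reduce), so fewer records cross the shuffle.
local = []
for part in partitions:
    combined = defaultdict(int)
    for k, v in part:
        combined[k] += v
    local.append(dict(combined))

shuffled_records = sum(len(d) for d in local)  # 4 records shuffled

# Then merge the partial sums on the reduce side.
totals = defaultdict(int)
for d in local:
    for k, v in d.items():
        totals[k] += v

# groupByKey-style would ship every raw record across the shuffle:
# 6 records here versus 4 with the map-side combine.
print(dict(totals))      # {'a': 3, 'b': 3}
print(shuffled_records)  # 4
```

The saving grows with the number of duplicate keys per partition, which is why groupByKey is mainly defensible when the downstream operation genuinely needs all values per key rather than a reduction of them.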
Tips for properly using large broadcast variables?

I'm using a broadcast variable about 100 MB pickled in size, which I'm approximating with: >>> data = list(range(…

python apache-spark pyspark pickle rdd
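For the broadcast question above, a common first step is measuring the pickled size locally before calling `broadcast`, the same way the asker approximates the 100 MB figure. A small runnable sketch (the range is an assumption chosen so the example runs quickly; the exact expression in the question is elided):

```python
import pickle

# Estimate a broadcast variable's serialized size by pickling it locally
# and counting bytes, before shipping it to executors.
data = list(range(1_000_000))
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
size_mb = len(blob) / 1e6
print(f"pickled size: {size_mb:.1f} MB")
```

Knowing the serialized size up front helps decide whether to broadcast at all, or to restructure the data (e.g. as a joined RDD) when it is too large to replicate to every executor.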
reduceByKey method not being found in Scala Spark

Attempting to run http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala from source. This line: val wordCounts = textFile.flatMap(…

scala apache-spark rdd
How to convert spark DataFrame to RDD mllib LabeledPoints?

I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) …

scala apache-spark rdd pca apache-spark-mllib