Top "Rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
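A minimal PySpark sketch of the idea (app name and local master are illustrative): transformations only record lineage, so a lost partition can be recomputed from its parents rather than restored from a replica.

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-intro")

# map() is lazy: it records a lineage step, nothing runs yet.
squares = sc.parallelize(range(10)).map(lambda x: x * x)

# reduce() is an action: it triggers the distributed computation.
print(squares.reduce(lambda a, b: a + b))  # 285
```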

Number of partitions in RDD and performance in Spark

In PySpark, I can create an RDD from a list and decide how many partitions to have: sc = SparkContext() sc.…

performance apache-spark pyspark rdd
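For context, a hedged PySpark sketch (partition counts are illustrative): the second argument to parallelize() sets the partition count, which in turn sets how many tasks can run in parallel.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partitions-demo")

# numSlices controls how many partitions the data is split into;
# more partitions means more parallel tasks, but each adds overhead.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())  # 8

# repartition() changes the count with a full shuffle;
# coalesce() can shrink the count without one.
print(rdd.repartition(16).getNumPartitions())  # 16
print(rdd.coalesce(2).getNumPartitions())      # 2
```

A common rule of thumb is a small multiple of the total executor cores, though the right number depends on data size and skew.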
Difference between Dataset API and DataFrame API

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for …

apache-spark apache-spark-sql rdd apache-spark-dataset
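The typed Dataset[T] API exists only in Scala and Java; PySpark exposes the untyped half. A small sketch of the RDD/DataFrame split from Python (names and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()

# RDD: opaque Python objects, no schema, no Catalyst optimization.
rdd = spark.sparkContext.parallelize([("alice", 1), ("bob", 2)])
print(rdd.map(lambda kv: kv[1] + 1).collect())

# DataFrame: named columns plus a schema, so queries run through
# the Catalyst optimizer. In Scala, DataFrame = Dataset[Row].
df = rdd.toDF(["name", "count"])
df.select("name").show()
```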
Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

I want to share this particular Apache Spark with Python solution because the documentation for it is quite poor. I wanted …

python apache-spark aggregate average rdd
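The usual answer is to carry (sum, count) pairs per key and divide at the end, which avoids groupByKey's habit of shuffling every value for a key to one executor. A sketch with made-up data:

```python
from pyspark import SparkContext

sc = SparkContext("local", "avg-by-key")
pairs = sc.parallelize([("a", 2.0), ("a", 4.0), ("b", 6.0)])

# Track (sum, count) through the aggregation, then divide.
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                         # create combiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merge value into combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two combiners
)
averages = sum_count.mapValues(lambda t: t[0] / t[1])
print(averages.collectAsMap())  # {'a': 3.0, 'b': 6.0}
```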
PySpark DataFrames - way to enumerate without converting to Pandas?

I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, thus being …

python apache-spark bigdata pyspark rdd
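Two standard options, sketched below with a toy DataFrame: monotonically_increasing_id() is cheap but leaves gaps, while zipWithIndex() yields consecutive indices at the cost of an extra pass over the data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

# Option 1: unique, increasing, but NOT consecutive ids.
df.withColumn("id", F.monotonically_increasing_id()).show()

# Option 2: consecutive 0..n-1 indices via the RDD API.
indexed = df.rdd.zipWithIndex().map(lambda t: tuple(t[0]) + (t[1],))
spark.createDataFrame(indexed, df.columns + ["id"]).show()
```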
How to get a sample with an exact sample size in Spark RDD?

Why does the rdd.sample() function on a Spark RDD return a different number of elements even though the fraction parameter …

apache-spark sample rdd
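In short: sample() makes an independent random decision per element, so only the expected size equals fraction * count; takeSample() returns an exact count. A hedged sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local", "sample-demo")
rdd = sc.parallelize(range(1000))

# sample(): per-element coin flip, so the size varies around 100.
print(rdd.sample(False, 0.1).count())

# takeSample(): exactly num elements, returned as a local list,
# so the result must fit in driver memory.
exact = rdd.takeSample(False, 100, seed=42)
print(len(exact))  # 100
```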
Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

I have the following Spark job, trying to keep everything in memory: val myOutRDD = myInRDD.flatMap { fp => val tuple2…

apache-spark shuffle rdd persist
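Roughly: Shuffle Write is the map-side output serialized for the next stage, while the spill metrics record what overflowed the in-memory aggregation buffers (deserialized size in memory, serialized size on disk). The question's job is Scala; a PySpark sketch of the same shape, with persist() to avoid recomputing the shuffle:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "shuffle-demo")
pairs = sc.parallelize(range(10000)).map(lambda x: (x % 10, x))

# reduceByKey() forces a shuffle; its map output shows up in the UI
# as "Shuffle Write", and overflowed sort buffers as "Shuffle spill".
summed = pairs.reduceByKey(lambda a, b: a + b)

# Persisting the shuffled result keeps later actions from redoing it.
summed.persist(StorageLevel.MEMORY_AND_DISK)
print(summed.count())
```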
Initialize an RDD to empty

I have an RDD declared as JavaPairRDD&lt;String, List&lt;String&gt;&gt; existingRDD; now I need to initialize this …

java apache-spark rdd
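The question targets the Java API, but the idiom is the same everywhere: use the context's empty-RDD factory, or parallelize an empty collection. A PySpark sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local", "empty-rdd")

# emptyRDD() has no partitions at all.
empty = sc.emptyRDD()
print(empty.isEmpty())  # True

# Typical use: start empty and union per-batch results into it.
acc = sc.parallelize([])
for chunk in ([1, 2], [3]):
    acc = acc.union(sc.parallelize(chunk))
print(acc.collect())  # [1, 2, 3]
```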
Difference between Spark RDD's take(1) and first()

I used to think that rdd.take(1) and rdd.first() were exactly the same. However, I began to wonder if …

apache-spark pyspark rdd
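The short version: first() is essentially take(1) plus an unwrap, so they differ mainly in return type and in how they treat an empty RDD. A sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local", "first-vs-take")
rdd = sc.parallelize([10, 20, 30])

print(rdd.take(1))  # [10] -- always a list, possibly empty
print(rdd.first())  # 10   -- the bare element

# The behavioral difference appears on an empty RDD:
empty = sc.emptyRDD()
print(empty.take(1))  # []
try:
    empty.first()     # PySpark raises ValueError("RDD is empty")
except ValueError as err:
    print(err)
```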
How to convert an RDD[Row] back to DataFrame

I've been playing around with converting RDDs to DataFrames and back again. First, I had an RDD of type (Int, …

scala apache-spark dataframe rdd
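The question is in Scala, where the usual answer is createDataFrame(rdd, schema). The PySpark equivalent, sketched with made-up column names and a hand-written schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["n", "s"])

rdd = df.rdd  # RDD[Row]: Rows keep field names but not the schema

# Supply the schema explicitly (or omit it and let Spark infer one
# from the Row field names, which costs a sampling pass).
schema = StructType([
    StructField("n", LongType(), True),
    StructField("s", StringType(), True),
])
df2 = spark.createDataFrame(rdd, schema)
df2.show()
```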
Which function in Spark is used to combine two RDDs by key

Let us say I have the following two RDDs, with the following key-value pairs. rdd1 = [ (key1, [value1, value2]), (key2, [value3, …

python scala apache-spark rdd
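The answer depends on the semantics wanted; a sketch with a hypothetical rdd2 to make the options concrete:

```python
from pyspark import SparkContext

sc = SparkContext("local", "combine-by-key")
rdd1 = sc.parallelize([("key1", ["value1", "value2"]), ("key2", ["value3"])])
rdd2 = sc.parallelize([("key1", ["value4"]), ("key3", ["value5"])])

# union + reduceByKey: concatenate lists for keys in EITHER RDD.
merged = rdd1.union(rdd2).reduceByKey(lambda a, b: a + b)
print(sorted(merged.collect()))

# join: keep only keys present in BOTH RDDs, pairing their values;
# cogroup instead keeps every key and groups each side's values.
print(sorted(rdd1.join(rdd2).collect()))
```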