Top "rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

Spark dataframe transform multiple rows to column

I am a novice to Spark, and I want to transform the source DataFrame below (loaded from a JSON file): +--+-----+…

python apache-spark dataframe apache-spark-sql rdd
Which operations preserve RDD order?

An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed …

apache-spark rdd
Spill to disk and shuffle write spark

I'm getting confused about spill to disk and shuffle write. Using the default Sort shuffle manager, we use an appendOnlyMap …

apache-spark rdd shuffle
Matrix Multiplication in Apache Spark

I am trying to perform matrix multiplication using Apache Spark and Java. I have 2 main questions: How to create RDD …

java scala apache-spark rdd apache-spark-mllib
RDD Aggregate in spark

I am an Apache Spark learner and have come across an RDD action, aggregate, which I have no clue about …

scala apache-spark rdd
How to reverse ordering for RDD.takeOrdered()?

What is the syntax to reverse the ordering for the takeOrdered() method of an RDD in Spark? For bonus points, …

apache-spark rdd
Convert an RDD to iterable: PySpark?

I have an RDD which I am creating by loading a text file and preprocessing it. I don't want to …

python apache-spark pyspark rdd
How Can I Obtain an Element Position in Spark's RDD?

I am new to Apache Spark, and I know that the core data structure is the RDD. Now I am writing …

position apache-spark rdd
Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

apache-spark spark-streaming rdd
Compare data in two RDD in spark

I am able to print the data in two RDDs with the code below: usersRDD.foreach(println) empRDD.foreach(println) I …

apache-spark scala-2.10 cloudera-cdh rdd