Top "rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

How to sort an RDD in Scala Spark?

Reading the docs for the Spark method sortByKey: sortByKey([ascending], [numTasks]). When called on a dataset of (K, V) pairs where K implements Ordered, …

scala apache-spark rdd
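The sortByKey behavior asked about above can be sketched without a cluster. This is a plain-Python stand-in for sorting a (K, V) dataset by key (the sample pairs are made up, not from the question):

```python
# Plain-Python sketch of sortByKey on (K, V) pairs; hypothetical data.
pairs = [("banana", 2), ("apple", 5), ("cherry", 1)]

# Like rdd.sortByKey() -- ascending by key (the default)
ascending = sorted(pairs, key=lambda kv: kv[0])

# Like rdd.sortByKey(False) -- descending by key
descending = sorted(pairs, key=lambda kv: kv[0], reverse=True)

print(ascending)  # [('apple', 5), ('banana', 2), ('cherry', 1)]
```

In real Spark the sort is distributed across partitions, but the ordering contract is the same: keys must have an ordering, and values ride along untouched.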
How do I select a range of elements in Spark RDD?

I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a …

apache-spark rdd
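A common recipe for the range-selection question above is zipWithIndex followed by a filter on the index. Here is that idea simulated over a plain list (sample data and bounds are hypothetical):

```python
# Sketch of the zipWithIndex-then-filter recipe for a range of elements.
data = ["a", "b", "c", "d", "e", "f"]

indexed = list(zip(data, range(len(data))))           # like rdd.zipWithIndex()
lo, hi = 2, 5                                         # half-open range [2, 5)
selected = [x for (x, i) in indexed if lo <= i < hi]  # like .filter(...) then keeping values
print(selected)  # ['c', 'd', 'e']
```

The Spark version pays a pass over the data to assign indices, so for very large ranges it is worth checking whether the ordering you index against is actually deterministic.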
Explain the aggregate functionality in Spark

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python. The example …

python apache-spark lambda aggregate rdd
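The semantics behind the aggregate question can be sketched in plain Python: seqOp folds each partition's elements into a copy of the zero value, then combOp merges the per-partition partials. The two-partition layout below is invented for illustration:

```python
from functools import reduce

# Sketch of RDD.aggregate(zeroValue, seqOp, combOp) over a hypothetical
# 2-partition RDD; the accumulator is (running sum, running count).
partitions = [[1, 2, 3], [4, 5]]
zero = (0, 0)
seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)       # within a partition
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])      # across partitions

partials = [reduce(seq_op, part, zero) for part in partitions]
total = reduce(comb_op, partials, zero)
print(total)  # (15, 5) -> mean = 15 / 5
```

The key constraint this illustrates: combOp must be able to merge two accumulators, which is why aggregate takes two functions where reduce takes one.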
How do I get a SQL row_number equivalent for a Spark RDD?

I need to generate a full list of row_numbers for a data table with many columns. In SQL, this …

sql apache-spark row-number rdd
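For the row_number question, the usual RDD-level substitute is zipWithIndex, which pairs each row with a sequential index. A plain-Python sketch with made-up rows:

```python
# Sketch of using zipWithIndex as a row_number stand-in; hypothetical rows.
rows = [("alice", 3), ("bob", 1), ("carol", 2)]

with_index = list(enumerate(rows))                        # like rdd.zipWithIndex() (0-based)
with_row_number = [(row, i + 1) for i, row in enumerate(rows)]  # SQL row_number() is 1-based
print(with_row_number[0])  # (('alice', 3), 1)
```

Unlike SQL's row_number() OVER (PARTITION BY … ORDER BY …), zipWithIndex numbers rows in the RDD's current partition order, so any required ordering has to be applied (e.g., via sortBy) before indexing.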
How does HashPartitioner work?

I read up on the documentation of HashPartitioner, but unfortunately nothing much was explained beyond the API calls. I am …

scala apache-spark rdd partitioning
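The rule behind the HashPartitioner question is small: a record goes to partition nonNegativeMod(key.hashCode, numPartitions). A plain-Python sketch of that rule (Python's hash stands in for Java's hashCode):

```python
# Sketch of HashPartitioner: partition = nonNegativeMod(hash(key), numPartitions).
def non_negative_mod(x, mod):
    r = x % mod
    # Python's % is already non-negative for mod > 0; the branch mirrors
    # the Java helper, where % can return negative values.
    return r + mod if r < 0 else r

num_partitions = 4
keys = ["a", "b", "a", "c"]
placement = {k: non_negative_mod(hash(k), num_partitions) for k in keys}

# The property that matters: equal keys always map to the same partition.
assert placement["a"] == non_negative_mod(hash("a"), num_partitions)
```

This also shows why skewed keys cause skewed partitions: the partitioner only sees the key's hash, not how many records share that key.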
DataFrame equality in Apache Spark

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. …

scala apache-spark dataframe apache-spark-sql rdd
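One common answer shape for the DataFrame-equality question is an order-insensitive, duplicate-sensitive row comparison (e.g., exceptAll in both directions coming up empty). That check can be sketched with multisets of row tuples; the frames below are invented:

```python
from collections import Counter

# Sketch of row-wise DataFrame equality: same rows with the same
# multiplicities, regardless of row order. Hypothetical frames.
df1 = [("a", 1), ("b", 2), ("b", 2)]
df2 = [("b", 2), ("a", 1), ("b", 2)]

def frames_equal(rows1, rows2):
    # Order-insensitive, duplicate-sensitive -- like exceptAll both ways.
    return Counter(rows1) == Counter(rows2)

print(frames_equal(df1, df2))  # True
```

In real Spark the subtlety the question hints at is that two differently-computed frames may also differ in schema or in floating-point rounding, which a pure row comparison will surface as inequality.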
What is an RDD in Spark?

The definition says: an RDD is an immutable distributed collection of objects. I don't quite understand what that means. Is it like …

scala hadoop apache-spark rdd
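The "immutable collection" part of the definition above can be illustrated without Spark at all: transformations never modify the source collection, they describe a new one. A plain-list sketch:

```python
# Sketch of the "immutable distributed collection" idea with a plain list
# standing in for an RDD's contents.
rdd_like = [1, 2, 3]

doubled = [x * 2 for x in rdd_like]        # like rdd.map(lambda x: x * 2)
big = [x for x in doubled if x > 3]        # like .filter(lambda x: x > 3)

print(rdd_like)  # [1, 2, 3] -- the original is untouched
print(big)       # [4, 6]
```

The part the sketch cannot show is laziness: in Spark, map and filter only record a lineage of transformations, and nothing is computed until an action like collect or count runs.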
Spark: RDD to List

I have an RDD of type RDD[(String, String)] and I want to create 2 Lists (one for each dimension of the …

scala list apache-spark rdd
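For the question above, the usual Spark answer splits the pair RDD with rdd.keys().collect() and rdd.values().collect(). The same unzip can be sketched with plain pairs (sample data is made up):

```python
# Sketch of splitting RDD[(String, String)] into two lists, like
# rdd.keys().collect() and rdd.values().collect(). Hypothetical pairs.
pairs = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]

keys, values = (list(t) for t in zip(*pairs))
print(keys)    # ['k1', 'k2', 'k3']
print(values)  # ['v1', 'v2', 'v3']
```

Worth remembering on a real cluster: collect pulls everything to the driver, so this only works when both lists fit in driver memory.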
How to partition RDD by key in Spark?

Given that the HashPartitioner docs say: [HashPartitioner] implements hash-based partitioning using Java's Object.hashCode. Say I want to partition DeviceData …

scala apache-spark rdd
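The partition-by-key question above boils down to routing each record to bucket partitionFunc(key) % numPartitions, so that all records sharing a key co-locate. A plain-Python sketch (the DeviceData-like records and the partition function are hypothetical):

```python
from collections import defaultdict

# Sketch of rdd.partitionBy(n, partitionFunc): route each record by its key.
records = [("dev1", 10), ("dev2", 20), ("dev1", 30)]  # hypothetical DeviceData
n = 2
partition_func = lambda key: hash(key)  # HashPartitioner-style stand-in

buckets = defaultdict(list)
for key, value in records:
    buckets[partition_func(key) % n].append((key, value))

# All "dev1" records land in exactly one bucket:
dev1_buckets = {p for p, rows in buckets.items() for k, _ in rows if k == "dev1"}
assert len(dev1_buckets) == 1
```

In Spark the same guarantee is what makes later per-key operations (reduceByKey, joins on the same partitioner) avoid a second shuffle.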
How to extract an element from an array in PySpark

I have a data frame of the following type: col1|col2|col3|col4 with values like xxxx|yyyy|zzzz|[1111],[2222]. I want my output to …

python apache-spark pyspark rdd
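For a col4 value shaped like "[1111],[2222]", the extraction asked about above is typically done in PySpark with regexp_extract, or with split plus getItem. The parsing itself can be sketched with a plain regex over one hypothetical row:

```python
import re

# Sketch of pulling the bracketed numbers out of a value like
# "[1111],[2222]"; one hypothetical row of the described frame.
row = ("xxxx", "yyyy", "zzzz", "[1111],[2222]")

first, second = re.findall(r"\[(\d+)\]", row[3])
print(first, second)  # 1111 2222
```

The same pattern string would slot into pyspark.sql.functions.regexp_extract with a group index of 1 to pull either element column-wise.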