Top "Rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

PySpark: repartition vs partitionBy

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've …

apache-spark pyspark rdd
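
A rough sketch of the distinction the question is asking about (the DataFrame, column name, and output path below are hypothetical): repartition() controls how many in-memory partitions the data is shuffled into, while partitionBy() on the writer controls the directory layout of the output on disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-partitionBy").getOrCreate()

# Hypothetical DataFrame with a low-cardinality "country" column.
df = spark.createDataFrame(
    [("us", 1), ("us", 2), ("de", 3)], ["country", "value"]
)

# repartition() reshuffles the data into N in-memory partitions;
# it affects parallelism, not what the output looks like on disk.
df_10 = df.repartition(10, "country")

# partitionBy() belongs to the writer: it splits the output into one
# directory per distinct value of the column, e.g. country=us/, country=de/.
df_10.write.mode("overwrite").partitionBy("country").parquet("/tmp/by_country")
```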
How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala …

apache-spark pyspark rdd
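
A minimal PySpark sketch of the two usual approaches, assuming a toy RDD of integers: randomSplit() for proportional random slices, and one filter() per output when the split is predicate-based, since there is no single-pass "split into multiple RDDs" operation.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))

# randomSplit gives proportional random slices (weights are normalized).
train, test = rdd.randomSplit([0.8, 0.2], seed=42)

# For a predicate-based split, run one filter per output; caching the
# parent avoids recomputing it for each filter.
rdd.cache()
evens = rdd.filter(lambda x: x % 2 == 0)
odds = rdd.filter(lambda x: x % 2 != 0)
```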
How do you perform basic joins of two RDD tables in Spark using Python?

How would you perform basic joins in Spark using Python? In R you could use merge() to do this. What …

python join apache-spark pyspark rdd
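
A small PySpark sketch of pair-RDD joins with made-up users/orders data: join() acts as an inner join on the key, and the outer-join variants keep unmatched keys, padding the missing side with None.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Pair RDDs keyed on a common id (hypothetical data).
users = sc.parallelize([(1, "alice"), (2, "bob")])
orders = sc.parallelize([(1, "book"), (1, "pen"), (3, "lamp")])

# Inner join on the key, producing (key, (left_value, right_value)) pairs.
print(users.join(orders).collect())

# Outer variants keep unmatched keys, padding the missing side with None.
print(users.leftOuterJoin(orders).collect())
print(users.fullOuterJoin(orders).collect())
```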
Spark Parquet partitioning: large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") …

apache-spark spark-dataframe rdd apache-spark-2.0 bigdata
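
The usual cause of the file explosion is that every task writes one file per partition value it happens to hold, so the file count can approach (number of tasks) x (number of keys). A hedged sketch (the "key" column and output path are placeholders) that routes each key to a single task before writing:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with a low-cardinality "key" column.
data = spark.range(1_000_000).withColumn("key", col("id") % 10)

# Naive write: data.write.partitionBy("key").parquet("/location")
# can produce one file per key per task.

# Repartitioning by the same column first sends each key to one task,
# so each output directory ends up with roughly one file.
data.repartition("key").write.mode("overwrite").partitionBy("key").parquet("/location")
```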
Spark performance for Scala vs Python

I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run …

scala performance apache-spark pyspark rdd
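
The common explanation here is that RDD transformations written as Python lambdas ship every record to Python worker processes, while DataFrame expressions are executed inside the JVM, so PySpark DataFrame code typically runs much closer to Scala speed. A rough sketch contrasting the two paths:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD path: each record is serialized out to a Python worker so the
# lambda can run, which is where most of the Python overhead comes from.
slow = sc.parallelize(range(1_000_000)).map(lambda x: x * 2).sum()

# DataFrame path: the expression is compiled and executed inside the
# JVM; the Python side only describes the work.
fast = spark.range(1_000_000).select((col("id") * 2).alias("doubled"))
fast.agg({"doubled": "sum"}).show()
```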
Concatenating datasets of different RDDs in Apache Spark using Scala

Is there a way to concatenate datasets of two different RDDs in Spark? The requirement is: I create two intermediate …

scala apache-spark apache-spark-sql distributed-computing rdd
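
The question is tagged Scala, where this is usually rdd1.union(rdd2) (or sc.union(...) for a list of RDDs); the PySpark equivalent looks like this, with toy pair data:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize([("a", 1), ("b", 2)])
rdd2 = sc.parallelize([("c", 3)])

# union() concatenates the two RDDs without shuffling or deduplicating;
# call distinct() afterwards if duplicates must be removed.
combined = rdd1.union(rdd2)
print(combined.collect())   # [('a', 1), ('b', 2), ('c', 3)]

# SparkContext.union can concatenate a whole list of RDDs at once.
many = sc.union([rdd1, rdd2])
```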
Convert a simple one-line string to an RDD in Spark

I have a simple line: line = "Hello, world" I would like to convert it to an RDD with only one …

python apache-spark pyspark distributed-computing rdd
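
A minimal sketch: sc.parallelize() expects a collection, so wrapping the string in a one-element list yields an RDD with a single record, whereas passing the bare string would produce one element per character.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

line = "Hello, world"

# Wrapping the string in a list gives an RDD with exactly one element.
rdd = sc.parallelize([line])
print(rdd.count())     # 1
print(rdd.collect())   # ['Hello, world']
```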
How DAG works under the covers in RDD?

The Spark research paper proposes a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast …

apache-spark rdd directed-acyclic-graphs
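
One practical way to see the DAG that Spark builds from a chain of transformations is toDebugString(), which prints an RDD's lineage; in this small word-count sketch the reduceByKey shuffle is what introduces a stage boundary.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Each transformation only adds a node to the lineage graph; nothing runs yet.
words = sc.parallelize(["a b", "b c"]).flatMap(lambda s: s.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() prints the lineage DAG; the indentation marks the
# stage boundary created by the reduceByKey shuffle.
print(counts.toDebugString().decode("utf-8"))

# Only an action such as collect() makes the DAG scheduler build
# stages and actually run tasks.
print(counts.collect())
```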
Scala Spark: How to create an RDD from a list of strings and convert it to a DataFrame

I want to create a DataFrame from a list of strings that matches an existing schema. Here is my code. …

scala apache-spark dataframe rdd union-all
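
The question itself is in Scala, but the idea is the same in PySpark: parse each raw string into the shape of the target schema, then apply the schema with createDataFrame(). A sketch using a made-up name/age schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical target schema the strings should match.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

lines = ["alice,30", "bob,25"]

# Parse each raw string into a tuple whose shape matches the schema...
rdd = sc.parallelize(lines).map(lambda s: s.split(",")).map(
    lambda parts: (parts[0], int(parts[1]))
)

# ...then apply the schema when building the DataFrame.
df = spark.createDataFrame(rdd, schema)
df.show()
```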
reduceByKey: How does it work internally?

I am new to Spark and Scala, and I was confused about the way the reduceByKey function works in Spark. Suppose we …

scala apache-spark rdd
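
A minimal word-count sketch; the key point is that the reduce function must be associative and commutative, because Spark merges values for the same key inside each partition (a map-side combine) before shuffling only the partial results and merging them again.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pairs = sc.parallelize(
    [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)], numSlices=2
)

# The lambda is applied twice: first to combine values for the same key
# within each partition, then to merge the partial sums after the shuffle.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(sorted(counts.collect()))   # [('a', 3), ('b', 2)]
```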