Top "rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

Spark: Efficient way to test if an RDD is empty

There is no isEmpty method on RDDs, so what is the most efficient way of testing if an RDD …

scala apache-spark rdd
Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate …

apache-spark filter rdd
What are the differences between sc.parallelize and sc.textFile?

I am new to Spark. Can someone please clear up my doubt? Let's assume the code below is mine: a = sc.textFile(…

apache-spark pyspark rdd
How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole …

scala apache-spark hdfs rdd bigdata
How to remove duplicate values from an RDD [PySpark]

I have the following table as an RDD: (Key, Value) = (1, y), (1, y), (1, y), (1, n), (1, n), (2, y), (2, n), (2, n). I want …

python apache-spark rdd
How does Spark read a large file (petabyte) when the file cannot fit in Spark's main memory?

What will happen for large files in these cases? 1) Spark gets a location from the NameNode for the data. Will Spark stop …

apache-spark rdd partition
Spark RDD - Mapping with extra arguments

Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code recipe: …

python apache-spark pyspark rdd
How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. …

scala apache-spark dataframe apache-spark-sql rdd
PySpark: partitioning data using partitionBy

I understand that the partitionBy function partitions my data. If I use rdd.partitionBy(100) it will partition my data by key …

python apache-spark pyspark partitioning rdd