Top "rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

What is glom()? How is it different from mapPartitions()?

I've come across the glom() method on RDD. As per the documentation Return an RDD created by coalescing all elements …

apache-spark rdd
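One way to see the difference: `glom()` coalesces each partition into a single list, while `mapPartitions(f)` hands `f` an iterator over each partition and lets it yield any number of elements. A minimal sketch in plain Python (no Spark required), modeling an RDD as a list of partitions:

```python
# Conceptual sketch: model an RDD as a list of partitions, each partition
# being a list of elements. This illustrates semantics only, not Spark itself.
partitions = [[1, 2], [3, 4, 5], [6]]

# glom(): each partition is coalesced into a single list element, so the
# collected result is one list per partition.
glom_result = [list(part) for part in partitions]

# mapPartitions(f): f receives an iterator over one partition and may yield
# any number of output elements; here it yields one sum per partition.
def sum_partition(it):
    yield sum(it)

map_result = [x for part in partitions for x in sum_partition(iter(part))]

print(glom_result)  # [[1, 2], [3, 4, 5], [6]]
print(map_result)   # [3, 12, 6]
```

In Spark itself, `rdd.glom()` behaves like `rdd.mapPartitions(lambda it: [list(it)])`.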
How does Spark's RDD.randomSplit actually split the RDD?

So assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of …

apache-spark rdd
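The key point is that `randomSplit` assigns each element to a split by an independent random draw against the normalized weights, so the resulting sizes are only approximately proportional, not exact. A plain-Python sketch of these semantics (the function name and seed here are illustrative, not Spark's implementation):

```python
import random

def random_split(rows, weights, seed=42):
    """Sketch of randomSplit-style semantics: each element is assigned to a
    split by one independent draw against the cumulative normalized weights,
    so split sizes are only approximately proportional to the weights."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, bound in enumerate(bounds):
            if x <= bound:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(3000)), [0.8, 0.2])
print(len(train), len(test))  # roughly 2400 / 600, not exactly
```

This is why a 2000/1000 class layout does not land cleanly on one side of the split: assignment is per element, not per contiguous block.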
Is Spark RDD cached on worker node or driver node (or both)?

Can anyone please correct my understanding of persisting in Spark? If we have performed a cache() on an RDD …

apache-spark apache-spark-sql rdd
Why does a Spark RDD partition have a 2GB limit for HDFS?

I get an error when using MLlib RandomForest to train data. As my dataset is huge and the default …

scala apache-spark rdd
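The limit comes from partition/shuffle blocks historically being backed by byte buffers indexed by a Java `int`, so no single partition could exceed `Integer.MAX_VALUE` bytes (~2GB). The usual workaround is to repartition so each partition stays well under that bound. A small sketch of the sizing arithmetic (the target size of 256 MB is an illustrative choice, not a Spark default):

```python
# Sketch: choose a partition count that keeps each partition well under the
# ~2GB (Integer.MAX_VALUE bytes) block limit. target_bytes is illustrative.
def partitions_needed(total_bytes, target_bytes=256 * 1024**2):
    """Ceiling division: how many partitions keep each under target_bytes."""
    return max(1, -(-total_bytes // target_bytes))

# e.g. a 10 GB dataset at ~256 MB per partition:
print(partitions_needed(10 * 1024**3))  # 40
# In Spark one would then call rdd.repartition(partitions_needed(estimated_size)).
```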
Random numbers generation in PySpark

Let's start with a simple function which always returns a random integer: import numpy as np def f(x): return …

python random apache-spark pyspark rdd
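The classic pitfall here is that identically seeded worker processes produce identical "random" sequences across partitions. A common fix is to derive a distinct seed from the partition index, as one would inside `mapPartitionsWithIndex`. A plain-Python sketch of that pattern (function name and base seed are illustrative):

```python
import random

# If every partition's worker seeds its RNG identically, all partitions
# draw the same sequence. Deriving a per-partition seed avoids this; in
# Spark this per-partition logic would live inside mapPartitionsWithIndex.
def random_ints_for_partition(index, n, base_seed=0):
    rng = random.Random(base_seed + index)  # distinct seed per partition
    return [rng.randint(0, 10**9) for _ in range(n)]

part0 = random_ints_for_partition(0, 5)
part1 = random_ints_for_partition(1, 5)
print(part0 != part1)  # different seeds, different sequences
```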
How to solve type mismatch when compiler finds Serializable instead of the match type?

I have the following parser to parse arithmetic expressions containing Float and RDD: import scalaz._ import Scalaz._ def term2: …

scala parsing rdd type-mismatch scalaz7
Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?

The Apache Spark pyspark.RDD API docs mention that groupByKey() is inefficient. Instead, it is recommended to use reduceByKey(), aggregateByKey(), …

apache-spark rdd pyspark
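To express `groupByKey()` via `aggregateByKey(zeroValue, seqOp, combOp)`, the zero value is an empty list, `seqOp` appends a value within a partition, and `combOp` concatenates partial lists across partitions. A plain-Python sketch of those semantics (the helper function is illustrative, not Spark's implementation):

```python
from collections import defaultdict

# Sketch of aggregateByKey semantics over a list-of-partitions model:
# seq_op folds values into a per-key accumulator within each partition,
# comb_op merges per-partition accumulators across partitions.
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    partials = []
    for part in partitions:                          # per-partition pass (seqOp)
        acc = defaultdict(lambda: list(zero))
        for k, v in part:
            acc[k] = seq_op(acc[k], v)
        partials.append(acc)
    merged = {}                                      # cross-partition merge (combOp)
    for acc in partials:
        for k, v in acc.items():
            merged[k] = comb_op(merged[k], v) if k in merged else v
    return merged

parts = [[("a", 1), ("b", 2)], [("a", 3)]]
grouped = aggregate_by_key(parts, [], lambda xs, v: xs + [v], lambda a, b: a + b)
print(grouped)  # {'a': [1, 3], 'b': [2]}
```

Note that grouping into full lists this way still shuffles every value, which is exactly why the docs steer you toward `reduceByKey`-style aggregation when a full list per key is not actually needed.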
Programmatically generate the schema AND the data for a dataframe in Apache Spark

I would like to dynamically generate a dataframe containing a header record for a report, so creating a dataframe from …

apache-spark dataframe spark-dataframe rdd spark-csv
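The shape of the usual answer: build the column names (and types) and the matching rows programmatically, then hand both to `spark.createDataFrame(rows, schema)`. A plain-Python sketch of generating that pair (all names here are illustrative):

```python
# Sketch: generate a schema (name, type) list and matching rows in the
# shape spark.createDataFrame(rows, schema) expects. Purely illustrative.
n_cols, n_rows = 3, 2
schema = [(f"col_{i}", "int") for i in range(n_cols)]   # (name, type) pairs
header = tuple(name for name, _ in schema)              # a header record for the report
data = [tuple(r * n_cols + c for c in range(n_cols)) for r in range(n_rows)]

print(schema)  # [('col_0', 'int'), ('col_1', 'int'), ('col_2', 'int')]
print(header)  # ('col_0', 'col_1', 'col_2')
print(data)    # [(0, 1, 2), (3, 4, 5)]
```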
Apache Spark RDD filter into two RDDs

I need to split an RDD into 2 parts: one part that satisfies a condition and another part that does not. I can …

apache-spark rdd
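Spark has no single-pass primitive that returns two RDDs from one predicate, so the standard answer is two `filter()` passes over a cached RDD (one with the predicate, one with its negation). Sketched in plain Python:

```python
# Sketch: the two-filter approach. In Spark one would cache() the source RDD
# first so the data is not recomputed for the second pass.
data = list(range(10))
pred = lambda x: x % 2 == 0

evens = [x for x in data if pred(x)]        # rdd.filter(pred)
odds = [x for x in data if not pred(x)]     # rdd.filter(lambda x: not pred(x))

print(evens)  # [0, 2, 4, 6, 8]
print(odds)   # [1, 3, 5, 7, 9]
```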
Spark RDD - is partition(s) always in RAM?

We all know Spark does its computation in memory. I am just curious about the following. If I create 10 RDDs in …

hadoop apache-spark pyspark hdfs rdd