Top "rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
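As a minimal illustration of the abstraction (a sketch assuming a SparkContext `sc`, e.g. from spark-shell; the data is made up):

```scala
// An RDD is built lazily from transformations and only computed
// when an action (collect, count, ...) runs.
val nums    = sc.parallelize(1 to 5)  // distribute a local collection
val doubled = nums.map(_ * 2)         // lazy transformation, nothing runs yet
doubled.collect()                     // action: Array(2, 4, 6, 8, 10)
```

If a partition is lost, Spark recomputes it from this lineage rather than relying on replicated data — that is the fault-tolerance retained from data flow models.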

Is there an "Explain RDD" in Spark?

In particular, if I say rdd3 = rdd1.join(rdd2) then when I call rdd3.collect, depending on the Partitioner used, …

apache-spark rdd
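For reference, `join` on pair RDDs is an inner join by key; here is the same semantics sketched with plain Scala collections (no cluster needed — the data and names are made up):

```scala
// Inner join by key, mirroring PairRDDFunctions.join semantics locally.
val rdd1 = Seq(("a", 1), ("b", 2))
val rdd2 = Seq(("a", "x"), ("a", "y"), ("c", "z"))

val joined: Seq[(String, (Int, String))] =
  for {
    (k1, v1) <- rdd1
    (k2, v2) <- rdd2
    if k1 == k2
  } yield (k1, (v1, v2))

// Only keys present on both sides survive, one row per matching pair:
// joined == Seq(("a", (1, "x")), ("a", (1, "y")))
```

The Partitioner affects where each result partition is computed and whether a shuffle is needed, not the set of pairs `collect` returns.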
Spark: difference when reading in .gz and .bz2

I normally read and write files in Spark using .gz, where the number of files should be the same as …

apache-spark rdd gzip bz2
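The usual explanation for the difference: gzip is not a splittable codec, so each .gz file becomes exactly one partition, while bzip2 is splittable, so one large .bz2 file can be read as several partitions. A sketch (assumes a SparkContext `sc`; the paths are illustrative):

```scala
// Partition counts reveal the splittability difference.
val gz  = sc.textFile("data/*.gz")   // one partition per .gz file
val bz2 = sc.textFile("data/*.bz2")  // large .bz2 files may be split

println(gz.getNumPartitions)         // equals the number of .gz files
println(bz2.getNumPartitions)        // can exceed the number of .bz2 files
```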
Spark throws a stack overflow error when unioning many RDDs

When I use "++" to combine a lot of RDDs, I get a stack overflow error. Spark version: 1.3.1. Environment: yarn-client. …

apache-spark rdd
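A common cause and fix (a sketch assuming a SparkContext `sc` and a `Seq` of RDDs named `rdds`): each `++` nests the lineage one level deeper, and traversing a very deep lineage can overflow the stack; `SparkContext.union` builds one flat UnionRDD instead.

```scala
// Deep, left-nested lineage: one level per ++, can overflow the stack.
val chained = rdds.reduce(_ ++ _)

// Single flat UnionRDD over all inputs: lineage stays shallow.
val flat = sc.union(rdds)

// For long-running iterative jobs, checkpointing also truncates lineage:
// sc.setCheckpointDir("hdfs:///tmp/checkpoints"); flat.checkpoint()
```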
Spark: group concat equivalent in a Scala RDD

I have the following DataFrame:

|------id------|-----value-----|-----desc------|
|      1       |      v1       |      d1       |
|      1       |      v2       |      d2       |
|      2       |      v21      |      d21      |
|      2       |      v22      |      d22      |
|--------------|---------------|---------------|

I want …

scala apache-spark group-concat rdd spark-dataframe
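On an RDD (rather than a DataFrame), group concat is typically a `reduceByKey` or `groupByKey` over (id, value) pairs. The same shape with plain Scala collections (made-up data, no cluster needed):

```scala
// Concatenate all values per id, comma-separated.
val rows = Seq((1, "v1"), (1, "v2"), (2, "v21"), (2, "v22"))

val concatenated: Map[Int, String] =
  rows.groupBy(_._1)
      .map { case (id, pairs) => id -> pairs.map(_._2).mkString(",") }

// concatenated == Map(1 -> "v1,v2", 2 -> "v21,v22")
```

On a real pair RDD, `rdd.reduceByKey(_ + "," + _)` does the same job while combining partially on each partition, avoiding pulling whole groups onto one executor.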
What are the Spark RDD graph, lineage graph, and DAG of Spark tasks? What are their relations?

When we talk about RDD graphs, do we mean the lineage graph, the DAG (directed acyclic graph), or both? And when …

apache-spark rdd directed-acyclic-graphs
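Roughly: the RDD (lineage) graph is the dependency chain among RDDs, and the scheduler derives the stage/task DAG from it when an action runs. `toDebugString` shows the lineage (a sketch assuming a SparkContext `sc`):

```scala
// toDebugString prints the lineage: the chain of parent RDDs
// Spark would recompute a lost partition from.
val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(rdd.toDebugString)
```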
Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

We can persist an RDD into memory and/or disk when we want to use it more than once. However, …

apache-spark hadoop rdd distributed-computing
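For context (a sketch assuming a SparkContext `sc`): Spark's ContextCleaner can reclaim a persisted RDD asynchronously once the driver-side reference is garbage-collected, but deterministic release is an explicit `unpersist()`:

```scala
val cached = sc.parallelize(1 to 100).persist()  // mark for caching

cached.count()      // first action materializes the cache
cached.count()      // subsequent actions read from memory
cached.unpersist()  // explicit, immediate release
```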
Spark: persist and repartition order

I have the following code: val data = input.map{... }.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000) I am wondering what's the …

apache-spark rdd partition persist
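`persist` marks the RDD it is called on, so the order matters. A sketch (assumes a SparkContext, an `input` RDD, and a placeholder `map`):

```scala
import org.apache.spark.storage.StorageLevel

// Caches the pre-shuffle RDD: the output of the repartition shuffle
// is NOT cached, so later actions repeat the shuffle.
val a = input.map(identity)
             .persist(StorageLevel.MEMORY_ONLY_SER)
             .repartition(2000)

// Caches the post-shuffle RDD, which is usually what is intended:
val b = input.map(identity)
             .repartition(2000)
             .persist(StorageLevel.MEMORY_ONLY_SER)
```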