Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
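For readers new to the abstraction, here is a minimal sketch in Scala (the application name, master setting, and input path are illustrative assumptions, not part of any question below) showing how an RDD is defined by coarse-grained transformations whose lineage Spark can replay to recover lost partitions:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: "rdd-sketch", local[*] and the HDFS path are placeholder assumptions.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // An RDD is an immutable, partitioned collection built from transformations.
    val lines  = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Nothing has executed yet; the lineage (textFile -> flatMap -> map -> reduceByKey)
    // is what Spark replays to rebuild lost partitions, giving MapReduce-style fault tolerance.
    counts.take(10).foreach(println)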
In particular, if I say rdd3 = rdd1.join(rdd2), then when I call rdd3.collect, depending on the Partitioner used, …
apache-spark rdd
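A hedged sketch of the scenario being asked about, assuming two pair RDDs and a HashPartitioner (the data and partition count are illustrative, and an existing SparkContext named sc is assumed): when both sides of a join already share the same Partitioner, the join can reuse that partitioning instead of re-shuffling both inputs.

    import org.apache.spark.HashPartitioner

    // Illustrative pair RDDs; in the question they are rdd1 and rdd2.
    val part = new HashPartitioner(8)
    val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part)
    val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part)

    // Because both inputs are co-partitioned, the join keeps that Partitioner
    // rather than shuffling both sides again.
    val rdd3 = rdd1.join(rdd2)
    rdd3.collect().foreach(println)   // e.g. (1,(a,x)), (2,(b,y))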
I normally read and write files in Spark using .gz, where the number of files should be the same as …
apache-spark rdd gzip bz2
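A small sketch of the behaviour usually behind this question (input and output paths are hypothetical): a .gz file is not splittable, so each input file becomes exactly one partition, and saveAsTextFile writes one part file per partition, which repartition or coalesce can change.

    // Each non-splittable .gz input file yields a single partition.
    val rdd = sc.textFile("hdfs:///data/in/*.gz")
    println(rdd.partitions.length)

    // saveAsTextFile writes one part-xxxxx file per partition; passing a
    // compression codec class produces gzip output again.
    import org.apache.hadoop.io.compress.GzipCodec
    rdd.coalesce(4).saveAsTextFile("hdfs:///data/out", classOf[GzipCodec])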
When I use "++" to combine a lot of RDDs, I got a stack overflow error. Spark version: 1.3.1. Environment: yarn-client. …
apache-spark rdd
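A hedged sketch of the pattern that commonly triggers this (the loop size and variable names are illustrative): folding ++ over many RDDs nests UnionRDDs very deeply, whereas SparkContext.union builds one flat union over the whole sequence and is the usual workaround.

    val parts: Seq[org.apache.spark.rdd.RDD[Int]] =
      (1 to 1000).map(i => sc.parallelize(Seq(i)))

    // Deeply nested: each ++ wraps the previous result in another UnionRDD,
    // which can blow the stack when the long chain is evaluated.
    val nested = parts.reduce(_ ++ _)

    // Flat alternative: a single union over the whole sequence.
    val flat = sc.union(parts)
    println(flat.count())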
I have the following DataFrame:

    |------id------|----value------|-----desc------|
    |       1      |      v1       |      d1       |
    |       1      |      v2       |      d2       |
    |       2      |      v21     |      d21     |
    |       2      |      v22     |      d22     |
    |--------------|---------------|---------------|

I want …
scala apache-spark group-concat rdd spark-dataframe
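The excerpt is cut off, but going by the group-concat tag, a possible sketch of collapsing each id's rows into single strings (this assumes a SparkSession named spark and a Spark version where collect_list and concat_ws are available in org.apache.spark.sql.functions; column and result names are illustrative):

    import spark.implicits._   // assuming a SparkSession named spark
    import org.apache.spark.sql.functions.{collect_list, concat_ws}

    // Illustrative DataFrame with the same shape as in the question.
    val df = Seq(
      (1, "v1", "d1"), (1, "v2", "d2"),
      (2, "v21", "d21"), (2, "v22", "d22")
    ).toDF("id", "value", "desc")

    // One possible "group-concat": one row per id, values joined with commas.
    val grouped = df.groupBy("id")
      .agg(concat_ws(",", collect_list("value")).as("values"),
           concat_ws(",", collect_list("desc")).as("descs"))
    grouped.show()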
When we talk about RDD graphs, does it mean the lineage graph or the DAG (directed acyclic graph), or both? And when …
apache-spark rdd directed-acyclic-graphs
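A small way to look at both views on a concrete RDD (a local sketch with illustrative data): toDebugString prints the lineage of parent RDDs, and when an action runs, the scheduler turns that graph into a DAG of stages split at shuffle boundaries.

    val rdd = sc.parallelize(1 to 100)
      .map(x => (x % 10, x))
      .reduceByKey(_ + _)

    // Prints the lineage: the chain of parent RDDs and their dependencies.
    // The DAGScheduler cuts this graph into stages at shuffle boundaries.
    println(rdd.toDebugString)
    rdd.count()   // triggers execution of the actual DAG of stages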
We can persist an RDD into memory and/or disk when we want to use it more than once. However, …
apache-spark hadoop rdd distributed-computing
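A minimal sketch of the persistence options the question refers to (the input path and transformation are illustrative): persist keeps computed partitions around for reuse across actions, with the storage level deciding the memory/disk trade-off.

    import org.apache.spark.storage.StorageLevel

    val expensive = sc.textFile("hdfs:///data/in")   // hypothetical path
      .map(line => line.length)

    // Keep computed partitions in memory, spilling to disk if they do not fit;
    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    expensive.persist(StorageLevel.MEMORY_AND_DISK)

    println(expensive.count())   // first action materialises and stores the partitions
    println(expensive.sum())     // later actions reuse the persisted partitions
    expensive.unpersist()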