Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
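As a minimal illustration of the abstraction (a sketch assuming a local Spark installation; the variable names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: build an RDD, transform it lazily, then materialize it
// with an action. Assumes a local Spark installation.
val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

val nums = sc.parallelize(1 to 10)   // distributed dataset
val squares = nums.map(n => n * n)   // lazy transformation, nothing runs yet
println(squares.reduce(_ + _))       // action triggers computation: 385

sc.stop()
```

If a node is lost, Spark recomputes the affected partitions from this recorded lineage rather than replicating the data up front.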
In particular, if I say rdd3 = rdd1.join(rdd2), then when I call rdd3.collect, depending on the Partitioner used, …
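The behavior the question above gestures at: when both sides of a join share the same Partitioner, the join becomes a narrow dependency and avoids an extra shuffle. A hedged sketch (assumes an existing SparkContext sc; the data is illustrative):

```scala
import org.apache.spark.HashPartitioner

// Sketch: co-partition both RDDs with the same HashPartitioner so that
// matching keys are already colocated when join runs (no join-time shuffle).
// Assumes an existing SparkContext `sc`; the data below is illustrative.
val part = new HashPartitioner(4)

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part)
val rdd2 = sc.parallelize(Seq((1, "x"), (3, "y"))).partitionBy(part)

val rdd3 = rdd1.join(rdd2)        // joins within partitions; keeps `part`
rdd3.collect().foreach(println)   // e.g. (1,(a,x))
```

Without a shared partitioner, join must shuffle one or both inputs so that equal keys meet on the same node.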
tags: apache-spark, rdd

I normally read and write files in Spark using .gz, where the number of files should be the same as …
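Relevant to the .gz question above: gzip files are not splittable, so Spark reads each .gz file as a single partition, and saveAsTextFile writes one part-file per partition — which is why input and output file counts tend to line up. A sketch (paths are placeholders; assumes an existing SparkContext sc):

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Sketch: each input .gz becomes exactly one partition (gzip is not
// splittable), and saveAsTextFile emits one part-file per partition.
// Paths are placeholders; assumes an existing SparkContext `sc`.
val lines = sc.textFile("hdfs:///data/input/*.gz")
println(lines.getNumPartitions)   // equals the number of input .gz files

lines.saveAsTextFile("hdfs:///data/output", classOf[GzipCodec])
// the output directory holds one part-XXXXX.gz per partition
```

To change the output file count, repartition (or coalesce) before saving.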
tags: apache-spark, rdd, gzip, bz2

When I use "++" to combine a lot of RDDs, I get a stack overflow error. Spark version: 1.3.1. Environment: yarn-client. …
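On the stack overflow above: chaining ++ nests one UnionRDD per step, so the lineage depth grows with the number of RDDs, and traversing it recursively can blow the stack. SparkContext.union builds a single flat union instead. A sketch (assumes an existing SparkContext sc; the data is illustrative):

```scala
// Sketch: folding with ++ nests one UnionRDD per step, giving a lineage
// whose depth equals the number of RDDs -- deep enough to overflow the
// stack. SparkContext.union builds one flat UnionRDD instead.
// Assumes an existing SparkContext `sc`; the data is illustrative.
val rdds = (1 to 1000).map(i => sc.parallelize(Seq(i)))

// Risky: val all = rdds.reduce(_ ++ _)   // lineage depth ~1000
val all = sc.union(rdds)                  // single flat union
println(all.count())                      // 1000
```

Periodic checkpointing is another common remedy when a lineage must grow long.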
tags: apache-spark, rdd

I have the following DataFrame:

|-----id-------|----value------|-----desc------|
|      1       |      v1       |      d1       |
|      1       |      v2       |      d2       |
|      2       |      v21      |      d21      |
|      2       |      v22      |      d22      |
|--------------|---------------|---------------|

I want …
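Judging by the group-concat tag, the likely goal above is to collapse the rows for each id into concatenated strings. One common approach is collect_list with concat_ws; a hedged sketch (assumes an existing SparkSession spark; the column handling is illustrative):

```scala
import org.apache.spark.sql.functions.{collect_list, concat_ws}
import spark.implicits._   // assumes an existing SparkSession `spark`

// Sketch of a group-concat: gather each id's values into a list, then
// join them with a separator. Data mirrors the table in the question.
val df = Seq((1, "v1", "d1"), (1, "v2", "d2"),
             (2, "v21", "d21"), (2, "v22", "d22"))
  .toDF("id", "value", "desc")

val out = df.groupBy("id")
  .agg(concat_ws(",", collect_list("value")).as("values"),
       concat_ws(",", collect_list("desc")).as("descs"))
out.show()
// e.g. id=1 -> values "v1,v2", descs "d1,d2"
```

Note that collect_list does not guarantee element order; add an explicit sort first if order matters.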
tags: scala, apache-spark, group-concat, rdd, spark-dataframe

When we talk about RDD graphs, do we mean the lineage graph, the DAG (directed acyclic graph), or both? And when …
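One concrete way to see the lineage the question above asks about is toDebugString, which prints the chain of parent RDDs that the scheduler later turns into a stage DAG. A sketch (assumes an existing SparkContext sc):

```scala
// Sketch: toDebugString prints the recorded lineage (the chain of parent
// RDDs); the scheduler derives the stage DAG from it at action time.
// Assumes an existing SparkContext `sc`.
val words = sc.parallelize(Seq("a", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

println(counts.toDebugString)
// shows a ShuffledRDD depending on a MapPartitionsRDD depending on a
// ParallelCollectionRDD; the shuffle boundary marks a stage split
```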
tags: apache-spark, rdd, directed-acyclic-graphs

We can persist an RDD in memory and/or on disk when we want to use it more than once. However, …
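The pattern described above in a hedged sketch (the path and filters are placeholders; assumes an existing SparkContext sc):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: persist keeps a reused RDD's partitions after the first action,
// so later actions skip recomputation. Path and filters are placeholders;
// assumes an existing SparkContext `sc`.
val parsed = sc.textFile("hdfs:///logs/*").map(_.toLowerCase)
parsed.persist(StorageLevel.MEMORY_AND_DISK)

val errors = parsed.filter(_.contains("error")).count() // computes + caches
val warns  = parsed.filter(_.contains("warn")).count()  // served from cache

parsed.unpersist()   // release the storage when the RDD is no longer needed
```

MEMORY_AND_DISK spills partitions that don't fit in memory to disk instead of dropping them, trading I/O for avoided recomputation.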
tags: apache-spark, hadoop, rdd, distributed-computing

I have the following code:

val data = input.map{... }.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000)

I am wondering what's the …
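One thing worth noting about the line above: persist is invoked on the pre-repartition RDD, so what gets marked for caching is the map output, not the repartitioned data. A hedged sketch of the distinction (names are illustrative; assumes an existing RDD input):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: the placement of persist decides which RDD is cached.
// Assumes an existing RDD `input`; the map function is illustrative.

// (a) marks the *map output* for caching; the repartition shuffle sits
//     downstream of the cached data
val a = input.map(x => x)
  .persist(StorageLevel.MEMORY_ONLY_SER)
  .repartition(2000)

// (b) marks the *repartitioned* data for caching; later actions read the
//     already-shuffled, cached partitions directly
val b = input.map(x => x)
  .repartition(2000)
  .persist(StorageLevel.MEMORY_ONLY_SER)
```

Which order is better depends on whether the cached RDD or the repartitioned one is the thing reused across actions.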
tags: apache-spark, rdd, partition, persist