Top "Rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
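
A minimal sketch of that model, assuming a local spark-shell session (which provides sc): an RDD is defined by a lineage of transformations, kept in memory with cache(), and recomputed from that lineage if a partition is lost.

    // spark-shell provides sc (SparkContext); nothing runs until an action is called
    val numbers = sc.parallelize(1 to 1000000)
    val evens   = numbers.filter(_ % 2 == 0)

    evens.cache()                    // keep the computed partitions in memory
    println(evens.count())           // first action: computes and caches
    println(evens.map(_ * 2).sum())  // second action: served from the cache
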

How to convert an RDD object to a DataFrame in Spark

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame org.…

scala apache-spark apache-spark-sql rdd
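
A minimal sketch of one common answer, assuming a spark-shell session; the Row contents and schema below are hypothetical, since an RDD[Row] carries no schema of its own and one has to be supplied to createDataFrame:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // hypothetical RDD[Row]; the asker's real rows would come from elsewhere
    val rowRdd = sc.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))

    // an RDD[Row] has no schema, so it must be given explicitly
    val schema = StructType(Seq(
      StructField("id",   IntegerType, nullable = false),
      StructField("name", StringType,  nullable = true)
    ))

    val df = spark.createDataFrame(rowRdd, schema)
    df.show()
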
Spark - repartition() vs coalesce()

According to Learning Spark: Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an …

apache-spark distributed-computing rdd
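
A short sketch of the practical difference, assuming spark-shell: repartition() always performs a full shuffle, while coalesce() merges existing partitions and can avoid the shuffle when only reducing their number.

    val rdd = sc.parallelize(1 to 100, numSlices = 8)

    val fewer = rdd.coalesce(2)      // narrow dependency: merges partitions, no shuffle
    val more  = rdd.repartition(16)  // wide dependency: full shuffle of the data

    println(fewer.getNumPartitions)  // 2
    println(more.getNumPartitions)   // 16
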
How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a DataFrame in Spark. I …

scala apache-spark dataframe apache-spark-sql rdd
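
A minimal sketch of one way to do it, assuming spark-shell, a hypothetical comma-delimited file, and made-up column names:

    import spark.implicits._

    case class Person(name: String, age: Int)       // hypothetical record layout

    val df = spark.sparkContext
      .textFile("hdfs:///data/people.txt")           // hypothetical HDFS path
      .map(_.split(","))
      .map(fields => Person(fields(0), fields(1).trim.toInt))
      .toDF()

    df.printSchema()
    df.show()

On Spark 2.x, spark.read with the csv format can also load delimited files directly, skipping the RDD step entirely.
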
Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for …

dataframe apache-spark apache-spark-sql rdd apache-spark-dataset
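
A compact sketch of how the three relate in Spark 2.x, assuming spark-shell and a made-up case class:

    import spark.implicits._

    case class Record(id: Long, name: String)                // hypothetical type

    val ds  = Seq(Record(1L, "a"), Record(2L, "b")).toDS()   // Dataset[Record]: typed, compile-time checked
    val df  = ds.toDF()                                       // DataFrame = Dataset[Row]: untyped rows
    val rdd = ds.rdd                                          // plain RDD[Record]: no Catalyst optimizer

    val typedAgain = df.as[Record]                            // back from Row to the typed view
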
Apache Spark: map vs mapPartitions?

What's the difference between an RDD's map and mapPartitions methods? And does flatMap behave like map or like mapPartitions? Thanks. (…

performance scala apache-spark rdd
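
A small sketch of the distinction, assuming spark-shell: map runs the function once per element, mapPartitions once per partition (receiving an iterator), and flatMap is per element but may emit zero or more outputs.

    val rdd = sc.parallelize(1 to 10, numSlices = 2)

    // map: called once for every element
    val perElement = rdd.map(_ * 2)

    // mapPartitions: called once per partition with the whole iterator,
    // handy for per-partition setup (e.g. opening one connection per partition)
    val perPartition = rdd.mapPartitions { iter =>
      // any expensive setup would happen here, once per partition
      iter.map(_ * 2)
    }

    // flatMap: per element, but each element can produce several outputs
    val flattened = rdd.flatMap(x => Seq(x, x * 10))

    println(perPartition.collect().toList)
    println(flattened.collect().toList)
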
Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemaRDDs to end up with only the different content from the first …

apache-spark dataframe rdd
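
A short sketch of the DataFrame-era equivalent, assuming spark-shell and toy data: except() keeps the rows of the first DataFrame that do not appear in the second, which is what subtract did for SchemaRDDs/RDDs.

    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val df2 = Seq((2, "b"), (3, "c")).toDF("id", "value")

    // rows of df1 that are not present in df2
    val onlyInDf1 = df1.except(df2)
    onlyInDf1.show()
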
Spark specify multiple column conditions for dataframe join

How can I give more column conditions when joining two DataFrames? For example, I want to run the following: val Lead_…

apache-spark apache-spark-sql rdd
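
A minimal sketch of one way to express it, assuming spark-shell and made-up tables (the question's own val Lead_… is truncated): chain the conditions with && inside a single join expression, or pass a Seq of shared column names.

    import spark.implicits._

    val left  = Seq((1, "2020", 10)).toDF("id", "year", "amount")
    val right = Seq((1, "2020", "US")).toDF("id", "year", "country")

    // several column conditions combined with && in one join expression
    val joined = left.join(
      right,
      left("id") === right("id") && left("year") === right("year"),
      "inner"
    )

    // when the join columns share names, a Seq of names also works
    // and avoids duplicated columns in the output
    val joinedByName = left.join(right, Seq("id", "year"))
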
What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

apache-spark distributed-computing rdd
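
A brief sketch of the relationship, assuming spark-shell: on an RDD, cache() is simply persist(StorageLevel.MEMORY_ONLY), while persist() lets you choose the storage level.

    import org.apache.spark.storage.StorageLevel

    val a = sc.parallelize(1 to 1000)
    a.cache()                                    // same as a.persist(StorageLevel.MEMORY_ONLY)

    val b = sc.parallelize(1 to 1000)
    b.persist(StorageLevel.MEMORY_AND_DISK_SER)  // explicit storage level

    // both are released the same way
    a.unpersist()
    b.unpersist()
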
Spark read file from S3 using sc.textFile("s3n://...")

Trying to read a file located in S3 using spark-shell: scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.…

java scala apache-spark rdd hortonworks-data-platform
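
A hedged sketch of one common setup, assuming spark-shell with the hadoop-aws package on the classpath; the bucket, key, and credentials below are placeholders, and s3a:// is generally preferred over the older s3n:// scheme.

    // placeholder credentials; in practice these often come from the environment or IAM roles
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

    val myRdd = sc.textFile("s3a://myBucket/myFile1.txt")   // placeholder bucket and key
    println(myRdd.count())
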
Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, …

python apache-spark mapreduce pyspark rdd
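
The question is tagged pyspark, but the same idea in Scala (to stay consistent with the other sketches on this page) looks like this, with toy pairs standing in for the asker's data:

    val pairs = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3)))

    // groupByKey gathers every value for a key into one iterable,
    // at the cost of shuffling all values across the network
    val grouped = pairs.groupByKey().mapValues(_.toList)

    // aggregateByKey builds the lists map-side first, which usually shuffles less
    val aggregated = pairs.aggregateByKey(List.empty[Int])(
      (acc, v) => v :: acc,
      (a, b)   => a ::: b
    )

    grouped.collect().foreach(println)
    aggregated.collect().foreach(println)
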