Top "rdd" questions

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

How to find Spark RDD/DataFrame size?

I know how to find the file size in Scala. But how do I find the size of an RDD/DataFrame in Spark? …

scala apache-spark rdd
How to find median and quantiles using Spark

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is …

python apache-spark median rdd pyspark
(Why) do we need to call cache or persist on an RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we …

scala apache-spark rdd
Spark union of multiple RDDs

In my Pig code I do this: all_combined = Union relation1, relation2, relation3, relation4, relation5, relation6. I want to do …

python apache-spark pyspark rdd
How to read from HBase using Spark

The code below reads from HBase, converts the result to a JSON structure, and then converts it to a SchemaRDD, but …

hbase apache-spark rdd
How to convert Spark RDD to pandas DataFrame in IPython?

I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD …

python pandas ipython pyspark rdd
Join two ordinary RDDs with/without Spark SQL

I need to join two ordinary RDDs on one/more columns. Logically this operation is equivalent to the database join …

scala join apache-spark rdd apache-spark-sql
Apache Spark: dealing with CASE statements

I am dealing with transforming SQL code to PySpark code and came across some SQL statements. I don't know how …

apache-spark pyspark spark-dataframe rdd pyspark-sql
How to get element by Index in Spark RDD (Java)

I know the method rdd.first(), which gives me the first element in an RDD. Also there is the method …

java apache-spark rdd
'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (…

python apache-spark pyspark apache-spark-sql rdd