Top "apache-spark" questions

Apache Spark is an open-source, distributed data-processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing.

Spark get collection sorted by value

I was following this tutorial: http://spark.apache.org/docs/latest/quick-start.html. I first created a collection from a …

sorting apache-spark word-count
What are SparkSession config options?

I am trying to use SparkSession to convert the JSON data in a file to an RDD with Spark Notebook. I already …

json apache-spark spark-notebook
Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have a Spark DataFrame whose take(5) top rows are as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), …

python timestamp apache-spark pyspark
What are workers, executors, cores in Spark Standalone cluster?

I read the Cluster Mode Overview and I still can't understand the different processes in a Spark Standalone cluster and the …

apache-spark distributed-computing
How to find Spark RDD/DataFrame size?

I know how to find the size of a file in Scala, but how do I find the size of an RDD/DataFrame in Spark? …

scala apache-spark rdd
How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark …

python scala apache-spark apache-spark-sql pyspark
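
A sketch of the JDBC read and write paths in PySpark, wrapped as functions. The URL, table name, and credentials are placeholders to be supplied by the caller, and the matching JDBC driver jar must be on the Spark classpath (e.g. via `spark.jars`):

```python
# JDBC read/write helpers. Nothing here is executed against a real database;
# all connection details are caller-supplied placeholders.
from pyspark.sql import DataFrame, SparkSession

def read_jdbc(spark: SparkSession, url: str, table: str,
              user: str, password: str) -> DataFrame:
    return (spark.read.format("jdbc")
            .option("url", url)        # e.g. "jdbc:postgresql://host:5432/db"
            .option("dbtable", table)  # a table name or a "(subquery) alias"
            .option("user", user)
            .option("password", password)
            .load())

def write_jdbc(df: DataFrame, url: str, table: str,
               user: str, password: str) -> None:
    (df.write.format("jdbc")
       .option("url", url)
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .mode("append")   # or "overwrite", "ignore", "errorifexists"
       .save())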
DataFrame join optimization - Broadcast Hash Join

I am trying to efficiently join two DataFrames, one of which is large and the other somewhat smaller. …

apache-spark dataframe apache-spark-sql apache-spark-1.4
How to perform union on two DataFrames with different amounts of columns in spark?

I have 2 DataFrames as follows: I need a union like this: The unionAll function doesn't work because the number and the …

apache-spark pyspark apache-spark-sql
How to find median and quantiles using Spark

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is …

python apache-spark median rdd pyspark
PySpark - rename more than one column using withColumnRenamed

I want to change the names of two columns using Spark's withColumnRenamed function. Of course, I can write: data = sqlContext.createDataFrame([(1,2), (3,4)], […

apache-spark pyspark apache-spark-sql rename