Top "Apache-spark" questions

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing.

Updating a dataframe column in spark

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns. How would …

python apache-spark pyspark apache-spark-sql spark-dataframe

How do I set the driver's python version in spark?

I'm using Spark 1.4.0-rc2 so I can use Python 3 with Spark. If I add export PYSPARK_PYTHON=python3 to my .…

apache-spark pyspark

How do I check for equality using Spark Dataframe without SQL Query?

I want to select a column that equals a certain value. I am doing this in Scala and having …

scala apache-spark dataframe apache-spark-sql

Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what is the difference between an RDD and DataFrame (Spark 2.0.0 DataFrame is a mere type alias for …

dataframe apache-spark apache-spark-sql rdd apache-spark-dataset

Apache Spark: map vs mapPartitions?

What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks. (…

performance scala apache-spark rdd

Split Spark Dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more …

apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql

Add jars to a Spark Job - spark-submit

True ... it has been discussed quite a lot. However, there is a lot of ambiguity and some of the answers …

java scala apache-spark jar spark-submit

How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script …

python scala apache-spark hadoop pyspark

PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I …

python apache-spark dataframe pyspark apache-spark-sql

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. spark Eclipse on windows 7

I'm not able to run a simple Spark job in Scala IDE (Maven Spark project) installed on Windows 7 Spark core …

eclipse scala apache-spark