Top "Apache-spark" questions

Apache Spark is an open source distributed data processing engine written in Scala providing a unified API and distributed data sets to users for both batch and streaming processing.

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them …

performance apache-spark hadoop apache-spark-sql
Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL …

python apache-spark pyspark apache-spark-sql
Spark - Error "A master URL must be set in your configuration" when submitting an app

I have an Spark app which runs with no problem in local mode,but have some problems when submitting to …

scala apache-spark
How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it …

apache-spark apache-spark-sql
how to make saveAsTextFile NOT split output into multiple file?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into …

scala apache-spark
Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey

Can anyone explain the difference between reducebykey,groupbykey,aggregatebykey and combinebykey? I have read the documents regarding this , but couldn't …

apache-spark
Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current sparkcontext. If I explicitly set it as …

apache-spark config pyspark
Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, …

python apache-spark mapreduce pyspark rdd
How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using python interface (SparkSQL) The following works: I …

python apache-spark join pyspark apache-spark-sql
How to import multiple csv files in a single load?

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load …

apache-spark apache-spark-sql spark-dataframe