Top "Apache-spark" questions

Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing.

How to read multiple text files into a single RDD?

I want to read a bunch of text files from an HDFS location and perform mapping on it in an …

apache-spark
Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a DataFrame, when …

apache-spark dataframe apache-spark-sql aggregate-functions
How to create an empty DataFrame with a specified schema?

I want to create a DataFrame with a specified schema in Scala. I have tried to use JSON read (I …

scala apache-spark dataframe apache-spark-sql
Spark - SELECT WHERE or filtering?

What's the difference between selecting with a where clause and filtering in Spark? Are there any use cases in which …

apache-spark apache-spark-sql
How to loop through each row of a DataFrame in PySpark

E.g. sqlContext = SQLContext(sc) sample=sqlContext.sql("select Name ,age ,city from user") sample.show() The above statement prints …

apache-spark dataframe for-loop pyspark apache-spark-sql
How to export data from Spark SQL to CSV

This command works with HiveQL: insert overwrite directory '/data/home.csv' select * from testtable; But with Spark SQL I'm …

hadoop apache-spark export-to-csv hiveql apache-spark-sql
PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = …

apache-spark hive pyspark apache-spark-sql hiveql
Spark DataFrame: distinguish columns with duplicated names

As far as I know, in a Spark DataFrame multiple columns can have the same name, as shown in the below …

python apache-spark dataframe pyspark apache-spark-sql
How to create a DataFrame from a text file in Spark

I have a text file on HDFS and I want to convert it to a DataFrame in Spark. I …

scala apache-spark dataframe apache-spark-sql rdd
Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,…

apache-spark apache-spark-sql pyspark