Top "apache-spark-sql" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,…

apache-spark apache-spark-sql pyspark
Updating a dataframe column in spark

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns. How would …

python apache-spark pyspark apache-spark-sql spark-dataframe
How do I check for equality using Spark Dataframe without SQL Query?

I want to select a column that equals a certain value. I am doing this in Scala and having …

scala apache-spark dataframe apache-spark-sql
Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame (Spark 2.0.0 DataFrame is a mere type alias for …

dataframe apache-spark apache-spark-sql rdd apache-spark-dataset
Split Spark Dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more …

apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql
PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I …

python apache-spark dataframe pyspark apache-spark-sql
Provide schema while reading csv file as a dataframe

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should …

scala apache-spark dataframe apache-spark-sql spark-csv
How to select the first row of each group?

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value") as "TotalValue") .sort($"Hour".asc, $"TotalValue".…

sql scala apache-spark dataframe apache-spark-sql
Spark specify multiple column conditions for dataframe join

How do I specify multiple column conditions when joining two DataFrames? For example, I want to run the following: val Lead_…

apache-spark apache-spark-sql rdd
How does createOrReplaceTempView work in Spark?

I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of …

apache-spark apache-spark-sql spark-dataframe