Top "Apache-spark-sql" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Count number of non-NaN entries in each column of Spark dataframe with Pyspark

I have a very large dataset that is loaded in Hive. It consists of about 1.9 million rows and 1450 columns. I …

python apache-spark dataframe pyspark apache-spark-sql
What is the difference between Apache Spark SQLContext vs HiveContext?

What are the differences between Apache Spark SQLContext and HiveContext? Some sources say that since the HiveContext is a superset …

apache-spark hive apache-spark-sql
Spark DataFrame: How to efficiently split a DataFrame for each group based on the same column values

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value").alias("TotalValue")) .sort($"Hour".asc,$"TotalValue".…

scala apache-spark apache-spark-sql spark-dataframe parquet
PySpark row-wise function composition

As a simplified example, I have a dataframe "df" with columns "col1,col2" and I want to compute a row-wise …

python apache-spark pyspark apache-spark-sql
Spark Dataset API - join

I am trying to use the Spark Dataset API but I am having some issues doing a simple join. Let's …

scala apache-spark apache-spark-sql apache-spark-dataset
Spark add new column to dataframe with value from previous row

I'm wondering how I can achieve the following in Spark (PySpark). Initial DataFrame: +--+---+ |id|num| +--+---+ |4 |9.0| +--+…

python apache-spark dataframe pyspark apache-spark-sql
How can I pass extra parameters to UDFs in Spark SQL?

I want to parse the date columns in a DataFrame, and for each date column, the resolution for the date …

scala apache-spark apache-spark-sql user-defined-functions
DATEDIFF in Spark SQL

I am new to Spark SQL. We are migrating data from SQL Server to Databricks. I am using Spark SQL. …

apache-spark-sql datediff databricks
How to save a partitioned parquet file in Spark 2.1?

I am trying to test how to write data in HDFS 2.7 using Spark 2.1. My data is a simple sequence of …

scala apache-spark apache-spark-sql parquet
How to create correct data frame for classification in Spark ML

I am trying to run random forest classification using the Spark ML API, but I am having issues with creating …

scala apache-spark apache-spark-sql apache-spark-mllib