Top "Apache-spark-sql" questions

Apache Spark SQL is a module for "SQL and structured data processing" on Spark, a fast, general-purpose cluster-computing system.

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql …

python apache-spark pyspark apache-spark-sql apache-spark-ml
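A minimal PySpark sketch of the usual answer (column names here are illustrative): wrap Vectors.dense in a UDF with a VectorUDT return type. On Spark 3.1+, the built-in pyspark.ml.functions.array_to_vector avoids the Python UDF entirely.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, [0.1, 0.2, 0.3])], ["id", "features_arr"])

# Wrap Vectors.dense in a UDF so the array<double> column becomes an ML Vector.
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

df.withColumn("features", to_vector("features_arr")).printSchema()
```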
Apache Spark -- Assign the result of UDF to multiple dataframe columns

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need …

python apache-spark pyspark apache-spark-sql user-defined-functions
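The usual pattern, sketched with made-up column names: have the UDF return a single struct, then unpack its fields into separate columns in one select.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2017-01-01 foo",)], ["raw"])

# The UDF returns one struct; selecting its fields yields multiple columns.
schema = StructType([
    StructField("token", StringType(), False),
    StructField("length", IntegerType(), False),
])
parse = udf(lambda s: (s.split(" ")[1], len(s)), schema)

(df.withColumn("parsed", parse(col("raw")))
   .select("raw", "parsed.token", "parsed.length")
   .show())
```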
Spark functions vs UDF performance?

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original …

performance apache-spark pyspark apache-spark-sql user-defined-functions
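A sketch of the comparison the question is about: the built-in function stays in the JVM and is visible to the Catalyst optimizer, while a Python UDF serializes every row out to a Python worker and back.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, concat_ws, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "b")], ["x", "y"])

# Python UDF: each row round-trips between the JVM and a Python process.
concat_udf = udf(lambda a, b: a + "-" + b, StringType())
df.withColumn("z", concat_udf(col("x"), col("y")))

# Built-in function: no serialization, and Catalyst can optimize around it.
df.withColumn("z", concat_ws("-", col("x"), col("y")))
```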
Scala and Spark UDF function

I made a simple UDF to convert or extract some values from a time field in a temp table in Spark. …

scala apache-spark apache-spark-sql apache-zeppelin
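The question itself is in Scala; for consistency with the other sketches, here is the general pattern in PySpark (table and column names are made up): register the UDF with the session so SQL over the temp table can call it.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2017-01-01 10:30:00",)], ["ts"])
df.createOrReplaceTempView("logs")

# Registering by name makes the UDF callable from SQL statements.
spark.udf.register("hour_of",
                   lambda ts: int(ts.split(" ")[1].split(":")[0]),
                   IntegerType())

spark.sql("SELECT ts, hour_of(ts) AS hour FROM logs").show()
```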
Methods for writing Parquet files using Python?

I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can …

python apache-spark apache-spark-sql parquet snappy
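A minimal sketch of one common answer, pyarrow (the file name is a placeholder; pandas.DataFrame.to_parquet wraps the same machinery):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Convert to an Arrow table and write it out; snappy compression covers the
# question's bonus requirement (and is pyarrow's default codec).
table = pa.Table.from_pandas(df)
pq.write_table(table, "example.parquet", compression="snappy")
```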
How to get keys and values from MapType column in SparkSQL DataFrame

I have data in a parquet file which has 2 fields: object_id: String and alpha: Map<>. It is …

scala apache-spark dataframe apache-spark-sql apache-spark-dataset
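A PySpark sketch of the two usual approaches, with made-up data matching the question's schema: explode for one row per map entry, or map_keys/map_values (Spark 2.3+) for array columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, map_keys, map_values

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("obj1", {"a": 1, "b": 2})], ["object_id", "alpha"])

# explode yields one (key, value) row per map entry...
df.select("object_id", explode("alpha")).show()

# ...while map_keys/map_values extract them as array columns in place.
df.select("object_id", map_keys("alpha"), map_values("alpha")).show()
```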
"sparkContext was shut down" while running spark on a large dataset

When running a Spark job on a cluster past a certain data size (~2.5 GB), I am getting either "Job cancelled because SparkContext …

scala apache-spark yarn apache-spark-sql
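This error usually means the driver or executors were killed, often by YARN for exceeding memory limits once the data grows. A sketch of the kind of settings answers point at; the values are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# Give executors headroom and spread the shuffle wider; on Spark < 2.3 the
# overhead setting is named spark.yarn.executor.memoryOverhead instead.
spark = (SparkSession.builder
         .appName("large-job")
         .config("spark.executor.memory", "6g")
         .config("spark.executor.memoryOverhead", "1g")
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())
```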
Explode (transpose?) multiple columns in Spark SQL table

I am using Spark SQL (I mention that it is in Spark in case that affects the SQL syntax - …

sql apache-spark apache-spark-sql hiveql
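In the SQL/HiveQL form the answer is typically LATERAL VIEW posexplode; in PySpark (kept here for consistency with the other sketches), arrays_zip (Spark 2.4+) pairs the arrays element-wise so a single explode transposes them in lockstep. Column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip, explode, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"], [10, 20])],
                           ["id", "letters", "numbers"])

# Zip the arrays, explode once, then unpack the struct fields.
(df.withColumn("zipped", explode(arrays_zip("letters", "numbers")))
   .select("id", col("zipped.letters"), col("zipped.numbers"))
   .show())
```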
Queries with streaming sources must be executed with writeStream.start();

I'm trying to read messages from Kafka (version 10) in Spark and print them. import spark.implicits._ val …

scala apache-spark-sql spark-streaming
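The error is Structured Streaming refusing actions like show() or collect() on a streaming DataFrame: the stream must be attached to a sink and started. A PySpark sketch with placeholder broker and topic names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my-topic")
          .load())

# Printing a stream requires a sink plus start(), not show().
query = (stream.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()
```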