Top "apache-spark-sql" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Applying function to Spark Dataframe Column

Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this …

scala apache-spark dataframe apache-spark-sql user-defined-functions
Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?

What is the difference between the DataFrame repartition() and DataFrameWriter partitionBy() methods? I understand both are used to "partition data based …

apache-spark-sql data-partitioning
Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others …

python apache-spark dataframe pyspark apache-spark-sql
SparkSQL vs Hive on Spark - Difference and pros and cons?

The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody …

apache-spark hadoop hive apache-spark-sql
SparkSQL: How to deal with null values in user defined function?

Given Table 1 with one column "x" of type String. I want to create Table 2 with a column "y" that is …

scala apache-spark apache-spark-sql user-defined-functions nullable
Median / quantiles within PySpark groupBy

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would …

apache-spark pyspark apache-spark-sql pyspark-sql
Dynamically bind variable/parameter in Spark SQL?

How do you bind a variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) …

scala apache-spark apache-spark-sql apache-spark-2.0
inferSchema in spark-csv package

When a CSV file is read as a dataframe in Spark, all the columns are read as strings. Is there any way to …

scala apache-spark apache-spark-sql spark-csv
Applying UDFs on GroupedData in PySpark (with functioning python example)

I have this python code that runs locally in a pandas dataframe: df_result = pd.DataFrame(df .groupby('A') .apply(…

python apache-spark pyspark apache-spark-sql user-defined-functions
How to add a new Struct column to a DataFrame

I'm currently trying to extract a database from MongoDB and use Spark to ingest into ElasticSearch with geo_points. The …

scala elasticsearch apache-spark etl apache-spark-sql