Top "Apache-spark-sql" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy in Spark using Window, but would like to sort this by descending, …

python apache-spark pyspark apache-spark-sql window-functions
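A minimal PySpark sketch of a descending sort inside a window, assuming the Spark 2.x+ SparkSession API; the data and column names are hypothetical:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: several scores per user.
df = spark.createDataFrame(
    [("u1", 3), ("u1", 1), ("u2", 5), ("u2", 2)],
    ["user_id", "score"],
)

# Partition by user and order each partition by score, descending.
w = Window.partitionBy("user_id").orderBy(col("score").desc())
df.withColumn("rank", row_number().over(w)).show()
```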
Drop spark dataframe from cache

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution; df1.cache() …

apache-spark apache-spark-sql spark-streaming
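The counterpart to cache() is unpersist(), which releases the cached blocks. A minimal sketch, shown with the 2.x SparkSession API and a hypothetical DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(1000)  # hypothetical DataFrame
df1.cache()              # mark for caching
df1.count()              # an action materializes the cache
df1.unpersist()          # drop the DataFrame from the cache
```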
Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below:

user_id  object_id  score
user_1   object_1   3
user_1   object_1   1
user_1   object_2   2
…

python apache-spark dataframe pyspark apache-spark-sql
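One common approach is row_number() over a per-group window, followed by a filter on the rank. A sketch with the sample data above and an assumed n = 2:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("user_1", "object_1", 3), ("user_1", "object_1", 1), ("user_1", "object_2", 2)],
    ["user_id", "object_id", "score"],
)

# Rank rows within each user_id by score, highest first, then keep the top 2.
w = Window.partitionBy("user_id").orderBy(col("score").desc())
(df.withColumn("rn", row_number().over(w))
   .where(col("rn") <= 2)
   .drop("rn")
   .show())
```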
Cannot find col function in pyspark

In pyspark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I try to look …

python apache-spark pyspark apache-spark-sql pyspark-sql
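The import does work at runtime: col is one of the functions that pyspark.sql.functions generates dynamically (from a name list, in 1.x releases), so IDEs and static analyzers may fail to resolve it even though Python finds it. A quick check:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # resolved at runtime, even if an IDE flags it

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])
df.where(col("x") > 1).show()
```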
How to convert Row of a Scala DataFrame into case class most efficiently?

Once I have a Row in Spark, from either a DataFrame or Catalyst, I want to convert it to a …

scala apache-spark apache-spark-sql
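In Scala, Spark 1.6+ handles this through the Dataset encoder, df.as[MyCaseClass]. As a hedged PySpark analogue in the language used elsewhere on this page, each Row can be unpacked into a (hypothetical) dataclass via asDict():

```python
from dataclasses import dataclass
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

@dataclass
class User:  # hypothetical record type, playing the role of a Scala case class
    name: str
    age: int

df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])

# Row.asDict() maps field names to values; ** unpacks them into the dataclass.
users = [User(**row.asDict()) for row in df.collect()]
```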
Derive multiple columns from a single column in a Spark DataFrame

I have a DataFrame with huge parseable metadata as a single string column; let's call it …

scala apache-spark dataframe apache-spark-sql user-defined-functions
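A common pattern is a UDF that returns a struct, followed by a select that expands the struct's fields into top-level columns. A PySpark sketch assuming a hypothetical "k1=...;k2=..." metadata format:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1=a;k2=b",)], ["metadata"])

# The struct schema declares the derived columns.
schema = StructType([
    StructField("k1", StringType()),
    StructField("k2", StringType()),
])

@udf(returnType=schema)
def parse(s):
    d = dict(kv.split("=") for kv in s.split(";"))  # hypothetical parser
    return d.get("k1"), d.get("k2")

# "parsed.*" expands the struct into separate columns.
df.withColumn("parsed", parse(col("metadata"))).select("metadata", "parsed.*").show()
```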
TypeError: got an unexpected keyword argument

The seemingly simple code below throws the following error: Traceback (most recent call last): File "/home/nirmal/process.py", line 165, …

python apache-spark pyspark apache-spark-sql user-defined-functions
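The excerpt is truncated, but one frequent cause of this error with PySpark UDFs is calling the wrapped function with keyword arguments; the wrapper only accepts positional Column arguments. A hedged sketch with hypothetical names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

add = udf(lambda x, y: x + y, IntegerType())

df.select(add(col("a"), col("b"))).show()   # works: positional arguments
# df.select(add(x=col("a"), y=col("b")))    # raises: got an unexpected keyword argument
```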
DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I …

apache-spark apache-spark-sql
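Repartitioning by the partition column before the write sends all rows for a given partition value to one task, so each output directory holds a single Parquet file. A sketch with hypothetical column and path names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", 1), ("2024-01-01", 2), ("2024-01-02", 3)],
    ["dt", "value"],
)

# One shuffle partition per distinct dt -> one file per output directory.
(df.repartition("dt")
   .write
   .partitionBy("dt")
   .parquet("/tmp/out"))  # hypothetical output path
```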
'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (…

python apache-spark pyspark apache-spark-sql rdd
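toDF() is not defined on RDDs themselves; constructing a SQLContext (1.x) or SparkSession (2.x+) is what patches it onto the RDD class. A minimal sketch:

```python
from pyspark.sql import Row, SparkSession

# Creating the session installs toDF() on RDDs; without it,
# 'PipelinedRDD' object has no attribute 'toDF'.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([Row(label=1.0, feature=0.5)])
rdd.toDF().show()
```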
Can we load Parquet file into Hive directly?

I know we can load a Parquet file using Spark SQL or Impala, but am wondering if we can do the …

hadoop hive apache-spark-sql hiveql parquet
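Hive can read Parquet natively, so one option is an external table declared over the existing files, with no load step through Spark or Impala. A sketch issued from PySpark with Hive support; the table name, schema, and path are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id INT, name STRING)
    STORED AS PARQUET
    LOCATION '/data/parquet/my_table'
""")
```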