Top "PySpark" questions

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

Apache Spark dealing with case statements

I am translating SQL code to PySpark code and came across some SQL statements. I don't know how …

apache-spark pyspark spark-dataframe rdd pyspark-sql
Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy in Spark using Window, but would like to sort this in descending order, …

python apache-spark pyspark apache-spark-sql window-functions
Spark 2.1.0 session config settings (PySpark)

I am trying to override the default Spark session/Spark context configs, but it is picking up the entire node/cluster's resources. …

python apache-spark pyspark spark-dataframe
Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below:

user_id  object_id  score
user_1   object_1   3
user_1   object_1   1
user_1   object_2   2
…

python apache-spark dataframe pyspark apache-spark-sql
Cannot find col function in pyspark

In PySpark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I try to look …

python apache-spark pyspark apache-spark-sql pyspark-sql
TypeError: got an unexpected keyword argument

The seemingly simple code below throws the following error: Traceback (most recent call last): File "/home/nirmal/process.py", line 165, …

python apache-spark pyspark apache-spark-sql user-defined-functions
PySpark: java.lang.OutOfMemoryError: Java heap space

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It's running only on …

java apache-spark out-of-memory heap-memory pyspark
PySpark groupByKey returning pyspark.resultiterable.ResultIterable

I am trying to figure out why my groupByKey is returning the following: [(0, <pyspark.resultiterable.ResultIterable object at 0x7…

python apache-spark pyspark
'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (…

python apache-spark pyspark apache-spark-sql rdd
Pyspark: repartition vs partitionBy

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've …

apache-spark pyspark rdd