Top "Pyspark" questions

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

Pyspark: Filter dataframe based on multiple conditions

I want to filter the dataframe according to the following conditions: firstly (d < 5) and secondly (value of col2 not equal …

sql filter pyspark apache-spark-sql pyspark-sql
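
A common way to answer this (a minimal sketch; the column names d and col2 come from the excerpt, the sample data and comparison value are made up) is to combine parenthesized Column conditions with the & operator:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data with the two columns mentioned in the question.
    df = spark.createDataFrame([(1, 'A'), (7, 'B'), (3, 'NA')], ['d', 'col2'])

    # Each condition must be wrapped in parentheses; & is logical AND, ~ is NOT.
    filtered = df.filter((F.col('d') < 5) & (F.col('col2') != 'NA'))
    filtered.show()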
Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,…

apache-spark apache-spark-sql pyspark
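
The usual suggestion here is dropDuplicates with a subset of columns (a sketch; the column names are invented, since the excerpt only shows bare tuples):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = spark.createDataFrame(
        [('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1), ('Bar', 57, 'CA', 2)],
        ['name', 'age', 'country', 'score'])

    # Keep one row per name; which of the duplicate rows survives is arbitrary.
    deduped = data.dropDuplicates(['name'])
    deduped.show()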
Updating a dataframe column in spark

Looking at the new spark dataframe api, it is unclear whether it is possible to modify dataframe columns. How would …

python apache-spark pyspark apache-spark-sql spark-dataframe
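
DataFrames are immutable, so the usual pattern is to derive a new DataFrame with withColumn rather than modify one in place (a sketch; the id/amount columns and the transformations are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, -5), (2, 10)], ['id', 'amount'])

    # Replace the column with a transformed version of itself.
    df = df.withColumn('amount', F.col('amount') * 2)

    # Conditional "updates" are typically expressed with when/otherwise.
    df = df.withColumn('amount',
                       F.when(F.col('amount') < 0, 0).otherwise(F.col('amount')))
    df.show()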
How do I set the driver's python version in spark?

I'm using spark 1.4.0-rc2 so I can use python 3 with spark. If I add export PYSPARK_PYTHON=python3 to my .…

apache-spark pyspark
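
One way to handle this from Python itself (a sketch; the interpreter name is an assumption, and the same variables can be set in spark-env.sh or the shell instead) is to export PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON before the context is created:

    import os

    # Both must point at compatible interpreters, and must be set before
    # the SparkContext is created.
    os.environ['PYSPARK_PYTHON'] = 'python3'          # interpreter for the executors
    os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'   # interpreter for the driver

    from pyspark import SparkContext
    sc = SparkContext(appName='example')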
How to convert column with string type to int form in pyspark data frame?

I have a dataframe in pyspark. Some of its numerical columns contain 'nan' so when I am reading the data and …

python dataframe pyspark
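
Casting is normally done with Column.cast (a sketch; the column name and the 'nan' placeholder follow the excerpt's description rather than the asker's actual data):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('1',), ('nan',), ('42',)], ['amount'])

    # Non-numeric strings such as 'nan' become NULL after the cast instead of failing.
    df = df.withColumn('amount', F.col('amount').cast(IntegerType()))
    df.show()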
Split Spark Dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more …

apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql
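
A frequently suggested alternative to explode is functions.split combined with getItem, which keeps one output row per input row (a sketch; the delimiter, data and column names are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('a_b_c',), ('x_y_z',)], ['value'])

    # split returns an array column; getItem picks out the individual pieces.
    parts = F.split(F.col('value'), '_')
    df = (df.withColumn('first', parts.getItem(0))
            .withColumn('second', parts.getItem(1))
            .withColumn('third', parts.getItem(2)))
    df.show()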
How to turn off INFO logging in Spark?

I installed Spark using the AWS EC2 guide and I can launch the program fine using the bin/pyspark script …

python scala apache-spark hadoop pyspark
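
From PySpark code the quickest fix is usually setLogLevel on the SparkContext (a sketch); editing conf/log4j.properties achieves the same thing permanently but is not shown here:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # After this, only WARN and above is emitted by Spark's own loggers.
    spark.sparkContext.setLogLevel('WARN')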
Pyspark: display a spark data frame in a table format

I am using pyspark to read a parquet file like below: my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/…

python pandas pyspark spark-dataframe
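
DataFrame.show prints an ASCII table, and toPandas is a common option for nicer rendering in notebooks (a sketch; the parquet path in the excerpt is truncated, so a small stand-in DataFrame is used here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the parquet read in the question.
    my_df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])

    # First 20 rows as an ASCII table, without truncating long values.
    my_df.show(20, truncate=False)

    # Or pull a bounded sample into pandas for notebook-style rendering.
    print(my_df.limit(20).toPandas())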
PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I …

python apache-spark dataframe pyspark apache-spark-sql
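
Conditions inside when can be combined with & and | as long as each one is parenthesized (a sketch; the Gender column and the fill value 30 are assumptions beyond what the excerpt states):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(None, 'M'), (25, 'F')], ['Age', 'Gender'])

    # Fill Age only where it is null AND Gender is 'M'; otherwise keep the value.
    df = df.withColumn('Age',
                       F.when(F.col('Age').isNull() & (F.col('Gender') == 'M'), 30)
                        .otherwise(F.col('Age')))
    df.show()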
How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np df = spark.createDataFrame( [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))], ('session', "timestamp1", "id2")) …

apache-spark pyspark apache-spark-sql pyspark-sql
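
A common one-pass answer combines count, when, isnan and isNull across all columns (a sketch that reuses the DataFrame from the excerpt):

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
         (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', 'timestamp1', 'id2'))

    # One aggregation pass: for each column, count values that are NaN or NULL.
    df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
               for c in df.columns]).show()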