Top "Pyspark" questions

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

How to build a SparkSession in Spark 2.0 using pyspark?

I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me …

python sql apache-spark pyspark
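
A minimal sketch of the Spark 2.0+ builder API, which replaces the separate SparkContext/SQLContext setup from 1.x; the app name and config key here are only illustrative:

```python
from pyspark.sql import SparkSession

# getOrCreate() returns an existing session if one is already running.
spark = (SparkSession.builder
         .appName("example-app")                       # hypothetical app name
         .config("spark.sql.shuffle.partitions", "8")  # optional tuning config
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
```
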
pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the …

apache-spark filter pyspark apache-spark-sql
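
One common idiom (not the only option) is Column.isin, negated with ~ for exclusion; the sample data and list below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

keep = [1, 3]  # hypothetical list of values to match against

# Keep rows whose id appears in the list ...
included = df.filter(F.col("id").isin(keep))
# ... or drop them by negating the condition with ~.
excluded = df.filter(~F.col("id").isin(keep))
```
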
PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:

a    | b    | c
1    | 2    | 4
0    | null | null
null | 3    | 4

And I want to replace null values only …

apache-spark pyspark spark-dataframe
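
A short sketch of fillna's subset and dict forms, built on the sample values from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 4), (0, None, None), (None, 3, 4)], ["a", "b", "c"])

# fillna takes an optional subset, so only the named columns are filled ...
df.fillna(0, subset=["a", "b"]).show()

# ... or a dict mapping column name to fill value, one value per column.
df.fillna({"a": 0, "b": -1}).show()
```
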
Create Spark DataFrame. Can not infer schema for type: <type 'float'>

Could someone help me solve this problem I have with a Spark DataFrame? When I do myFloatRDD.toDF() I get an …

python apache-spark dataframe pyspark apache-spark-sql
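
If myFloatRDD holds bare floats, toDF() has no Row or tuple structure to infer a schema from. A sketch of two standard workarounds, with a stand-in RDD:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

floats = sc.parallelize([1.0, 2.0, 3.5])  # hypothetical stand-in for myFloatRDD

# Either wrap each value in a tuple so each element becomes a row ...
df1 = floats.map(lambda x: (x,)).toDF(["value"])

# ... or pass the element type explicitly so no inference is needed.
df2 = spark.createDataFrame(floats, FloatType())
```
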
How to replace all Null values of a dataframe in Pyspark

I have a DataFrame in pyspark with more than 300 columns. Among these there are some columns with values …

dataframe null pyspark
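
A hedged sketch: fillna without a subset applies to every column whose type matches the fill value, so a couple of chained calls can cover all 300+ columns at once:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None, "x"), (None, 2, None)], ["a", "b", "c"])

# 0 fills the numeric columns, "" fills the string columns; columns of
# other types are left untouched by each call.
df.fillna(0).fillna("").show()
```
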
Pyspark dataframe operator "IS NOT IN"

I would like to rewrite this from R to PySpark; any nice-looking suggestions?

array <- c(1,2,3)
dataset <…

pyspark
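
There is no dedicated "IS NOT IN" operator on DataFrame columns; the usual idiom negates isin() with ~. A sketch with a stand-in for the R vector c(1,2,3):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (4,)], ["value"])

exclude = [1, 2, 3]  # equivalent of the R vector c(1,2,3)

# ~isin(...) keeps only rows whose value is NOT in the list.
df.filter(~F.col("value").isin(exclude)).show()
```
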
Unable to infer schema when loading Parquet file

response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(…

apache-spark pyspark parquet
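
This error usually means the Parquet path contains no data files whose footers Spark could read to infer a schema. One workaround is to supply the schema explicitly on read; the column names, types, and path below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# With an explicit schema, no inference is attempted at all.
schema = StructType([
    StructField("eid", IntegerType(), True),       # hypothetical columns
    StructField("response", StringType(), True),
])

df = spark.read.schema(schema).parquet("/tmp/outcomes.parquet")  # hypothetical path
```
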
Spark union of multiple RDDs

In my Pig code I do this: all_combined = Union relation1, relation2, relation3, relation4, relation5, relation6. I want to do …

python apache-spark pyspark rdd
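
SparkContext.union accepts a whole list of RDDs, which maps naturally onto Pig's variadic UNION without chaining rdd1.union(rdd2).union(...); a sketch with six stand-in RDDs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Six sample RDDs standing in for relation1 .. relation6.
rdds = [sc.parallelize(range(i, i + 3)) for i in range(0, 18, 3)]

# One call unions the whole list at once.
all_combined = sc.union(rdds)
print(all_combined.collect())
```
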
pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a dataframe after groupby? For example: df.groupby('key').collect_…

list group-by set pyspark collect
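
Both functions live in pyspark.sql.functions and are applied inside agg() rather than directly on the grouped data; a minimal sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", 1), ("k1", 1), ("k2", 2)], ["key", "value"])

# collect_list keeps duplicates; collect_set drops them.
df.groupBy("key").agg(
    F.collect_list("value").alias("values_list"),
    F.collect_set("value").alias("values_set"),
).show()
```
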
renaming columns for pyspark dataframe aggregates

I am analysing some data with pyspark dataframes. Suppose I have a dataframe df that I am aggregating: df.groupBy("…

dataframe pyspark
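
Calling alias() on each aggregate inside agg() replaces auto-generated names like "avg(money)"; a sketch with hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 1.0), ("g1", 3.0), ("g2", 2.0)], ["group", "money"])

# alias() sets the output column name for each aggregate expression.
df.groupBy("group").agg(
    F.avg("money").alias("avg_money"),
    F.sum("money").alias("total_money"),
).show()
```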