Top "pyspark" questions

The Spark Python API (PySpark) exposes the Spark programming model to Python.

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), …

python timestamp apache-spark pyspark
How to count unique ID after groupBy in pyspark

I'm using the following code to aggregate students per year. The purpose is to know the total number of student …

python pyspark apache-spark-sql
How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark …

python scala apache-spark apache-spark-sql pyspark
How to perform union on two DataFrames with different numbers of columns in spark?

I have 2 DataFrames as follows: I need union like this: The unionAll function doesn't work because the number and the …

apache-spark pyspark apache-spark-sql
How to find median and quantiles using Spark

How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is …

python apache-spark median rdd pyspark
PySpark - rename more than one column using withColumnRenamed

I want to change names of two columns using spark withColumnRenamed function. Of course, I can write: data = sqlContext.createDataFrame([(1,2), (3,4)], […

apache-spark pyspark apache-spark-sql rename
get datatype of column using pyspark

We are reading data from MongoDB Collection. Collection column has two different values (e.g.: (bson.Int64,int) (int,float) ). …

apache-spark pyspark apache-spark-sql databricks
How to write the resulting RDD to a csv file in Spark python

I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). This has output in this format: [(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....] …

python csv apache-spark pyspark file-writing
How to link PyCharm with PySpark?

I'm new to Apache Spark and apparently I installed apache-spark with Homebrew on my MacBook: Last login: Fri Jan 8 12:52:04 on …

python apache-spark pyspark pycharm homebrew
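One classic fix is to point the interpreter at the PySpark sources under `SPARK_HOME` (e.g. in PyCharm's run-configuration environment or an interpreter startup script). A sketch with a placeholder Homebrew-style path, which is an assumption about the installation:

```python
import glob
import os
import sys

# Placeholder: adjust to wherever Spark is actually installed
os.environ.setdefault("SPARK_HOME", "/usr/local/opt/apache-spark/libexec")
spark_home = os.environ["SPARK_HOME"]

# PySpark lives under $SPARK_HOME/python, with py4j shipped as a zip next to it
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))
```

On modern setups the simpler route is `pip install pyspark` into the interpreter PyCharm uses, which makes the path juggling unnecessary.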
aggregate function Count usage with groupBy in Spark

I'm trying to make multiple operations in one line of code in PySpark, and I'm not sure if that's possible for …

java scala apache-spark pyspark apache-spark-sql