Top "Pyspark" questions

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

PySpark: replace strings in a Spark DataFrame column

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to …

python apache-spark pyspark
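One common approach is pyspark.sql.functions.regexp_replace, which rewrites substrings matching a regex. A minimal sketch, assuming a hypothetical column named words and some suffixes to strip:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; the question's real column is elided above.
df = spark.createDataFrame([("running",), ("jumped",)], ["words"])

# regexp_replace rewrites every match of the regex within the column.
stemmed = df.withColumn("words", F.regexp_replace("words", "ing$|ed$", ""))
stemmed.show()
```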
PySpark: Exception: Java gateway process exited before sending the driver its port number

I'm trying to run pyspark on my MacBook Air. When I try starting it up, I get the error: Exception: …

java python macos apache-spark pyspark
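This error usually means the JVM behind PySpark never started, most often because Spark cannot find a Java installation. A hedged sketch of one common fix, pointing JAVA_HOME at a valid JDK before the context is created (the path is a placeholder):

```python
import os

# Placeholder path: substitute the actual JDK location on your machine.
os.environ["JAVA_HOME"] = "/path/to/jdk"

from pyspark import SparkContext

# With JAVA_HOME visible, the Java gateway should start and hand back
# its port number instead of exiting.
sc = SparkContext("local", "gateway-test")
print(sc.version)
```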
How to get the name of a DataFrame column in PySpark?

In pandas, this can be done with column.name. But how do I do the same when it's a column of a Spark …

pyspark pyspark-sql
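Unlike pandas, a PySpark Column object carries no public name attribute, but the DataFrame itself exposes its column names. A minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "letter"])

# df.columns is a plain Python list of the column names, in order.
print(df.columns)       # ['id', 'letter']
print(df.columns[0])    # 'id'

# df.schema.names gives the same information via the schema object.
print(df.schema.names)
```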
Spark RDD to DataFrame in Python

I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and an example where the …

python apache-spark pyspark spark-dataframe
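Two common routes, sketched here with hypothetical data: pass column names to toDF(), or build Row objects and let createDataFrame infer the schema.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 45)])

# Route 1: name the columns directly on the RDD of tuples.
df = rdd.toDF(["name", "age"])

# Route 2: map to Row objects and let Spark infer the schema.
df2 = spark.createDataFrame(rdd.map(lambda t: Row(name=t[0], age=int(t[1]))))

df.show()
```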
How to convert a DataFrame back to a normal RDD in PySpark?

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method, which is not available on the DataFrame. All of …

python apache-spark pyspark
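df.rdd hands back the underlying data as an RDD of Row objects; mapping those to plain tuples restores pair-RDD-only methods such as partitionBy. A sketch, assuming hypothetical key/value columns (the 2 below is a placeholder partition count):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# .rdd returns Row objects; convert to (key, value) tuples so that
# pair-RDD methods like partitionBy apply again.
pair_rdd = df.rdd.map(lambda row: (row.key, row.value))

repartitioned = pair_rdd.partitionBy(2)
print(repartitioned.getNumPartitions())
```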
Filter a df when values match part of a string in PySpark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e., filter for) all rows where the URL …

python apache-spark pyspark apache-spark-sql
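Column.contains covers plain substring matching, and rlike takes a regular expression. A sketch with a hypothetical url column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("http://example.com/shop",), ("http://example.com/blog",)], ["url"]
)

# Keep rows whose URL contains the literal substring "shop".
df.filter(F.col("url").contains("shop")).show()

# rlike does the same with a regular expression instead.
df.filter(F.col("url").rlike("shop|store")).show()
```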
Is it possible to get the current Spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext. If I explicitly set it as …

apache-spark config pyspark
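sc.getConf().getAll() lists every explicitly-set configuration pair; note that properties left at their defaults (possibly including spark.worker.dir) may not show up. A minimal sketch:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf())

# getAll() returns the settings the context was started with as
# (key, value) tuples; unset defaults are not included.
for key, value in sc.getConf().getAll():
    print(key, "=", value)
```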
Query a Hive table in PySpark

I am using CDH 5.5. I have a table created in the Hive default database and am able to query it from the …

hive pyspark
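On Spark 2.x the session needs Hive support enabled; on the Spark 1.x that CDH 5.5 ships, the equivalent is pyspark.sql.HiveContext(sc). A sketch in the 2.x style, with a placeholder table name:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore so
# tables in the default database are visible to spark.sql().
spark = (SparkSession.builder
         .enableHiveSupport()
         .getOrCreate())

# "my_table" is a placeholder for the question's Hive table.
spark.sql("SELECT * FROM default.my_table LIMIT 10").show()
```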
Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of key-value pairs (K, V1), (K, V2), ..., (K, …

python apache-spark mapreduce pyspark rdd
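groupByKey followed by mapValues(list) is the direct translation; combineByKey builds the lists with per-partition combiners, which tends to scale better. A sketch with made-up pairs:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([("K", 1), ("K", 2), ("K", 3)])

# Direct route: group the values, then materialise each group as a list.
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())    # [('K', [1, 2, 3])]

# combineByKey does the same with per-partition combiners.
combined = pairs.combineByKey(
    lambda v: [v],               # first value for a key starts a list
    lambda acc, v: acc + [v],    # fold later values into the list
    lambda a, b: a + b,          # merge lists across partitions
)
print(combined.collect())
```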
How to join on multiple columns in PySpark?

I am using Spark 1.3 and would like to join on multiple columns using the Python interface (Spark SQL). The following works: I …

python apache-spark join pyspark apache-spark-sql
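On newer Spark versions, passing a list of column names joins on equality of each pair and keeps one copy of each join column; on Spark 1.3, as in the question, a compound join condition does the same job. A sketch with hypothetical id/code columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a", 10)], ["id", "code", "x"])
right = spark.createDataFrame([(1, "a", 99)], ["id", "code", "y"])

# Newer Spark: a list of names joins on equality and deduplicates
# the join columns in the result.
joined = left.join(right, ["id", "code"], "inner")

# Spark 1.3 style: build an explicit compound condition instead.
cond = (left.id == right.id) & (left.code == right.code)
joined_13 = left.join(right, cond, "inner")

joined.show()
```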