Top "PySpark" questions

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

Filtering a PySpark DataFrame with a SQL-like IN clause

I want to filter a PySpark DataFrame with a SQL-like IN clause, as in: sc = SparkContext(); sqlc = SQLContext(sc); df = …

python sql apache-spark dataframe pyspark
Spark Error - Unsupported class file major version

I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in …

java python macos apache-spark pyspark
Where do you need to use lit() in Pyspark SQL?

I'm trying to make sense of where you need to use a lit value, which is defined as a literal …

python apache-spark pyspark apache-spark-sql
PySpark - Sum a column in dataframe and return results as int

I have a PySpark DataFrame with a column of numbers. I need to sum that column and then have the …

python dataframe sum pyspark
Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a …

python apache-spark pyspark spark-dataframe
"INSERT INTO ..." with SparkSQL HiveContext

I'm trying to run an insert statement with my HiveContext, like this: hiveContext.sql('insert into my_table (id, score) …

apache-spark apache-spark-sql pyspark apache-spark-1.5 hivecontext
Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs since they appear to be more high-level than RDDs and tend …

apache-spark pyspark apache-spark-sql
Trim string column in PySpark dataframe

I'm a beginner in Python and Spark. After creating a DataFrame from a CSV file, I would like to know how I …

apache-spark pyspark apache-spark-sql trim pyspark-sql
What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API. I use this code to load a tab-separated CSV file into a Spark DataFrame: lines = sc.textFile(…

python pandas apache-spark pyspark
How to add third-party Java JAR files for use in PySpark

I have some third-party database client libraries in Java. I want to access them through java_gateway.py, e.g.: …

python apache-spark pyspark py4j