Top "apache-spark-sql" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

How to find the count of null and NaN values for each column in a PySpark DataFrame efficiently?

import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))
…

apache-spark pyspark apache-spark-sql pyspark-sql
Take n rows from a spark dataframe and pass to toPandas()

I have this code:

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).…

python apache-spark-sql spark-dataframe
What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them …

performance apache-spark hadoop apache-spark-sql
Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL …

python apache-spark pyspark apache-spark-sql
How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it …

apache-spark apache-spark-sql
How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL). The following works: I …

python apache-spark join pyspark apache-spark-sql
How to import multiple csv files in a single load?

Consider I have a defined schema for loading 10 CSV files in a folder. Is there a way to automatically load …

apache-spark apache-spark-sql spark-dataframe
How to count unique ID after groupBy in pyspark

I'm using the following code to aggregate students per year. The purpose is to know the total number of student …

python pyspark apache-spark-sql
How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark …

python scala apache-spark apache-spark-sql pyspark
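A sketch of the connection configuration in PySpark; every detail below (URL, table names, credentials, driver class) is a placeholder, and running it requires a reachable database plus the matching JDBC driver jar on the classpath (e.g. via `spark.jars`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc").getOrCreate()

url = "jdbc:postgresql://localhost:5432/mydb"        # hypothetical
props = {"user": "user", "password": "secret",       # hypothetical
         "driver": "org.postgresql.Driver"}

# Read: returns a DataFrame backed by the remote table.
df = spark.read.jdbc(url=url, table="some_table", properties=props)

# Write: mode can be 'append', 'overwrite', 'ignore', or 'error'.
df.write.jdbc(url=url, table="some_table_copy", mode="append",
              properties=props)
```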
DataFrame join optimization - Broadcast Hash Join

I am trying to efficiently join two DataFrames, one of which is large and the other a bit smaller. …

apache-spark dataframe apache-spark-sql apache-spark-1.4