Top "Spark-dataframe" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Do exit codes and exit statuses mean anything in Spark?

I see exit codes and exit statuses all the time when running Spark on YARN. Here are a few: CoarseGrainedExecutorBackend: …

hadoop apache-spark pyspark spark-dataframe yarn
How to slice a PySpark dataframe into two, row-wise

I am working in Databricks. I have a dataframe that contains 500 rows, and I would like to create two dataframes on …

python pyspark spark-dataframe databricks
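
One way to approach this, sketched in Scala (the question is tagged pyspark, but the DataFrame API is analogous): number the rows with a window function and filter on the row number. The split point, the ordering column, and the 500-row input are assumptions for illustration, not the asker's data.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val spark = SparkSession.builder.appName("slice-example").getOrCreate()
    import spark.implicits._

    // Hypothetical 500-row dataframe standing in for the asker's data
    val df = spark.range(500).toDF("id")

    // Number the rows deterministically, then filter on the row number.
    // Note: a window with no partitionBy funnels all rows through one partition,
    // which is fine at 500 rows but worth knowing about on large data.
    val w = Window.orderBy($"id")            // ordering column is an assumption
    val numbered = df.withColumn("rn", row_number().over(w))

    val firstPart  = numbered.filter($"rn" <= 100).drop("rn")   // split point chosen arbitrarily
    val secondPart = numbered.filter($"rn" > 100).drop("rn")
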
How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of …

apache-spark apache-spark-sql spark-dataframe emr parquet
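
A common route here is Parquet schema merging, which Spark leaves off by default because it is expensive. A minimal Scala sketch, with placeholder paths standing in for the asker's daily S3 chunks:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

    // Enable schema merging for this read only; the paths are placeholders
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("s3a://bucket/table/day1", "s3a://bucket/table/day2")

    merged.printSchema()
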
How to convert DataFrame to Dataset in Apache Spark in Java?

I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.…

java apache-spark spark-dataframe apache-spark-dataset
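
In Scala the conversion is a one-liner once the implicit encoders are in scope; in Java the rough equivalent is df.as(Encoders.bean(Person.class)) with a bean-style Person class. A minimal Scala sketch with hypothetical input data:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder.appName("df-to-ds").getOrCreate()
    import spark.implicits._                  // supplies the Encoder that .as[Person] needs

    // Hypothetical rows; the asker's DataFrame is truncated in the excerpt
    val df = Seq(("Ann", 30L), ("Bob", 25L)).toDF("name", "age")

    val people = df.as[Person]                // column names/types must line up with the case class
    people.show()
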
Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify …

apache-spark apache-spark-sql spark-dataframe partitioning apache-spark-dataset
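
A sketch of the usual approach: repartition both sides on the shared key before joining so matching keys are co-located, or bucket the tables if the same join runs repeatedly. Frame contents and the bucket count below are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("join-partitioning").getOrCreate()
    import spark.implicits._

    // Hypothetical frames sharing a key column
    val left  = Seq((1, "a"), (2, "b")).toDF("key", "l")
    val right = Seq((1, "x"), (2, "y")).toDF("key", "r")

    // Repartition both sides on the join key so matching keys land in the same partition
    val joined = left.repartition($"key").join(right.repartition($"key"), "key")

    // If the same key is joined on repeatedly, bucketing avoids re-shuffling each run
    // (bucket count of 8 is arbitrary; saveAsTable writes to the session catalog)
    left.write.bucketBy(8, "key").sortBy("key").saveAsTable("left_bucketed")
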
How to create a DataFrame from multiple arrays in Spark Scala?

    val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
    val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

I have two Arrays as above, I …

arrays scala linear-regression spark-dataframe
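
A sketch of one way to do this: zip the arrays into tuples and name the columns via toDF. The arrays are shortened to the first two values from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("arrays-to-df").getOrCreate()
    import spark.implicits._

    // First two values from the question, shortened here
    val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307)
    val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827)

    // Zip the arrays into (t, p) pairs, then name the columns
    val df = tvalues.zip(pvalues).toSeq.toDF("tvalue", "pvalue")
    df.show()
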
Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_…

python apache-spark pyspark spark-dataframe apache-spark-ml
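
The question is about PySpark, where the usual answer is a small UDF that indexes into the probability vector; the same idea in Scala looks like this. Column names follow the ML defaults, and the function name is made up for illustration:

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    // "predictions" stands for the output of LogisticRegressionModel.transform,
    // which carries a "probability" vector column by default
    def withPositiveClassProb(predictions: DataFrame): DataFrame = {
      val secondElement = udf((v: Vector) => v(1))   // probability of class 1
      predictions.withColumn("prob_1", secondElement(predictions("probability")))
    }
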
Spark join raises "Detected cartesian product for INNER join"

I have a dataframe, and for each row I want to add new_col=max(some_column0) grouped by some …

pyspark spark-dataframe apache-spark-2.0
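
Since the goal is a per-group maximum attached to every row, a window aggregate avoids the self-join that usually triggers the cartesian-product check. A Scala sketch over hypothetical data (the question is tagged pyspark; the API is analogous):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder.appName("window-max").getOrCreate()
    import spark.implicits._

    // Hypothetical data: attach the per-group max to every row without a self-join,
    // which is what raises the "Detected cartesian product" analysis error
    val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("grp", "some_column0")

    val byGroup = Window.partitionBy($"grp")
    val result  = df.withColumn("new_col", max($"some_column0").over(byGroup))

    // If a genuine cross join is intended, Spark 2.x requires asking for it explicitly:
    // spark.conf.set("spark.sql.crossJoin.enabled", "true")
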
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (…

apache-spark teradata pyspark spark-dataframe
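
The usual first step is to parallelize the JDBC read with partitioning options; otherwise Spark pulls the whole table through a single connection. A Scala sketch with placeholder connection details and bounds:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

    // URL, table, column and bounds below are placeholders, not the asker's setup
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:teradata://host/DATABASE=mydb")
      .option("dbtable", "big_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")     // a roughly uniformly distributed numeric column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .load()
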
Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce F.first(…

apache-spark pyspark spark-dataframe apache-spark-1.6
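
On Spark 2.x the direct route is the ignoreNulls flag on first; whether that behaves correctly on 1.6 is exactly what the question is probing, so treat this Scala sketch over made-up data as the general shape rather than a confirmed 1.6 answer:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.first

    val spark = SparkSession.builder.appName("first-non-null").getOrCreate()
    import spark.implicits._

    // Made-up data with nulls mixed into the value column
    val df = Seq(("a", None: Option[Int]), ("a", Some(1)), ("b", Some(2))).toDF("key", "value")

    // first(..., ignoreNulls = true) skips nulls within each group
    val result = df.groupBy("key").agg(first($"value", ignoreNulls = true).as("value"))
    result.show()
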