Top "Spark-dataframe" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Do exit codes and exit statuses mean anything in Spark?

I see exit codes and exit statuses all the time when running Spark on YARN. Here are a few: CoarseGrainedExecutorBackend: …

hadoop apache-spark pyspark spark-dataframe yarn
How to slice a PySpark dataframe into two, row-wise

I am working in Databricks. I have a dataframe that contains 500 rows, and I would like to create two dataframes on …

python pyspark spark-dataframe databricks
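
One way to approach this, sketched in Scala (the question is tagged pyspark, but the DataFrame API is analogous): number the rows with a window function and filter on the row number. The split point, the ordering column, and the 500-row input are assumptions for illustration, not the asker's data.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val spark = SparkSession.builder.appName("slice-example").getOrCreate()
    import spark.implicits._

    // Hypothetical 500-row dataframe standing in for the asker's data
    val df = spark.range(500).toDF("id")

    // Number the rows deterministically, then filter on the row number.
    // Note: a window with no partitionBy funnels all rows through one partition,
    // which is fine at 500 rows but worth knowing about on large data.
    val w = Window.orderBy($"id")            // ordering column is an assumption
    val numbered = df.withColumn("rn", row_number().over(w))

    val firstPart  = numbered.filter($"rn" <= 100).drop("rn")   // split point chosen arbitrarily
    val secondPart = numbered.filter($"rn" > 100).drop("rn")
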
How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of …

apache-spark apache-spark-sql spark-dataframe emr parquet
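
A common route here is Parquet schema merging, which Spark leaves off by default because it is expensive. A minimal Scala sketch, with placeholder paths standing in for the asker's daily S3 chunks:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

    // Enable schema merging for this read only; the paths are placeholders
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("s3a://bucket/table/day1", "s3a://bucket/table/day2")

    merged.printSchema()
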
How to convert DataFrame to Dataset in Apache Spark in Java?

I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.…

java apache-spark spark-dataframe apache-spark-dataset
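
In Scala the conversion is a one-liner once the implicit encoders are in scope; in Java the rough equivalent is df.as(Encoders.bean(Person.class)) with a bean-style Person class. A minimal Scala sketch with hypothetical input data:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder.appName("df-to-ds").getOrCreate()
    import spark.implicits._                  // supplies the Encoder that .as[Person] needs

    // Hypothetical rows; the asker's DataFrame is truncated in the excerpt
    val df = Seq(("Ann", 30L), ("Bob", 25L)).toDF("name", "age")

    val people = df.as[Person]                // column names/types must line up with the case class
    people.show()
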
Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify …

apache-spark apache-spark-sql spark-dataframe partitioning apache-spark-dataset
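
A sketch of the usual approach: repartition both sides on the shared key before joining so matching keys are co-located, or bucket the tables if the same join runs repeatedly. Frame contents and the bucket count below are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("join-partitioning").getOrCreate()
    import spark.implicits._

    // Hypothetical frames sharing a key column
    val left  = Seq((1, "a"), (2, "b")).toDF("key", "l")
    val right = Seq((1, "x"), (2, "y")).toDF("key", "r")

    // Repartition both sides on the join key so matching keys land in the same partition
    val joined = left.repartition($"key").join(right.repartition($"key"), "key")

    // If the same key is joined on repeatedly, bucketing avoids re-shuffling each run
    // (bucket count of 8 is arbitrary; saveAsTable writes to the session catalog)
    left.write.bucketBy(8, "key").sortBy("key").saveAsTable("left_bucketed")
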
How to create a DataFrame from multiple arrays in Spark Scala?

    val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
    val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)

I have two Arrays as above, I …

arrays scala linear-regression spark-dataframe
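
A sketch of one way to do this: zip the arrays into tuples and name the columns via toDF. The arrays are shortened to the first two values from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("arrays-to-df").getOrCreate()
    import spark.implicits._

    // First two values from the question, shortened here
    val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307)
    val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827)

    // Zip the arrays into (t, p) pairs, then name the columns
    val df = tvalues.zip(pvalues).toSeq.toDF("tvalue", "pvalue")
    df.show()
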
Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_…

python apache-spark pyspark spark-dataframe apache-spark-ml
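
The question is about PySpark, where the usual answer is a small UDF that indexes into the probability vector; the same idea in Scala looks like this. Column names follow the ML defaults, and the function name is made up for illustration:

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    // "predictions" stands for the output of LogisticRegressionModel.transform,
    // which carries a "probability" vector column by default
    def withPositiveClassProb(predictions: DataFrame): DataFrame = {
      val secondElement = udf((v: Vector) => v(1))   // probability of class 1
      predictions.withColumn("prob_1", secondElement(predictions("probability")))
    }
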
Spark join raises "Detected cartesian product for INNER join"

I have a dataframe, and for each row I want to add new_col=max(some_column0) grouped by some …

pyspark spark-dataframe apache-spark-2.0
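
Since the goal is a per-group maximum attached to every row, a window aggregate avoids the self-join that usually triggers the cartesian-product check. A Scala sketch over hypothetical data (the question is tagged pyspark; the API is analogous):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder.appName("window-max").getOrCreate()
    import spark.implicits._

    // Hypothetical data: attach the per-group max to every row without a self-join,
    // which is what raises the "Detected cartesian product" analysis error
    val df = Seq(("a", 1), ("a", 3), ("b", 2)).toDF("grp", "some_column0")

    val byGroup = Window.partitionBy($"grp")
    val result  = df.withColumn("new_col", max($"some_column0").over(byGroup))

    // If a genuine cross join is intended, Spark 2.x requires asking for it explicitly:
    // spark.conf.set("spark.sql.crossJoin.enabled", "true")
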
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (…

apache-spark teradata pyspark spark-dataframe
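
The usual first step is to parallelize the JDBC read with partitioning options; otherwise Spark pulls the whole table through a single connection. A Scala sketch with placeholder connection details and bounds:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

    // URL, table, column and bounds below are placeholders, not the asker's setup
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:teradata://host/DATABASE=mydb")
      .option("dbtable", "big_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")     // a roughly uniformly distributed numeric column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .load()
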
Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce F.first(…

apache-spark pyspark spark-dataframe apache-spark-1.6
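
On Spark 2.x the direct route is the ignoreNulls flag on first; whether that behaves correctly on 1.6 is exactly what the question is probing, so treat this Scala sketch over made-up data as the general shape rather than a confirmed 1.6 answer:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.first

    val spark = SparkSession.builder.appName("first-non-null").getOrCreate()
    import spark.implicits._

    // Made-up data with nulls mixed into the value column
    val df = Seq(("a", None: Option[Int]), ("a", Some(1)), ("b", Some(2))).toDF("key", "value")

    // first(..., ignoreNulls = true) skips nulls within each group
    val result = df.groupBy("key").agg(first($"value", ignoreNulls = true).as("value"))
    result.show()
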