Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.
I see exit codes and exit statuses all the time when running Spark on YARN. Here are a few: CoarseGrainedExecutorBackend: …
hadoop apache-spark pyspark spark-dataframe yarn
I am working in Databricks. I have a dataframe that contains 500 rows, and I would like to create two dataframes on …
python pyspark spark-dataframe databricks
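A minimal PySpark sketch of the two usual ways to split one dataframe into two; the question is truncated, so the split criterion (random halves vs. a filter condition) is an assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(500).toDF("id")  # stand-in for the 500-row dataframe

    # Option 1: random split by fraction
    df_a, df_b = df.randomSplit([0.5, 0.5], seed=42)

    # Option 2: deterministic split on a condition
    df_first = df.where(df.id < 250)
    df_second = df.where(df.id >= 250)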
I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of …
apache-spark apache-spark-sql spark-dataframe emr parquet
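A sketch of reading several daily Parquet chunks into one DataFrame; the bucket name and the day=... directory layout are hypothetical stand-ins for the truncated path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # parquet() accepts multiple paths and unions them into one DataFrame
    paths = [
        "s3://my-bucket/events/day=2016-01-01/",
        "s3://my-bucket/events/day=2016-01-02/",
    ]
    df = spark.read.parquet(*paths)

    # With key=value directory names, a basePath keeps `day` as a partition column
    df_all = (
        spark.read.option("basePath", "s3://my-bucket/events/")
        .parquet("s3://my-bucket/events/day=*")
    )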
I can convert DataFrame to Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.…
java apache-spark spark-dataframe apache-spark-dataset
I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify …
apache-spark apache-spark-sql spark-dataframe partitioning apache-spark-dataset
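A sketch of folding many joins over a shared key with functools.reduce; the key column name "id" and the repartition-before-join step are assumptions, since DataFrames expose no partitioner the way pair RDDs do:

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
    df2 = spark.createDataFrame([(1, "b")], ["id", "y"])
    df3 = spark.createDataFrame([(1, "c")], ["id", "z"])

    # Repartition each frame on the key up front, then fold the joins together
    dfs = [df.repartition("id") for df in (df1, df2, df3)]
    joined = reduce(lambda a, b: a.join(b, on="id"), dfs)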
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
I have two Arrays as above, I …
arrays scala linear-regression spark-dataframe
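The question is cut off, but if the goal is to pair the two arrays into a two-column DataFrame, here is a PySpark sketch of that idea (the column names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    tvalues = [1.866393526974307, 2.864048126935307, 4.032486069215076]
    pvalues = [0.064020056478447, 0.004808399479386827, 8.914865448939047e-05]

    # Pair the arrays element-wise and name the two columns
    df = spark.createDataFrame(list(zip(tvalues, pvalues)), ["tvalue", "pvalue"])
    df.show()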
I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_…
python apache-spark pyspark spark-dataframe apache-spark-ml
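One common next step is scoring such a prediction DataFrame with BinaryClassificationEvaluator; the tiny hand-built predictions frame below is a stand-in for the real model output, assuming the default "label"/"rawPrediction" column names:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for model.transform(test_df); real output would also carry
    # "probability" and "prediction" columns
    predictions = spark.createDataFrame(
        [(0.0, Vectors.dense([0.8, 0.2])), (1.0, Vectors.dense([0.3, 0.7]))],
        ["label", "rawPrediction"],
    )

    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    print(evaluator.evaluate(predictions))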
I have a dataframe and I want to add for each row new_col=max(some_column0) grouped by some …
pyspark spark-dataframe apache-spark-2.0
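A minimal window-function sketch for attaching a per-group maximum to every row; the grouping column name some_column1 is an assumption, since the question is cut off:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1), ("a", 3), ("b", 2)], ["some_column1", "some_column0"]
    )

    # max over a window partitioned by the grouping column, attached per row
    w = Window.partitionBy("some_column1")
    df = df.withColumn("new_col", F.max("some_column0").over(w))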
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (…
apache-spark teradata pyspark spark-dataframe
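A sketch of a partitioned JDBC read, which is the usual way to keep a ~100-million-row fetch from arriving as one huge chunk; the URL, table, bounds, and partition column are all hypothetical stand-ins for the real Teradata details:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:teradata://host/DATABASE=mydb")
        .option("dbtable", "big_table")
        .option("user", "user")
        .option("password", "password")
        # Partition the read so rows stream in parallel chunks
        .option("partitionColumn", "id")
        .option("lowerBound", "1")
        .option("upperBound", "100000000")
        .option("numPartitions", "16")
        .load()
    )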
How can I get the first non-null values from a group by? I tried using first with coalesce F.first(…
apache-spark pyspark spark-dataframe apache-spark-1.6
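A sketch using the ignorenulls flag of first(), which skips nulls within each group; the column names are illustrative, and whether this flag is available in the asker's Spark 1.6 Python API is an assumption:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 3), ("a", None), ("b", 2)], ["k", "v"])

    # ignorenulls makes first() return the first non-null value per group
    result = df.groupBy("k").agg(F.first("v", ignorenulls=True).alias("first_v"))
    result.show()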