Top "Spark-dataframe" questions

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system.

Scala: Spark SQL to_date(unix_timestamp) returning NULL

Spark Version: spark-2.0.1-bin-hadoop2.7 Scala: 2.11.8 I am loading a raw CSV into a DataFrame. In the CSV, although the column is …

scala apache-spark apache-spark-sql spark-dataframe spark-csv
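A common cause of this NULL is a date pattern that does not match the string: `unix_timestamp` silently returns NULL on a parse failure. A minimal sketch (column name and pattern are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

val spark = SparkSession.builder.appName("to_date-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("26/10/2016").toDF("raw")  // hypothetical input

// The pattern must match the string exactly; "yyyy-MM-dd" against this
// input would make unix_timestamp (and hence to_date) return NULL.
val parsed = df.withColumn(
  "date",
  to_date(unix_timestamp(col("raw"), "dd/MM/yyyy").cast("timestamp"))
)
parsed.show()
```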
Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)

I'm puzzled by the behaviour of the numPartitions parameter in the following methods: DataFrameReader.jdbc Dataset.repartition The official docs of …

apache-spark dataframe spark-dataframe spark-jdbc
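The short version: `read.jdbc`'s numPartitions controls how many parallel JDBC queries are issued at read time (each scanning a stride of the partition column), while `repartition(n)` shuffles an already-loaded Dataset into n partitions. A sketch with hypothetical connection details:

```scala
// 8 concurrent JDBC connections, each reading a range of "id":
val fromDb = spark.read.jdbc(
  url = "jdbc:postgresql://host/db",  // hypothetical
  table = "events",                   // hypothetical
  columnName = "id",                  // the partition column
  lowerBound = 0L,
  upperBound = 1000000L,
  numPartitions = 8,
  connectionProperties = new java.util.Properties()
)

// A full shuffle of the loaded data into 64 partitions, after the fact:
val reshuffled = fromDb.repartition(64)
```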
Spark SQL: How to consume json data from a REST service as DataFrame

I need to read some JSON data from a web service that's providing REST interfaces to query the data from …

apache-spark-sql spark-dataframe azure-hdinsight
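One workable pattern is to fetch the JSON on the driver and hand it to `spark.read.json` for schema inference. A sketch, assuming a small response and a hypothetical URL (on Spark 2.2+ `json` accepts a `Dataset[String]`; earlier 2.x versions take an `RDD[String]`):

```scala
import scala.io.Source

import spark.implicits._

// Fetch the REST response on the driver (URL is hypothetical):
val json = Source.fromURL("https://example.com/api/items").mkString

// Parallelize the payload and let Spark infer the schema:
val df = spark.read.json(Seq(json).toDS)
df.printSchema()
```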
Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, …

apache-spark spark-dataframe distributed-computing partitioning bigdata
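There is no single formula in the docs, but a widely used heuristic is 2–4 tasks per available core, so partitions can be derived from the cluster's parallelism. A sketch of that heuristic (the multiplier is a rule of thumb, not a Spark setting):

```scala
// defaultParallelism reflects the total cores available to the app:
val cores = spark.sparkContext.defaultParallelism

// Rule of thumb: 2-4x the core count; 3 is an arbitrary middle choice.
val targetPartitions = cores * 3

val rebalanced = df.repartition(targetPartitions)
```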
Why does Spark job fail with "Exit code: 52"

I have had Spark job failing with a trace like this one: ./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Container id: container_1455622885057_0016_01_000008 ./containers/application_1455622885057_0016/…

apache-spark yarn spark-dataframe
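Exit code 52 corresponds to Spark's `SparkExitCode.OOM`: an executor JVM died with an OutOfMemoryError. A sketch of settings that often help, with illustrative values (the right numbers depend on the job):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("oom-tuning-sketch")
  .config("spark.executor.memory", "8g")          // larger heap per executor
  .config("spark.sql.shuffle.partitions", "400")  // smaller shuffle partitions
  .getOrCreate()
```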
How to read ".gz" compressed file using spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using spark DF/DS? …

apache-spark apache-spark-sql spark-dataframe gzip apache-spark-dataset
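Yes: Spark decompresses gzip transparently based on the file extension, so the usual readers just work. One caveat worth knowing: a .gz file is not splittable, so it loads as a single partition. A sketch with a hypothetical path:

```scala
// Both readers handle the .gz extension automatically:
val lines = spark.read.textFile("data/events.csv.gz")  // Dataset[String]

val df = spark.read
  .option("header", "true")
  .csv("data/events.csv.gz")

// A gzip file is one partition; repartition if parallelism is needed:
val parallel = df.repartition(16)
```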
How to find the nearest neighbors of 1 Billion records with Spark?

Given 1 billion records containing the following information:

    ID  x1   x2    x3   ... x100
    1   0.1  0.12  1.3  ... -2.00
    2   -1   1.2   2    ... 3
    ...

For each ID above, I want to …

apache-spark pyspark spark-dataframe nearest-neighbor euclidean-distance
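At this scale an exact all-pairs search is infeasible, so the usual answer is approximate nearest neighbours via locality-sensitive hashing, which Spark ML ships for Euclidean distance since 2.1. A sketch, assuming a `points` DataFrame with a vector column named "features" (both hypothetical):

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)   // tuning parameter; illustrative value
  .setNumHashTables(3)    // more tables = better recall, more cost
  .setInputCol("features")
  .setOutputCol("hashes")

val model = lsh.fit(points)

// 5 approximate nearest neighbours of one query vector:
val key = Vectors.dense(0.1, 0.12, 1.3 /* ... x100 */)
model.approxNearestNeighbors(points, key, 5).show()
```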
Spark DataSet filter performance

I have been experimenting with different ways to filter a typed data set. It turns out the performance can be quite …

apache-spark apache-spark-sql spark-dataframe apache-spark-dataset
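The performance gap usually comes down to typed lambdas being opaque to the optimizer: they force deserialization of every row and block predicate pushdown, while Column expressions stay inside Catalyst. A sketch of the two styles (case class and path are hypothetical):

```scala
import spark.implicits._

case class Event(id: Long, value: Double)

val ds = spark.read.parquet("events").as[Event]

// Typed filter: a black-box Scala function; rows are deserialized
// and the predicate cannot be pushed down to the Parquet scan.
val typed = ds.filter(e => e.value > 0.5)

// Untyped filter: a Catalyst expression the optimizer can push down.
val untyped = ds.filter($"value" > 0.5)
```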
Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am relatively new to Spark and Scala. I am starting with the following dataframe (single column made out of …

scala apache-spark rdd spark-dataframe apache-spark-mllib
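For a single numeric column, this is a `map` over the underlying `RDD[Row]` that extracts the value and wraps it in an mllib vector. A sketch, assuming one Double column (the shape of the actual dataframe is truncated in the question):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Pattern-match each Row and build a dense mllib Vector from it:
val vectors: RDD[Vector] =
  df.rdd.map { case Row(x: Double) => Vectors.dense(x) }
```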
How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in spark (scala)

I'm using Spark 2.0. I have a column of my dataframe containing a WrappedArray of WrappedArrays of Float. An example of …

arrays scala casting spark-dataframe apache-spark-2.0
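`WrappedArray` is how Spark surfaces `ArrayType` values from a `Row`, so the conversion is a nested `toArray` rather than a cast. A sketch, assuming a `Row` named `row` with a hypothetical column name:

```scala
import scala.collection.mutable.WrappedArray

// Extract the nested WrappedArrays, then convert each level to Array:
val nested: Array[Array[Float]] = row
  .getAs[WrappedArray[WrappedArray[Float]]]("col")  // column name is illustrative
  .map(_.toArray)
  .toArray
```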