Top "Apache-spark" questions

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing.

What is the difference between map and flatMap and a good use case for each?

Can someone explain the difference between map and flatMap, and give a good use case for each? …

apache-spark
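In short: map produces exactly one output element per input element, while flatMap may produce zero or more and flattens the results into a single RDD. A minimal PySpark sketch (the RDD contents are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hi"])

# map: one output element per input element, so nesting is preserved
rdd.map(lambda line: line.split(" ")).collect()
# -> [['hello', 'world'], ['hi']]

# flatMap: each input may yield 0..n elements, flattened into one RDD,
# which makes it the natural fit for tokenizing lines into words
rdd.flatMap(lambda line: line.split(" ")).collect()
# -> ['hello', 'world', 'hi']
```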
Converting Pandas dataframe into Spark dataframe error

I'm trying to convert a Pandas DataFrame into a Spark one. DF head:
10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691 …

python pandas apache-spark spark-dataframe
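The usual shape of the fix: createDataFrame infers Spark types from the pandas dtypes, and mixed-type or NA-heavy columns are a common cause of the conversion error, so casting ambiguous columns to a single type first often resolves it. A sketch with made-up stand-in data:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Stand-in frame; "NA" strings mixed into numeric columns are a typical culprit
pdf = pd.DataFrame({"id": [10000001, 10000002],
                    "status": ["OK", "PA"],
                    "score": ["24", "NA"]})

# Casting ambiguous columns to one type lets Spark infer a clean schema
pdf["score"] = pdf["score"].astype(str)

sdf = spark.createDataFrame(pdf)
sdf.printSchema()
```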
Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes, each with some columns that the other lacks: from pyspark.sql.…

python apache-spark pyspark
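One common pattern: pad each frame with the columns it is missing (as nulls) so the schemas match, then union. A sketch assuming Spark 2.3+ for unionByName, with invented column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("concat-dfs").getOrCreate()

df1 = spark.createDataFrame([(1, "a")], ["id", "only_in_df1"])
df2 = spark.createDataFrame([(2, "b")], ["id", "only_in_df2"])

# Add each frame's missing columns as nulls so both schemas line up
for col in set(df2.columns) - set(df1.columns):
    df1 = df1.withColumn(col, F.lit(None))
for col in set(df1.columns) - set(df2.columns):
    df2 = df2.withColumn(col, F.lit(None))

# unionByName matches columns by name rather than position (Spark 2.3+)
combined = df1.unionByName(df2)
combined.show()
```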
How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I want to define a custom partitioner on DataFrames, in Scala, …

scala apache-spark dataframe apache-spark-sql partitioning
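The original question targets Spark 1.4, where custom partitioners were only available at the RDD level; on later versions, DataFrame.repartition with column arguments covers the common case of hash-partitioning by key. A PySpark sketch (the question itself is in Scala):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (1, "c")], ["key", "value"])

# Hash-partition rows by "key" into 8 partitions (column-based
# repartition is available from Spark 1.6 onward)
partitioned = df.repartition(8, "key")
print(partitioned.rdd.getNumPartitions())  # 8
```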
Spark Kill Running Application

I have a running Spark application that occupies all the cores, so my other applications won't be allocated any …

apache-spark yarn pyspark
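For reference, on YARN the usual approach is to find the application with `yarn application -list` and then stop it with `yarn application -kill <applicationId>`; on a standalone cluster, `spark-submit --kill <driverId> --master <master-url>` serves the same purpose.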
Convert date from String to Date format in Dataframes

I am trying to convert a column that is in String format to Date format using the to_date function …

apache-spark apache-spark-sql
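A minimal sketch of the to_date conversion, assuming Spark 2.2+ for the explicit format argument and an invented sample value:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("to-date-demo").getOrCreate()

df = spark.createDataFrame([("2016-08-26",)], ["date_str"])

# to_date parses the string into a DateType column; the format
# argument is available from Spark 2.2 onward
df = df.withColumn("date", F.to_date(F.col("date_str"), "yyyy-MM-dd"))
df.printSchema()  # date_str: string, date: date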
Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the …

python apache-spark pyspark apache-spark-sql
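The usual answer is to aggregate on the executors rather than collecting the whole column to the driver. A sketch with a made-up column name:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("max-demo").getOrCreate()

df = spark.createDataFrame([(1,), (5,), (3,)], ["A"])

# agg runs the max on the executors; only the single result row
# comes back to the driver
max_val = df.agg(F.max("A")).collect()[0][0]
print(max_val)  # 5
```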
importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask …

python apache-spark pyspark
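The root cause is usually that pyspark is not on PYTHONPATH outside the bundled shell. One widely used workaround is the third-party findspark package (pip install findspark), sketched here under the assumption that SPARK_HOME is set or discoverable:

```python
# findspark locates the Spark installation and appends pyspark (and
# py4j) to sys.path; pass the path explicitly if SPARK_HOME is unset,
# e.g. findspark.init("/opt/spark")
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName="shell-test")
print(sc.version)
```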
How to set Apache Spark Executor memory

How can I increase the memory available for Apache Spark executor nodes? I have a 2 GB file that is suitable …

memory apache-spark
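Executor memory is fixed when the JVM launches, so it has to be set before the context starts, via configuration or spark-submit flags. A sketch; note that in local mode it is the driver's heap that matters, and spark.driver.memory generally only takes effect when set on the command line (e.g. spark-submit --driver-memory 4g) rather than programmatically:

```python
from pyspark.sql import SparkSession

# Equivalent to: spark-submit --executor-memory 4g ...
spark = (SparkSession.builder
         .appName("memory-demo")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

print(spark.conf.get("spark.executor.memory"))  # 4g
```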
Joining Spark dataframes on the key

I have constructed two dataframes. How can we join multiple Spark dataframes? For example: PersonDf, ProfileDf with a common column …

scala apache-spark dataframe apache-spark-sql
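The question is in Scala, but the pattern is the same in PySpark: join on the common column name so the result keeps a single copy of the key. A sketch reusing the PersonDf/ProfileDf names from the question, with invented sample rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

person_df = spark.createDataFrame([(1, "Max")], ["personId", "name"])
profile_df = spark.createDataFrame([(1, "profile-1")], ["personId", "profileName"])

# Passing the column name (or a list of names) joins on that key and
# deduplicates the join column in the output
joined = person_df.join(profile_df, "personId", "inner")
joined.show()
```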