Top "Apache-spark" questions

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing.

Provide schema while reading csv file as a dataframe

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should …

scala apache-spark dataframe apache-spark-sql spark-csv
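
For the question above, a minimal sketch of supplying a schema up front, assuming Spark 2.x with a SparkSession named spark and a hypothetical two-column layout (the real column names and file path would come from your data):

    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Hypothetical schema; replace with the columns your CSV actually has.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    val df = spark.read
      .option("header", "true")   // keep if the file has a header row
      .schema(schema)             // supply the schema instead of inferring it
      .csv("/path/to/file.csv")   // placeholder path
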
How do I skip a header from CSV files in Spark?

Suppose I give three files paths to a Spark context to read and each file has a schema in the …

scala csv apache-spark
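
One common way to drop header lines when reading raw text through the SparkContext, sketched here under the assumption that all the files share the same header row:

    // `sc` is the SparkContext; the glob path is a placeholder.
    val rdd = sc.textFile("/path/to/files/*.csv")
    val header = rdd.first()                        // the header line of the first file
    val data = rdd.filter(line => line != header)   // drop every identical header line
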
How to select the first row of each group?

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value") as "TotalValue") .sort($"Hour".asc, $"TotalValue".…

sql scala apache-spark dataframe apache-spark-sql
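
A sketch of one standard answer, a window function that ranks rows within each Hour and keeps the top one; it assumes the aggregated DataFrame from the excerpt is bound to a val named aggregated with columns Hour, Category and TotalValue:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val w = Window.partitionBy(col("Hour")).orderBy(col("TotalValue").desc)

    val firstPerGroup = aggregated
      .withColumn("rn", row_number().over(w))   // rank rows within each Hour
      .filter(col("rn") === 1)                  // keep only the top row per group
      .drop("rn")
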
Is there a way to take the first 1000 rows of a Spark Dataframe?

I am using the randomSplit function to get a small portion of a dataframe to use for dev purposes and I …

scala apache-spark
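
If the goal is simply the first 1000 rows rather than a random sample, a sketch assuming a DataFrame named df:

    val first1000Df = df.limit(1000)    // still a (distributed) DataFrame
    val first1000Arr = df.take(1000)    // an Array[Row] collected to the driver
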
spark-submit: add multiple jars to the classpath

I am trying to run a Spark program where I have multiple jar files; if I had only one jar …

submit apache-spark classpath
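
One way to pass several jars, sketched as a spark-submit command line with placeholder class and jar names; --jars takes a comma-separated list:

    # Placeholder class and jar names; the extra jars are comma-separated.
    spark-submit --class com.example.Main --jars lib/dep1.jar,lib/dep2.jar target/app.jar
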
How to check Spark Version

I want to check the Spark version in CDH 5.7.0. I have searched on the internet but am not able to figure it out. …

apache-spark hadoop cloudera
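
A couple of ways to check the version, sketched assuming an available SparkContext sc (or a SparkSession named spark on 2.x); from a terminal, spark-shell --version and spark-submit --version print it as well:

    println(sc.version)       // from any Spark application or the shell
    // println(spark.version) // Spark 2.x SparkSession equivalent
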
Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemaRDDs to end up with only the different content from the first …

apache-spark dataframe rdd
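
On DataFrames the analogous operation is except, sketched here for two hypothetical DataFrames df1 and df2 that share a schema:

    // Rows of df1 that do not appear in df2 (set difference, like RDD.subtract).
    val onlyInDf1 = df1.except(df2)
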
Spark specify multiple column conditions for dataframe join

How do I specify multiple column conditions when joining two DataFrames? For example, I want to run the following: val Lead_…

apache-spark apache-spark-sql rdd
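
A sketch of joining on several columns at once, using hypothetical DataFrames leads and utm with made-up key column names (the question's own val Lead_… is truncated above):

    // Combine equality conditions with && in the join expression.
    val joined = leads.join(utm,
      leads("source") === utm("source") &&
      leads("campaign") === utm("campaign"),
      "inner")

    // When the key columns share names, a Seq avoids duplicate columns in the result:
    // val joined = leads.join(utm, Seq("source", "campaign"))
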
Spark: how to run a Spark file from spark-shell

I am using CDH 5.2. I am able to use spark-shell to run the commands. How can I run the file(…

scala apache-spark cloudera-cdh cloudera-manager
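
Two common ways to run a file of Spark/Scala statements without packaging it for spark-submit, sketched with a placeholder path:

    // From inside spark-shell: load and execute a script.
    :load /path/to/script.scala

    // Or start the shell with the script as an init file:
    // spark-shell -i /path/to/script.scala
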
What is the difference between cache and persist?

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

apache-spark distributed-computing rdd
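
A sketch of the relationship between the two, assuming a SparkContext sc: cache() is just persist() with the default storage level, which is MEMORY_ONLY for RDDs, while persist() also accepts an explicit level:

    import org.apache.spark.storage.StorageLevel

    val a = sc.parallelize(1 to 100)
    a.cache()                                // same as a.persist(StorageLevel.MEMORY_ONLY)

    val b = sc.parallelize(1 to 100)
    b.persist(StorageLevel.MEMORY_AND_DISK)  // spills partitions to disk if memory is tight
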