Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing.
I am trying to read a CSV file into a DataFrame. I know what the schema of my DataFrame should …
scala apache-spark dataframe apache-spark-sql spark-csv
Suppose I give three file paths to a Spark context to read and each file has a schema in the …
scala csv apache-spark
I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value") as "TotalValue") .sort($"Hour".asc, $"TotalValue".…
sql scala apache-spark dataframe apache-spark-sql
I am using the randomSplit function to get a small portion of a DataFrame to use for dev purposes and I …
scala apache-spark
I am trying to run a Spark program where I have multiple jar files. If I had only one jar …
submit apache-spark classpath
I want to check the Spark version in CDH 5.7.0. I have searched on the internet but am not able to understand. …
apache-spark hadoop cloudera
In Spark version 1.2.0 one could use subtract with 2 SchemaRDDs to end up with only the different content from the first …
apache-spark dataframe rdd
How do I give more column conditions when joining two DataFrames? For example, I want to run the following: val Lead_…
apache-spark apache-spark-sql rdd
I am using CDH 5.2. I am able to use spark-shell to run the commands. How can I run the file(…
scala apache-spark cloudera-cdh cloudera-manager
In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
apache-spark distributed-computing rdd