Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
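As a rough illustration of that abstraction, the sketch below builds an RDD from a local collection, transforms it lazily, and caches the result; the master setting and app name are placeholders for a local run.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        // "local[*]" and the app name are placeholder settings for a local run
        val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

        // Build an RDD from an in-memory collection and transform it lazily
        val squares = sc.parallelize(1 to 1000).map(x => x * x)

        // Keep computed partitions in memory; lost partitions are rebuilt from lineage
        squares.cache()
        println(squares.reduce(_ + _))

        sc.stop()
      }
    }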
I'm trying to add elements to a map while iterating the elements of an RDD. I'm not getting any errors, …
scala apache-spark rdd
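For context on why such code can run without errors yet leave the map unchanged: closures passed to RDD operations execute on the executors against serialized copies, so driver-side mutation is lost. A minimal sketch of the pitfall and one idiomatic alternative, with illustrative names and data:

    import scala.collection.mutable

    // Assumes an existing SparkContext `sc`
    val rdd = sc.parallelize(Seq("a" -> 1, "b" -> 2))

    // Pitfall: each executor mutates its own copy; the driver's map stays empty
    val broken = mutable.Map.empty[String, Int]
    rdd.foreach { case (k, v) => broken(k) = v }

    // Idiomatic: bring the data back to the driver (only safe for small results)
    val collected: Map[String, Int] = rdd.collectAsMap().toMap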
I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using …
scala apache-spark rdd kryo
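A hedged sketch of how those settings are commonly combined with Kryo serialization; the property names are standard Spark configuration keys, while the input path is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("compressed-rdds")
      // Compress serialized RDD partitions (costs some CPU, saves memory/disk)
      .set("spark.rdd.compress", "true")
      // Kryo is usually smaller and faster than Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // MEMORY_AND_DISK_SER stores partitions in serialized form and spills to disk
    val cached = sc.textFile("hdfs:///path/to/input")   // placeholder path
      .map(_.length)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)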
What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context …
java scala apache-spark rdd apache-spark-dataset
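As a rough sketch of how these entry points relate in Spark 2.x: a SparkSession wraps a SparkContext and a SQLContext, and a JavaSparkContext can be constructed from the same SparkContext rather than created separately.

    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.SparkSession

    // SparkSession (Spark 2.x+) is the unified entry point
    val spark = SparkSession.builder()
      .appName("contexts-sketch")
      .master("local[*]")                 // placeholder master for a local run
      .getOrCreate()

    // The older entry points are reachable from the session
    val sc  = spark.sparkContext          // low-level RDD API
    val sql = spark.sqlContext            // legacy DataFrame/SQL entry point
    val jsc = new JavaSparkContext(sc)    // Java-friendly wrapper around the same context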
When I run a Spark job using its example code BinaryClassification.scala with my own data, it always shows the …
scala out-of-memory apache-spark rdd
I am relatively new to Spark and Scala. I am starting with the following dataframe (single column made out of …
scala apache-spark rdd spark-dataframe apache-spark-mllib
I have an RDD like this:

    1 2 3
    4 5 6
    7 8 9

It is a matrix. Now I want to transpose the RDD like this:

    1 4 7
    2 5 8
    3 6 9

How …
scala apache-spark rdd
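One common way to transpose a row-oriented RDD is to index every element by row and column, regroup by column, and sort within each group. The sketch below assumes the rows are space-separated strings and that the resulting rows are small enough to group:

    // Assumes an existing SparkContext `sc`; the input rows are illustrative
    val rows = sc.parallelize(Seq("1 2 3", "4 5 6", "7 8 9"))

    val transposed = rows
      .map(_.split(" "))
      .zipWithIndex()                                   // attach the row index
      .flatMap { case (row, rowIdx) =>
        row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
      }
      .groupByKey()                                     // gather each output row (old column)
      .sortByKey()
      .map { case (_, cells) => cells.toSeq.sortBy(_._1).map(_._2).mkString(" ") }

    transposed.collect().foreach(println)               // 1 4 7 / 2 5 8 / 3 6 9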
I always use reduceByKey when I need to group data in RDDs, because it performs a map side reduce before …
apache-spark rdd
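For illustration, a word-count-style comparison: reduceByKey combines values within each partition before the shuffle, whereas groupByKey ships every individual value across the network. Names and data below are illustrative.

    // Assumes an existing SparkContext `sc`
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // Map-side combine: each partition pre-sums its ("a", n) pairs before shuffling
    val summed = pairs.reduceByKey(_ + _)

    // No map-side combine: every value is shuffled, then summed on the reduce side
    val grouped = pairs.groupByKey().mapValues(_.sum)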
I'm using a broadcast variable about 100 MB pickled in size, which I'm approximating with:

    >>> data = list(range(…

python apache-spark pyspark pickle rdd
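For reference, the broadcast mechanism the question relies on looks like this on the Scala side; the lookup set here is a small stand-in for the roughly 100 MB value, and all names are illustrative.

    // Assumes an existing SparkContext `sc`
    val lookup: Set[Int] = (1 to 1000000).toSet      // stand-in for the large read-only value

    // Ship one copy per executor instead of one copy per task
    val bc = sc.broadcast(lookup)

    val matches = sc.parallelize(1 to 10000000)
      .filter(x => bc.value.contains(x))
      .count()

    bc.destroy()                                      // release executor copies when done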
Attempting to run http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala from source. This line:

    val wordCounts = textFile.flatMap(…

scala apache-spark rdd
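The line in question follows the usual word-count pattern; a self-contained version of that pattern, with a placeholder input file, might look like:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

        val textFile = sc.textFile("README.md")        // placeholder input file

        // Split lines into words, pair each word with 1, and sum the counts per word
        val wordCounts = textFile
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        wordCounts.take(10).foreach(println)
        sc.stop()
      }
    }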
I tried to apply PCA to my data and then apply RandomForest to the transformed data. However, PCA.transform(data) …
scala apache-spark rdd pca apache-spark-mllib
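A rough sketch of that pipeline with the RDD-based MLlib API, assuming the data is already an RDD[LabeledPoint]; the number of principal components, the tree parameters, and the function name are placeholders, not the question's actual setup.

    import org.apache.spark.mllib.feature.PCA
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.rdd.RDD

    // Assumes `data: RDD[LabeledPoint]` already exists
    def pcaThenForest(data: RDD[LabeledPoint]) = {
      // Fit PCA on the feature vectors and project each point onto k components
      val pcaModel = new PCA(10).fit(data.map(_.features))
      val projected = data.map(p => p.copy(features = pcaModel.transform(p.features)))

      // Train a random forest classifier on the projected features (illustrative parameters)
      RandomForest.trainClassifier(
        projected,
        numClasses = 2,
        categoricalFeaturesInfo = Map[Int, Int](),
        numTrees = 20,
        featureSubsetStrategy = "auto",
        impurity = "gini",
        maxDepth = 5,
        maxBins = 32)
    }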