Apache Spark is an open-source distributed data-processing engine written in Scala that provides users with a unified API and distributed datasets for both batch and streaming processing.
I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of …
apache-spark apache-spark-sql spark-dataframe

import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2")) …
apache-spark pyspark apache-spark-sql pyspark-sql

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to …
python apache-spark pyspark

I'm trying to understand the relationship between the number of cores and the number of executors when running a Spark …
hadoop apache-spark yarn

As titled, how do I know which version of Spark is installed on CentOS? The current system has …
apache-spark cloudera-cdh

I'm trying to run pyspark on my MacBook Air. When I try starting it up I get the error: Exception: …
java python macos apache-spark pyspark

I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and example where the …
python apache-spark pyspark spark-dataframe

Trying to read a file located in S3 using spark-shell:

    scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.…
java scala apache-spark rdd hortonworks-data-platform

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method that is not available on the DataFrame. All of …
python apache-spark pyspark

I have a DataFrame with as many as 10 million records. How can I get a count quickly? df.count is …
scala apache-spark hadoop-streaming