Top "Apache-spark" questions

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing.

How does createOrReplaceTempView work in Spark?

I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of …

apache-spark apache-spark-sql spark-dataframe
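
A minimal PySpark sketch of the behavior being asked about: createOrReplaceTempView registers a session-scoped name for the DataFrame's lazy logical plan, so SQL against the view runs that same plan; nothing is materialized. Names below are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-view-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Registers (or replaces) a temporary view scoped to this SparkSession.
# No data is copied or cached; the view is just a name for df's plan.
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE id = 1").show()
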
How to find the count of null and NaN values for each column in a PySpark DataFrame efficiently?

import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2")) …

apache-spark pyspark apache-spark-sql pyspark-sql
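
A commonly used one-pass approach, sketched here under the assumption that every column is numeric (isnan is only defined for float/double-compatible types and would fail on string columns):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, 5.0), (1, 3, float('nan')), (1, 4, None)],
    ("session", "timestamp1", "id2"))

# For each column, emit a value only when it is NaN or NULL; count() then
# tallies those hits, producing one aggregate row of per-column counts.
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()
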
PySpark: replace strings in a Spark DataFrame column

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to …

python apache-spark pyspark
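
One quick way is regexp_replace from pyspark.sql.functions; the suffix pattern below is a deliberately crude, illustrative stand-in for real stemming:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("running",), ("jumped",), ("cat",)], ["word"])

# Strip a trailing "ing" or "ed"; non-matching rows pass through unchanged.
df.withColumn("stem", regexp_replace("word", "(ing|ed)$", "")).show()
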
Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship between the number of cores and the number of executors when running a Spark …

hadoop apache-spark yarn
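
The knobs in question, sketched with illustrative numbers rather than a recommendation. A widely cited rule of thumb is roughly five cores per executor, leaving a core and some memory on each node for the OS and YARN daemons:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sizing-demo")
         .config("spark.executor.cores", "5")       # concurrent tasks per executor
         .config("spark.executor.instances", "17")  # executors across the cluster
         .config("spark.executor.memory", "19g")    # heap per executor
         .getOrCreate())
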
How to check the Spark version

As titled, how do I know which version of Spark has been installed on CentOS? The current system has …

apache-spark cloudera-cdh
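
From a shell, spark-submit --version (or spark-shell --version) prints the installed version; inside a job, the running version is exposed as a property. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both report the version of the Spark runtime actually in use.
print(spark.version)
print(spark.sparkContext.version)
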
PySpark: Exception: Java gateway process exited before sending the driver its port number

I'm trying to run PySpark on my MacBook Air. When I try starting it up I get the error: Exception: …

java python macos apache-spark pyspark
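
One frequently reported cause is that PySpark cannot find a usable JVM. A hedged sketch of that fix; the JDK path below is hypothetical and must be replaced with your own:

import os

# Assumption: the failure comes from a missing or incompatible JAVA_HOME.
# Point it at a supported JDK *before* PySpark launches the Java gateway.
os.environ["JAVA_HOME"] = "/path/to/your/jdk"  # hypothetical path

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # if this prints, the gateway started successfully
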
Spark RDD to DataFrame in Python

I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and an example where the …

python apache-spark pyspark spark-dataframe
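
Two standard conversions, sketched with made-up data: toDF infers types from the records, while createDataFrame with Row objects (or an explicit schema) gives more control.

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "alice"), (2, "bob")])

# Option 1: name the columns and let Spark infer the types.
df1 = rdd.toDF(["id", "name"])

# Option 2: map to Row objects so field names travel with the data.
df2 = spark.createDataFrame(rdd.map(lambda t: Row(id=t[0], name=t[1])))

df1.show()
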
Spark read file from S3 using sc.textFile("s3n://...")

Trying to read a file located in S3 using spark-shell: scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.…

java scala apache-spark rdd hortonworks-data-platform
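
The question uses the legacy s3n scheme from the Scala shell; below is a PySpark sketch using the newer s3a connector instead. It assumes the hadoop-aws package is on the classpath, reaches through an internal handle to the Hadoop configuration, and uses placeholder bucket, key, and credential values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# _jsc is an internal handle; it exposes the Hadoop configuration Spark uses.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder

rdd = spark.sparkContext.textFile("s3a://myBucket/myFile1.txt")  # placeholder path
print(rdd.take(5))
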
How to convert a DataFrame back to normal RDD in pyspark?

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method that is not available on the DataFrame. All of …

python apache-spark pyspark
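
df.rdd hands back an RDD of Row objects; mapping those to key/value tuples yields a pair RDD, on which partitionBy is defined. The partitioner below is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])

# Rows -> plain (key, value) tuples, since partitionBy needs a pair RDD.
pair_rdd = df.rdd.map(lambda row: (row.id, row.val))

# Illustrative custom partitioner: route keys by modulo.
repartitioned = pair_rdd.partitionBy(4, lambda key: key % 4)
print(repartitioned.getNumPartitions())
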
Getting the count of records in a data frame quickly

I have a DataFrame with as many as 10 million records. How can I get a count quickly? df.count is …

scala apache-spark hadoop-streaming
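
df.count() always runs a full job, so the first count cannot be free; caching makes repeated counts cheap, and countApprox on the underlying RDD trades accuracy for latency. A sketch with generated stand-in data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)  # stand-in for a large DataFrame

df.cache()
print(df.count())  # first call scans the data and fills the cache
print(df.count())  # later calls are served from the cache

# If an estimate suffices, return early with a confidence-bounded answer.
print(df.rdd.countApprox(timeout=1000, confidence=0.95))
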