Apache Spark is an open-source, distributed data-processing engine written in Scala that provides users with a unified API and distributed datasets for both batch and streaming processing.
I was following this tutorial: http://spark.apache.org/docs/latest/quick-start.html I first created a collection from a …
Tags: sorting, apache-spark, word-count

I am trying to use SparkSession to convert the JSON data of a file to an RDD with Spark Notebook. I already …
Tags: json, apache-spark, spark-notebook

I have a Spark DataFrame whose take(5) top rows are as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), …
Tags: python, timestamp, apache-spark, pyspark

I read the Cluster Mode Overview and I still can't understand the different processes in the Spark Standalone cluster and the …
Tags: apache-spark, distributed-computing

I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark? …
Tags: scala, apache-spark, rdd

The goal of this question is to document the steps required to read and write data using JDBC connections in PySpark …
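The shape of the JDBC read/write calls in PySpark looks like the sketch below. The URL, table names, and credentials are placeholders, and the snippet assumes an existing `spark` session plus the matching JDBC driver jar on the classpath, so it is a template rather than runnable code:

```python
# Hypothetical connection details -- substitute your own host, table, and credentials,
# and make sure the driver jar (here PostgreSQL) is on Spark's classpath.
jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"
props = {"user": "spark_user", "password": "***", "driver": "org.postgresql.Driver"}

# Read: Spark pulls the table over JDBC into a DataFrame
df = spark.read.jdbc(url=jdbc_url, table="public.events", properties=props)

# Write: mode can be "append", "overwrite", "ignore", or "error"
df.write.jdbc(url=jdbc_url, table="public.events_copy", mode="append",
              properties=props)
```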
Tags: python, scala, apache-spark, apache-spark-sql, pyspark

I am trying to efficiently join two DataFrames, one of which is large and the second a bit smaller. …
Tags: apache-spark, dataframe, apache-spark-sql, apache-spark-1.4

I have 2 DataFrames as follows: I need a union like this: The unionAll function doesn't work because the number and the …
Tags: apache-spark, pyspark, apache-spark-sql

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is …
Tags: python, apache-spark, median, rdd, pyspark

I want to change the names of two columns using Spark's withColumnRenamed function. Of course, I can write: data = sqlContext.createDataFrame([(1,2), (3,4)], […
Tags: apache-spark, pyspark, apache-spark-sql, rename