Top "apache-spark" questions

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing.

get datatype of column using pyspark

We are reading data from a MongoDB collection. The collection column has two different value types (e.g. (bson.Int64, int), (int, float)). …

apache-spark pyspark apache-spark-sql databricks
Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame Javadoc (Spark 1.6) …

apache-spark dataframe apache-spark-sql
How to write the resulting RDD to a csv file in Spark python

I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). This produces output in the format: [(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....] …

python csv apache-spark pyspark file-writing
Save Spark dataframe as dynamic partitioned table in Hive

I have a sample application that reads from csv files into a dataframe. The dataframe can be stored to …

hadoop apache-spark hive apache-spark-sql spark-dataframe
How to link PyCharm with PySpark?

I'm new to Apache Spark, and apparently I installed it with Homebrew on my MacBook: Last login: Fri Jan 8 12:52:04 on …

python apache-spark pyspark pycharm homebrew
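One common recipe is to point PyCharm's run-configuration environment variables at the Spark install so the IDE can import `pyspark` and `py4j`. The paths below are hypothetical; adjust them to wherever Homebrew placed Spark on your machine.

```shell
# Hypothetical install location; check `brew info apache-spark` for yours
export SPARK_HOME=/usr/local/opt/apache-spark/libexec
# Make pyspark and the bundled py4j importable from the IDE's interpreter
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-*-src.zip:$PYTHONPATH"
```

The same two variables can be entered in PyCharm under Run → Edit Configurations → Environment variables instead of a shell profile.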
(Why) do we need to call cache or persist on an RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we …

scala apache-spark rdd
aggregate function Count usage with groupBy in Spark

I'm trying to perform multiple operations in one line of code in PySpark, and I'm not sure if that's possible for …

java scala apache-spark pyspark apache-spark-sql
Apache Spark: How to use pyspark with Python 3

I built Spark 1.4 from the GitHub development master, and the build went through fine. But when I do a bin/…

python python-3.x apache-spark
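The usual fix is to tell Spark which interpreter to launch, via environment variables it reads at startup:

```shell
# Interpreter used by executors (and the driver, unless overridden)
export PYSPARK_PYTHON=python3
# Interpreter used by the driver / pyspark shell
export PYSPARK_DRIVER_PYTHON=python3
```

Both variables must point at compatible Python versions, or Spark will refuse to start workers.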
How to build a SparkSession in Spark 2.0 using pyspark?

I just got access to Spark 2.0; I had been using Spark 1.6.1 up until this point. Can someone please help me …

python sql apache-spark pyspark
pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the …

apache-spark filter pyspark apache-spark-sql