Apache Spark is an open-source distributed data-processing engine written in Scala. It provides a unified API and distributed datasets for both batch and stream processing.
We are reading data from a MongoDB collection. A column in the collection holds two different types (e.g. (bson.Int64, int) and (int, float)). …
[tags: apache-spark, pyspark, apache-spark-sql, databricks]

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) …
[tags: apache-spark, dataframe, apache-spark-sql]

I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). This has output in this format: [(0.0, 0.08482142857142858), (0.0, 0.11442786069651742), …] …
[tags: python, csv, apache-spark, pyspark, file-writing]

I have a sample application that reads CSV files into a dataframe. The dataframe can be stored to …
[tags: hadoop, apache-spark, hive, apache-spark-sql, spark-dataframe]

I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook: Last login: Fri Jan 8 12:52:04 on …
[tags: python, apache-spark, pyspark, pycharm, homebrew]

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we …
[tags: scala, apache-spark, rdd]

I'm trying to perform multiple operations in one line of code in PySpark, and I'm not sure whether that's possible for …
[tags: java, scala, apache-spark, pyspark, apache-spark-sql]

I built Spark 1.4 from the GitHub development master, and the build went through fine. But when I do a bin/…
[tags: python, python-3.x, apache-spark]

I just got access to Spark 2.0; I had been using Spark 1.6.1 up until this point. Can someone please help me …
[tags: python, sql, apache-spark, pyspark]

I am trying to filter a dataframe in PySpark using a list. I want to either filter based on the …
[tags: apache-spark, filter, pyspark, apache-spark-sql]