The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.
I want to filter a DataFrame according to the following conditions: first that d < 5, and second that the value of col2 is not equal …
Tags: sql, filter, pyspark, apache-spark-sql, pyspark-sql

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,…
Tags: apache-spark, apache-spark-sql, pyspark

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns. How would …
Tags: python, apache-spark, pyspark, apache-spark-sql, spark-dataframe

I'm using Spark 1.4.0-rc2 so I can use Python 3 with Spark. If I add export PYSPARK_PYTHON=python3 to my .…
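The file name in the question is truncated; the exports below would typically go in whatever startup file the shell sources (or in conf/spark-env.sh). A minimal config sketch for pointing PySpark at Python 3:

```shell
# Tell PySpark which interpreter to use for workers and, optionally,
# for the driver. "python3" assumes a python3 binary is on PATH.
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
```

The bin/pyspark and spark-submit scripts read these variables when launching.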
Tags: apache-spark, pyspark

I have a DataFrame in PySpark. Some of its numerical columns contain 'nan', so when I am reading the data and …
Tags: python, dataframe, pyspark

I've seen various people suggest that Dataframe.explode is a useful way to do this, but it results in more …
Tags: apache-spark, pyspark, apache-spark-sql, spark-dataframe, pyspark-sql

I installed Spark using the AWS EC2 guide, and I can launch the program fine using the bin/pyspark script …
Tags: python, scala, apache-spark, hadoop, pyspark

I am using PySpark to read a Parquet file like below: my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/…
Tags: python, pandas, pyspark, spark-dataframe

I would like to modify the cell values of a DataFrame column (Age) where it is currently blank, and I …
Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2")) …
Tags: apache-spark, pyspark, apache-spark-sql, pyspark-sql