Apache Spark SQL is Spark's module for "SQL and structured data processing"; Spark itself is a fast, general-purpose cluster computing system.
I am writing a User Defined Function which will take all the columns except the first one in a dataframe …
Tags: apache-spark, pyspark, spark-dataframe

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like …
Tags: apache-spark, spark-dataframe, partitioning, parquet

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") …
Tags: apache-spark, spark-dataframe, rdd, apache-spark-2.0, bigdata

Dataset<Tweet> ds = sc.read().json("/path").as(Encoders.bean(Tweet.class)); The Tweet class: long id, string …
Tags: apache-spark, spark-dataframe, apache-spark-dataset

I am using PySpark, and have set up my dataframe with two columns representing a daily asset price as follows: …
Tags: pyspark, spark-dataframe, window-functions, pyspark-sql

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category").agg(sum($"value").alias("TotalValue")).sort($"Hour".asc, $"TotalValue".…
Tags: scala, apache-spark, apache-spark-sql, spark-dataframe, parquet

I have a Spark 2.0 dataframe example with the following structure:

id, hour, count
id1, 0, 12
id1, 1, 55
..
id1, 23, 44
id2, 0, 12
id2, 1, 89
..
id2, 23, 34
etc. …
Tags: scala, apache-spark, apache-spark-sql, spark-streaming, spark-dataframe

I have a dataframe gi_man_df where group can be n:

+------------------+-----------------+--------+--------------+
|             group|           number|rand_int| …
Tags: numpy, apache-spark, pyspark, spark-dataframe, apache-spark-mllib

I need to pass a list into a UDF; the list will determine the score/category of the distance. For …
Tags: python, pyspark, spark-dataframe, user-defined-functions

I am using a custom function in PySpark to check a condition for each row in a Spark dataframe and …
Tags: apache-spark, pyspark, user-defined-functions, spark-dataframe, isnull