Top "Spark-dataframe" questions

Apache Spark SQL is Spark's module for "SQL and structured data processing"; Spark itself is a fast, general-purpose cluster-computing engine.

PySpark: Pass multiple columns to a UDF

I am writing a User Defined Function which will take all the columns except the first one in a dataframe …

apache-spark pyspark spark-dataframe
How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like …

apache-spark spark-dataframe partitioning parquet
Spark Parquet partitioning: large number of files

I am trying to leverage spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") …

apache-spark spark-dataframe rdd apache-spark-2.0 bigdata
printSchema() in Apache Spark

Dataset<Tweet> ds = sc.read().json("/path").as(Encoders.bean(Tweet.class)); Tweet class :- long id string …

apache-spark spark-dataframe apache-spark-dataset
Applying a Window function to calculate differences in pySpark

I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows: …

pyspark spark-dataframe window-functions pyspark-sql
Spark DataFrame: How to efficiently split a DataFrame for each group based on same column values

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value").alias("TotalValue")) .sort($"Hour".asc,$"TotalValue".…

scala apache-spark apache-spark-sql spark-dataframe parquet
Spark DataFrame: does groupBy after orderBy maintain that order?

I have a Spark 2.0 dataframe example with the following structure: id, hour, count id1, 0, 12 id1, 1, 55 .. id1, 23, 44 id2, 0, 12 id2, 1, 89 .. id2, 23, 34 etc. …

scala apache-spark apache-spark-sql spark-streaming spark-dataframe
Extracting a NumPy array from a PySpark DataFrame

I have a dataframe gi_man_df where group can be n: +------------------+-----------------+--------+--------------+ | group | number|rand_int| …

numpy apache-spark pyspark spark-dataframe apache-spark-mllib
PySpark - Pass list as parameter to UDF

I need to pass a list into a UDF, the list will determine the score/category of the distance. For …

python pyspark spark-dataframe user-defined-functions
Check if a row value is null in a Spark DataFrame

I am using a custom function in pyspark to check a condition for each row in a spark dataframe and …

apache-spark pyspark user-defined-functions spark-dataframe isnull