Top "Partitioning" questions

Partitioning is a performance strategy in which a potentially very large set of data is divided into a number of smaller groups that can be stored or processed separately.
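As a rough illustration of the idea, the sketch below splits a list of records into a fixed number of buckets by hashing a key. The names `partition_by_key`, `rows`, and `num_partitions` are hypothetical and chosen for this example only; real systems such as Spark, SQL Server, or Cassandra each use their own partitioning schemes.

```python
from collections import defaultdict


def partition_by_key(records, key, num_partitions):
    """Group records into at most num_partitions buckets by hashing key(record)."""
    buckets = defaultdict(list)
    for record in records:
        # Records with the same key always land in the same bucket.
        bucket_id = hash(key(record)) % num_partitions
        buckets[bucket_id].append(record)
    return dict(buckets)


# Hypothetical sample data: ten rows keyed by an integer id.
rows = [{"id": i, "value": i * i} for i in range(10)]
parts = partition_by_key(rows, key=lambda r: r["id"], num_partitions=3)
```

Each bucket can then be handled independently, which is where the performance benefit comes from when the buckets are spread across files, disks, or machines.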

Write Spark dataframe as CSV with partitions

I'm trying to write a DataFrame in Spark to an HDFS location, and I expect that if I'm adding the …

csv apache-spark apache-spark-sql partitioning
Undo Table Partitioning

I have a table 'X' and did the following: CREATE PARTITION FUNCTION PF1(INT) AS RANGE LEFT FOR VALUES (1, 2, 3, 4) CREATE …

sql-server database sql-server-2008 partitioning
How to find all partitions of a set

I have a set of distinct values. I am looking for a way to generate all partitions of this set, …

c# algorithm set partitioning
Which part of the CAP theorem does Cassandra sacrifice and why?

There is a great talk here about simulating partition issues in Cassandra with Kingsbury's Jepsen library. My question is - …

cassandra partitioning high-availability consistency cap-theorem
Does Spark know the partitioning key of a DataFrame?

I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid …

apache-spark partitioning window-functions
Partitioning data set in R based on multiple classes of observations

I'm trying to partition a data set that I have in R, 2/3 for training and 1/3 for testing. I have one …

r random partitioning
Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, …

apache-spark spark-dataframe distributed-computing partitioning bigdata
In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so it's usually customized in …

apache-spark partitioning hadoop-partitioning
Avoid performance impact of a single partition mode in Spark window functions

My question is triggered by the use case of calculating the differences between consecutive rows in a Spark DataFrame. For …

apache-spark pyspark apache-spark-sql partitioning window-functions