I am a novice to spark, and I want to transform below source dataframe (load from JSON file):
+--+-----+-----+
|A |count|major|
+--+-----+-----+
| a| 1| m1|
| a| 1| m2|
| a| 2| m3|
| a| 3| m4|
| b| 4| m1|
| b| 1| m2|
| b| 2| m3|
| c| 3| m1|
| c| 4| m3|
| c| 5| m4|
| d| 6| m1|
| d| 1| m2|
| d| 2| m3|
| d| 3| m4|
| d| 4| m5|
| e| 4| m1|
| e| 5| m2|
| e| 1| m3|
| e| 1| m4|
| e| 1| m5|
+--+-----+-----+
Into below result dataframe:
+--+--+--+--+--+--+
|A |m1|m2|m3|m4|m5|
+--+--+--+--+--+--+
| a| 1| 1| 2| 3| 0|
| b| 4| 2| 1| 0| 0|
| c| 3| 0| 4| 5| 0|
| d| 6| 1| 2| 3| 4|
| e| 4| 5| 1| 1| 1|
+--+--+--+--+--+--+
Here is the Transformation Rule:
The result dataframe is consisted with A + (n major columns)
where the major
columns names are specified by:
sorted(src_df.map(lambda x: x[2]).distinct().collect())
The result dataframe contains m
rows where the values for A
column are provided by:
sorted(src_df.map(lambda x: x[0]).distinct().collect())
The value for each major column in result dataframe is the value from source dataframe on the corresponding A
and major
(e.g. the count in Row 1 in source dataframe is mapped to the box
where A
is a
and column m1
)
The combinations of A
and major
in source dataframe has no duplication (please consider it a primary key on the two columns in SQL)
Using zero323's dataframe,
df = sqlContext.createDataFrame([
("a", 1, "m1"), ("a", 1, "m2"), ("a", 2, "m3"),
("a", 3, "m4"), ("b", 4, "m1"), ("b", 1, "m2"),
("b", 2, "m3"), ("c", 3, "m1"), ("c", 4, "m3"),
("c", 5, "m4"), ("d", 6, "m1"), ("d", 1, "m2"),
("d", 2, "m3"), ("d", 3, "m4"), ("d", 4, "m5"),
("e", 4, "m1"), ("e", 5, "m2"), ("e", 1, "m3"),
("e", 1, "m4"), ("e", 1, "m5")],
("a", "cnt", "major"))
you could also use
reshaped_df = df.groupby('a').pivot('major').max('cnt').fillna(0)