I am not able to understand what the DISTRIBUTE BY clause does in Hive. I know the definition says that DISTRIBUTE BY (city) would send each city to a different reducer, but that is not what I am seeing. Consider the following data:
Say we have a table called data with columns username and amount:
+----------+--------+
| username | amount |
+----------+--------+
| user_1 | 25 |
+----------+--------+
| user_1 | 53 |
+----------+--------+
| user_1 | 28 |
+----------+--------+
| user_1 | 50 |
+----------+--------+
| user_2 | 20 |
+----------+--------+
| user_2 | 50 |
+----------+--------+
| user_2 | 10 |
+----------+--------+
| user_2 | 5 |
+----------+--------+
Now if I run:
SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)
Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?
The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
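To make the guarantee concrete, here is a minimal Python sketch of hash-style partitioning. It is an illustrative model only, not Hive's implementation: the names `assign_reducer` and `distribute` are hypothetical, and `zlib.crc32` stands in for whatever hash function the engine really uses. The reducer count is an input to the partitioning step, not something DISTRIBUTE BY decides.

```python
import zlib

def assign_reducer(key: str, num_reducers: int) -> int:
    """Illustrative stand-in for a hash partitioner: hash(key) mod num_reducers."""
    # crc32 is an assumption chosen because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def distribute(rows, num_reducers):
    """Place each (username, amount) row on the reducer its key hashes to."""
    reducers = {i: [] for i in range(num_reducers)}
    for username, amount in rows:
        reducers[assign_reducer(username, num_reducers)].append((username, amount))
    return reducers

rows = [("user_1", 25), ("user_1", 53), ("user_1", 28), ("user_1", 50),
        ("user_2", 20), ("user_2", 50), ("user_2", 10), ("user_2", 5)]

# With a single reducer, every row lands on reducer 0 regardless of its key.
single = distribute(rows, 1)

# With more reducers, all rows of one user still share a reducer, but two
# different users may hash to the same reducer as well.
several = distribute(rows, 4)
```

In this model, the same key always maps to the same reducer, while distinct keys are free to collide. That matches the wording of the documentation: "same key, same reducer", not "one reducer per key".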
A question by the OP:
Then what is the point of DISTRIBUTE BY? There's no guarantee that each city would go to a different reducer, so why use it?
For 2 reasons:
In the early days of Hive, DISTRIBUTE BY, SORT BY and CLUSTER BY were used to process data in a way that today is done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)
You might want to stream your data through a script (Hive "Transform") and have the script process the data in certain groups and in a certain order. For that you can use DISTRIBUTE BY + SORT BY, or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that the whole group ends up in the same reducer. With SORT BY it is guaranteed that all the records of a group arrive contiguously.
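As a rough illustration of why both guarantees matter to a streaming script, here is a minimal Python sketch of the kind of logic such a script might contain. It assumes rows arrive as tab-separated `username` and `amount` fields, one row per line, which is how Hive passes rows to a transform script by default. The function name `sum_by_user` is hypothetical.

```python
from itertools import groupby

def sum_by_user(lines):
    """Yield 'username<TAB>total' for each run of consecutive lines sharing a username."""
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only merges *adjacent* equal keys, so this is correct only when
    # all rows of a user reach this process together and contiguously.
    for username, group in groupby(parsed, key=lambda fields: fields[0]):
        total = sum(int(fields[1]) for fields in group)
        yield f"{username}\t{total}"

# In a real transform script the input would be sys.stdin and each result
# would be printed; here a small list stands in for the stream.
contiguous = list(sum_by_user(["user_1\t25\n", "user_1\t53\n", "user_2\t20\n"]))
interleaved = list(sum_by_user(["user_1\t25\n", "user_2\t20\n", "user_1\t53\n"]))
```

With contiguous input, the sketch emits one total per user. With interleaved input, it emits a separate partial total each time the username changes, which is the situation SORT BY prevents. Without DISTRIBUTE BY, rows of the same user could also be split across different script instances.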