In MySQL, I can select from two tables without a join, like so:
SELECT t1.value, t2.value FROM t1, t2 WHERE (t1.value = t2.value);
Hive, on the other hand, will accept "FROM t1 join t2" but not "FROM t1, t2".)
Does anyone have any ideas about how to optimize a query like
SELECT t1.value, t2.value FROM t1 join t2 WHERE (t1.value = t2.value);
in any other way?
(Also, why does switching from "select from t1 join t2" to "select from t1, t2" in MySQL optimize queries anyway?)
Why don't you want to use a join? Selecting from two tables and requiring some equalities between them results in an inner join.
Also, with the join you are using, you are creating the cartesian product of both tables and eliminate those records, where t1.value=t2.value
. Directly using an inner join would be more efficient:
SELECT t1.value, t2.value FROM t1 JOIN t2 ON t1.value=t2.value;
If one of your tables is remarkable small, you could do a map-side join. The small table would be cached in the memory while the larger one can be streamed through and no reduce step would be necessary. To activate the map-side join you have to execute set hive.auto.convert.join=true;
before executing the query. The threshold for the maximum table size in bytes for map-side joins is set in the property hive.mapjoin.smalltable.filesize
.
(Source: Edward Capriolo, Dean Wampler, and Jason Rutherglen. Programming Hive. O’Reilly, 2012.)