I get the error below when running a simple join between two tables from the Hive command line. I call the tables a and b. Table a is a Hive internal table and table b is an external table (in Cassandra). Table a has only 1,610 rows and table b has about 8 million rows. In the actual production scenario, table a could grow to 100K rows. Here is my join, with table b as the last table in the join:
SELECT a.col1, a.col2, b.col3, b.col4 FROM a JOIN b ON (a.col1=b.col1 AND a.col2=b.col2);
Here is the error:
Total MapReduce jobs = 1
Execution log at: /tmp/pricadmn/.log
2014-04-09 07:15:36 Starting to launch local task to process map join; maximum memory = 932184064
2014-04-09 07:16:41 Processing rows: 200000 Hashtable size: 199999 Memory usage: 197529208 percentage: 0.212
2014-04-09 07:17:12 Processing rows: 300000 Hashtable size: 299999 Memory usage: 163894528 percentage: 0.176
2014-04-09 07:17:43 Processing rows: 400000 Hashtable size: 399999 Memory usage: 347109936 percentage: 0.372
...
...
...
2014-04-09 07:24:29 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 714454400 percentage: 0.766
2014-04-09 07:25:03 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 901427928 percentage: 0.967
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID:
Stage-5
Logs:
/u/applic/pricadmn/dse-4.0.1/logs/hive/hive.log
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
I am using DSE 4.0.1. Here are a few of my settings that might be relevant:
mapred.map.child.java.opts=-Xmx512M
mapred.reduce.child.java.opts=-Xmx512M
mapred.reduce.parallel.copies=20
hive.auto.convert.join=true
I increased mapred.map.child.java.opts to 1G, which got past a few more records before erroring out again, so it doesn't look like a good solution. I also changed the order of the tables in the join, but that didn't help. I saw the question "Hive Map join : out of memory Exception", but it didn't solve my issue.
To me it looks like Hive is trying to load the bigger table into memory during the local task phase, which confuses me. As I understand it, the second table (in my case table b) should be streamed in. Correct me if I am wrong. Any help in solving this issue is highly appreciated.
set hive.auto.convert.join = false;
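The setting above turns off automatic map-join conversion for the session, so Hive should fall back to a common (shuffle) join instead of building an in-memory hash table in a local task. A minimal sketch of the full session, reusing the query from the question. The STREAMTABLE hint is optional and is an addition for illustration, not something the original poster used:

```sql
-- Disable automatic conversion to a map join for this session,
-- so no local task tries to build a hash table in memory.
set hive.auto.convert.join = false;

-- Same join as in the question. STREAMTABLE(b) asks Hive to stream
-- table b through the reducers instead of buffering it; without the
-- hint, the last table in the join is the one that is streamed.
SELECT /*+ STREAMTABLE(b) */ a.col1, a.col2, b.col3, b.col4
FROM a JOIN b ON (a.col1 = b.col1 AND a.col2 = b.col2);
```

The log in the question suggests why a larger heap alone did not help: the hash table had already processed 1.7 million rows and used about 900 MB, far more than the 1,610 rows in table a, so the local task was most likely loading the ~8 million row table b. At that rate the full table would need several times the 932184064-byte limit shown in the log.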