Why does full outer join in HIVE gives weird result when one of the join fields is missing?

makansij picture makansij · Jun 14, 2016 · Viewed 11.1k times · Source

I'm comparing the behavior between SQL engines. Oracle has the behavior I would expect from a SQL engine for full outer joins:

Oracle

CREATE TABLE sql_test_a 
( 
    ID         VARCHAR2(4000 BYTE), 
    FIRST_NAME VARCHAR2(200 BYTE), 
    LAST_NAME  VARCHAR2(200 BYTE) 
); 

CREATE TABLE sql_test_b 
( 
    NUM         VARCHAR2(4000 BYTE), 
    FIRST_NAME VARCHAR2(200 BYTE), 
    LAST_NAME  VARCHAR2(200 BYTE) 
); 

INSERT INTO sql_test_a (ID, FIRST_NAME, LAST_NAME) VALUES ('1', 'John', 'Snow'); 

INSERT INTO sql_test_a (ID, FIRST_NAME, LAST_NAME) VALUES ('2', 'Mike', 'Tyson'); 


INSERT INTO sql_test_b (NUM, FIRST_NAME, LAST_NAME) VALUES ('20', 'Mike', 'Tyson'); 

When I execute the following, it gives me the expected result. The resulting table contains two rows, with one of the rows containing NULL for the NUM field, because there is no john snow in the table sql_test_b.

SELECT A.FIRST_NAME, A.LAST_NAME, A.ID, B.NUM
FROM
SQL_TEST_A A
FULL OUTER JOIN
SQL_TEST_B B
ON 
A.FIRST_NAME = B.FIRST_NAME 
AND
A.LAST_NAME = B.LAST_NAME;

You can test the sql script here: http://sqltest.net/

HIVE

In HIVE, however, if you were to try the same thing, the full outer join results in a table with two rows. The row that should be the "John Snow" row contains NULL for the fields FIRST_NAME, LAST_NAME, and NUM. The 1 is filled in for ID, but that's it.

Why such weird behavior in HIVE? Is this a bug? Or am I missing something...because Oracle 11g seems to handle this much better. Thanks.

Answer

Marco99 picture Marco99 · Jul 11, 2016

I could not simulate the result reported by @Candic3

I used the below statements along with the same "select" query as in the question.

CREATE TABLE IF NOT EXISTS sql_test_a (ID String, FIRST_NAME String, LAST_NAME  String) COMMENT 'sql_test_a'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;


CREATE TABLE IF NOT EXISTS sql_test_b (NUM String, FIRST_NAME String, LAST_NAME  String) COMMENT 'sql_test_b'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;


INSERT INTO sql_test_a VALUES ('1', 'John', 'Snow');
INSERT INTO sql_test_a VALUES ('2', 'Mike', 'Tyson');
INSERT INTO sql_test_b VALUES ('20', 'Mike', 'Tyson');

SELECT A.FIRST_NAME, A.LAST_NAME, A.ID, B.NUM
FROM
SQL_TEST_A A
FULL OUTER JOIN
SQL_TEST_B B
ON 
A.FIRST_NAME = B.FIRST_NAME 
AND
A.LAST_NAME = B.LAST_NAME;

Please find the result attached.

However, select query would return NULL due to unnoticed minor mistakes like data-type mismatch between the DDL and the actual data (say, from flat files) or mismatch among the delimiter mentioned in the DDL and the ones in the actual data.Query_result