I am running a sum function of one hive table in Hue, and get a return value of NaN.
Here is my code:
select sum(v1) from hivedb.tb1;
I don't know why it is giving me a NaN result. I checked if any of my v1 values are null:
select * from hivedb.tb1 where v1 is null;
, and it turns out no record has null value. The table has 100 million rows, so I can not do a manual check for each record.
Any help is appreciated. Thank you in advance!
UPDATE 1 I manually screened the first 1000 rows, and luckily spotted some abnormal values of NaN in tb1. It is resulted from some rounding error from the previous steps. So my question 1 is probably answered. Please feel free to comment on it, if you think there could be other reasons.
I still don't know how to use an efficient way to spot the rows with NaN values. So I am still looking forward to any answers to my question #2. Please feel free to share. I appreciate your help.
UPDATE 2 The problem is solved with help in the accepted answer below, in the discussion section. There are multiple ways to deal with it.
Hive relies on Java (plus SQL-specific semantics for Null and friends), and Java honors the IEEE standard for number semantics. Which means that... NaN is tricky.
Quoting that post...
(Float.NaN == Float.NaN)
always returns false.
In fact, if you look at the JDK implementation ofFloat.isNaN()
, a number is not-a-number if it is not equal to itself (which makes sense because a number should be equal to itself).
The same holds for Double.NaN
So, there is no point in showing you how to use the (undocumented) Hive function called reflect2
, which allows you to invoke raw Java methods on Hive columns, i.e.
where v1 is not null and not reflect2(v1, "isNaN")
...because -- in theory -- you can simply state:
where v1 is not null and v1=v1
Disclaimer -- I have seen cases where the Hive optimizer makes aggressive "optimizations" and produces wrong results.
In other words, if the simple v1=v1
clause does not filter out the NaN values as expected, then look into reflect2
...
Edit -- indeed, the optimizer appears to ignore the v1=v1
clause in some versions of Hive (see comments) so a more devious formula is necessary:
v1 +1.0 > v1
should work... except when rounding errors make either abs(v1)
<< 1 or abs(v1)
>> 1v1 =0.0
In the end, the most robust approach appears to try cast(v1 as String) <>'NaN'
(because all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetical sense).
reflect2
-- you can see that it is indeed not mentioned in the official Hive doc, while reflect
is mentioned (and even has a specific Wiki entry). But it has been implemented as early as Hive V0.11 cf. Hive-4025
Edit -- Java "reflection" is now disabled by default for ODBC / JDBC / Hue connections (see comments), and cannot be re-enabled when using security plug-ins such as ranger or Sentry. So its usage is restricted to the (deprecated) hive
CLI.