convert scientific notation in string format to numeric in spark dataframe

Nihal picture Nihal · Apr 30, 2019 · Viewed 13.5k times · Source
Day_Date,timeofday_desc,Timeofday_hour,Timeofday_minute,Timeofday_second,value
2017-12-18,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-19,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-20,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-21,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-22,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-23,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-24,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-25,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-26,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-27,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-28,03:52:08 AM,3,52,8,“1.779209040E+08”

I am trying to convert value column to 177920904

val df1 = df.withColumn("s", 'value.cast("Decimal(10,4)")).drop("value").withColumnRenamed("s", "value")

also tried casting value as Float, Double. Always get null as output

df1.select("value").show()


+-----------+
|   value   |
+-----------+
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|

df.printSchema

root
 |-- Day_Date: string (nullable = true)
 |-- timeofday_desc: string (nullable = true)
 |-- Timeofday_hour: string (nullable = true)
 |-- Timeofday_minute: string (nullable = true)
 |-- Timeofday_second: string (nullable = true)
 |-- value: string (nullable = true)

Answer

SCouto picture SCouto · Apr 30, 2019

Just need to cast it to decimal with enough room to fit the number.

Decimal is Decimal(precision, scale), so Decimal(10, 4) means 10 digits in total, 6 at the left of the dot, and 4 to the right, so the number does not fit in your Decimal type.

From the documentation

precision represents the total number of digits that can be represented

scale represents the number of fractional digits. This value must be less than or equal to precision. A scale of 0 produces integral values, with no fractional part

Since you don't want any number to the right, you can try this

df.withColumn("s", 'value.cast("Decimal(10,0)"))

If you want to keep 4 decimal digits, you can just change it to

df.withColumn("s", 'value.cast("Decimal(14,4)"))

INPUT

df.show
+---------------+
|          value|
+---------------+
|1.779209040E+08|
+---------------+

OUTPUT

scala> df.withColumn("s", 'value.cast("Decimal(10,0)")).show
+---------------+---------+
|          value|        s|
+---------------+---------+
|1.779209040E+08|177920904|
+---------------+---------+

FULL SOLUTION

Without dropping nor renamig

val df1 = df.withColumn("value", 'value.cast("Decimal(10,0)"))

FIX INPUT DATA

As I said in the comment, the problem is that your numbers contain some weird characters around them, you should remove it before casting

ORIGINAL

scala> df.show
+----------+--------------+--------------+----------------+----------------+-----------------+
|  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|            value|
+----------+--------------+--------------+----------------+----------------+-----------------+
|2017-12-18|   12:21:02 AM|             0|              21|               2| ?1.779209040E+08|
|2017-12-19|   12:21:02 AM|             0|              21|               2|?1.779209040E+08?|
|2017-12-20|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
|2017-12-21|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
|2017-12-22|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
|2017-12-23|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
|2017-12-24|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
|2017-12-25|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
|2017-12-26|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
|2017-12-27|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
|2017-12-28|   03:52:08 AM|             3|              52|               8| ?1.779209040E+08|
+----------+--------------+--------------+----------------+----------------+-----------------+

There are many ways to remove them, a quick one is with an UDF and a regular expression to remove everything but numbers, letters, dot, + and -

 def clean(input: String) = input.replaceAll("[^a-zA-Z0-9\\+\\.-]", "")
 val cleanUDF = udf(clean _ )
df.withColumn("value", cleanUDF($"value").cast(DecimalType(10,0))).show
+----------+--------------+--------------+----------------+----------------+---------+
|  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|    value|
+----------+--------------+--------------+----------------+----------------+---------+
|2017-12-18|   12:21:02 AM|             0|              21|               2|177920904|
|2017-12-19|   12:21:02 AM|             0|              21|               2|177920904|
|2017-12-20|   12:30:52 AM|             0|              30|              52|177920904|
|2017-12-21|   12:30:52 AM|             0|              30|              52|177920904|
|2017-12-22|   12:47:10 AM|             0|              47|              10|177920904|
|2017-12-23|   12:47:10 AM|             0|              47|              10|177920904|
|2017-12-24|   02:46:59 AM|             2|              46|              59|177920904|
|2017-12-25|   02:46:59 AM|             2|              46|              59|177920904|
|2017-12-26|   03:10:27 AM|             3|              10|              27|177920904|
|2017-12-27|   03:10:27 AM|             3|              10|              27|177920904|
|2017-12-28|   03:52:08 AM|             3|              52|               8|177920904|
+----------+--------------+--------------+----------------+----------------+---------+