How To Remove non-alpha numeric, or non-numeric characters with Hive REGEXP_EXTRACT() Function

user1152532 picture user1152532 · Jan 28, 2014 · Viewed 69.2k times · Source

I've been trying to figure out how to remove multiple non-alphanumeric or non-numeric characters, or return only the numeric characters from a string. I've tried:

SELECT
regexp_extract('X789', '[0-9]', 0)
FROM
table_name

But it returns '7', not '789'.

I've also tried to remove non-numeric characters using NOT MATCH syntax ^((?!regexp).)*$:

SELECT
REGEXP_REPLACE('X789', '^((?![0-9]).)*$', '')
FROM
jav_test_ii

Can regexp_extract return multiple matches? What I'm really trying to do is clean my data to only contain numbers, or alphanumeric characters. This seems to help remove bad characters, but its not a range of characters like [0-9] is. regexp_replace(string, '�','')

EDIT: The query below was able to return '7789', which is exactly what I was looking for.

SELECT
regexp_replace("7X789", "[^0-9]+", "")
FROM
table_name

Answer

user557597 picture user557597 · Jan 28, 2014

See also this hive regexp_extract weirdness

I think regex_extract will only return the group number stated in the 3rd parameter.

regex_extract seems to only work on a line and then quit.

I don't know about the replace counterpart.

It might work on non-alphanum data though if you fed it something like this

REGEXP_REPLACE(error_code, '[^a-zA-Z0-9]+', '')


Also, for extract, see the link above and you can change it to

regexp_extract('X789', '[0-9]+', 0) for multiple numbers.

or

regexp_extract('XYZ789', '[a-zA-Z]+', 0) for multiple alpha's.