REGEXP_REPLACE capturing groups

jatal picture jatal · Feb 18, 2015 · Viewed 10k times · Source

I was wondering if someone could help me understand how to use Hive's regexp_replace function to capture groups in the regex and use those groups in the replacement string.

I have an example problem I'm working through below that involves date-munging. In this example, my goal is to take a string date that is not compatible with SimpleDateFormat parsing and make a small adjustment to get it to be compatible. The date string (shown below) needs "GMT" prepended to the offset sign (+/-) in the string.

So, Given input:

  '2015-01-01 02:03:04 +0:00' 
  -or-
  '2015-01-01 02:03:04 -1:00' 

I want output:

  '2015-01-01 02:03:04 GMT+0:00'
  -or-
  '2015-01-01 02:03:04 GMT-1:00'

Here is a simple example of a statement that I 'thought' would work, but I get strange output.

Hive query:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT\1');

Actual result:

2015-01-01 02:03:04 GMT10:00

Note that the "\1" should output the matched group, but instead replaces the matched group with the number "1".

Can someone help me understand the right way to reference/output matched groups in the replacement string?

Thanks!

Answer

rchang picture rchang · Feb 18, 2015

Hive's supported notation (at least for 0.14, and I think I recall it being this way for 0.13.x as well) for regex backreferences seems to be $1 for capture group 1, $2 for capture group 2, etc. It looks like it is based upon (and may even be implemented by) the replaceAll method from the Matcher class. This is the germane portion of that documentation:

Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.

So I think what you want is this:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');

For example:

hive> select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT+0:00
Time taken: 0.072 seconds, Fetched: 1 row(s) 
hive> select regexp_replace('2015-01-01 02:03:04 -1:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT-1:00
Time taken: 0.144 seconds, Fetched: 1 row(s)