I have a column with a lot of inconsistent strings. Some of them contain a substring with a consistent pattern of '2015mmdd_AB_CD_EFG_(text)_(text)_HIJ'
which I would like to extract. I feel this is a cross over case of regexp and a substring command.
My best approach so far has been a rather ugly
substring(col_name, '........_.._.._..._.+_.+_...')
which does not end the output as desired, rahter the output is like '(...)_HIJ_blablabla'.
How can I effectively combine pattern recognition and substring selection in this case?
Assuming that 2015mmdd
actually means some kind of "date", so that the real data contains e.g. 20150713
the following will do:
substring(col_name, '[0-9]{8}_[A-Z]{2}_[A-Z]{2}_[A-Z]{3}_\([a-z]+\)_\([a-z]+\)')
This returns substrings that start with 8 numbers followed by an underscore, followed by two uppercase characters followed by an underscore followed by two uppercase characters, followed by an underscore followed by three uppercase characters, followed by an underscore followed by an opening parentheses followed by at lease one lowercase letter, followed by a closing parentheses, followed by an underscore, followed by an opening parentheses, followed by at least one lowercase character followed by a closing parentheses.
If 2015mmdd
indeed means 2015
followed by the string mmdd
then you need this:
substring(col_name, '[0-9]{4}mmdd_[A-Z]{2}_[A-Z]{2}_[A-Z]{3}_\([a-z]+\)_\([a-z]+\)')