How to extract a substring pattern in Postgresql

thenaturalist picture thenaturalist · Jul 13, 2015 · Viewed 22.1k times · Source

I have a column with a lot of inconsistent strings. Some of them contain a substring with a consistent pattern of '2015mmdd_AB_CD_EFG_(text)_(text)_HIJ' which I would like to extract. I feel this is a cross over case of regexp and a substring command.

My best approach so far has been a rather ugly

substring(col_name, '........_.._.._..._.+_.+_...')

which does not end the output as desired, rahter the output is like '(...)_HIJ_blablabla'.

How can I effectively combine pattern recognition and substring selection in this case?

Answer

a_horse_with_no_name picture a_horse_with_no_name · Jul 13, 2015

Assuming that 2015mmdd actually means some kind of "date", so that the real data contains e.g. 20150713 the following will do:

substring(col_name, '[0-9]{8}_[A-Z]{2}_[A-Z]{2}_[A-Z]{3}_\([a-z]+\)_\([a-z]+\)')

This returns substrings that start with 8 numbers followed by an underscore, followed by two uppercase characters followed by an underscore followed by two uppercase characters, followed by an underscore followed by three uppercase characters, followed by an underscore followed by an opening parentheses followed by at lease one lowercase letter, followed by a closing parentheses, followed by an underscore, followed by an opening parentheses, followed by at least one lowercase character followed by a closing parentheses.

If 2015mmdd indeed means 2015 followed by the string mmdd then you need this:

substring(col_name, '[0-9]{4}mmdd_[A-Z]{2}_[A-Z]{2}_[A-Z]{3}_\([a-z]+\)_\([a-z]+\)')