Non-greedy string regular expression matching

Victor K. picture Victor K. · May 16, 2013 · Viewed 11.9k times · Source

I'm pretty sure I'm missing something obvious here, but I cannot make R to use non-greedy regular expressions:

> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")                                         
     [,1]   
[1,] "aaaab"

Base functions behave the same way:

> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

I would expect the match to be just ab as per 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

Could someone please explain me what's going on?

Update. What's crazy is that in some other cases non-greedy patterns behave as expected:

> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
     [,1]                                          
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
     [,1]              
[1,] "<a href=\"abc\">"

Answer

flodel picture flodel · May 16, 2013

Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.

Expressions that match your patterns are searched from left to right. Yes, all of the following strings aaaab, aaab, aab, and ab are matches to your pattern, but aaaab being the one that starts the most to the left is the one that is returned.

So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:

str_match('xxx aaaab yyy', "a.*?y") 
#      [,1]     
# [1,] "aaaab y"

Here all of the strings aaaab y, aaaab yy, aaaab yyy matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.


So what can you do to catch that last ab? Use this:

str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"

How does it work? By adding a greedy pattern .* in the front, you are now forcing the process to put the last possible a into the captured group.