Regex optional capturing group?

forsajt · Feb 28, 2015 · Viewed 51.7k times

After hours of searching I decided to ask this question. Why doesn't this regular expression ^(dog).+?(cat)? work the way I expect (i.e. capture the first dog, and cat if there is one)? What am I missing here?

dog, cat
dog, dog, cat
dog, dog, dog

Answer

Sergey Kalinichenko · Feb 28, 2015

The reason that you do not get an optional cat after a reluctantly quantified .+? is that it is both optional and non-anchored: the engine is not forced to make that match. After dog, .+? needs only a single character, (cat)? is then allowed to match nothing, and since the pattern ends there, the match succeeds without ever reaching cat.
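To make the early stop visible, here is a minimal sketch that prints the whole match next to group 2 for the original, unanchored pattern. The class name UnanchoredDemo and the helper describe are made up for illustration; the three inputs are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnanchoredDemo {
    // The pattern from the question: nothing after (cat)? forces .+? to grow.
    static final Pattern UNANCHORED = Pattern.compile("^(dog).+?(cat)?");

    // Returns "<whole match>|<group 2>" so the extent of the match is visible.
    static String describe(String input) {
        Matcher m = UNANCHORED.matcher(input);
        return m.find() ? m.group() + "|" + m.group(2) : "no match";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
            System.out.println(describe(s));
        }
    }
}
```

For all three inputs this should print dog,|null: the match ends right after the comma, so the engine never gets as far as cat.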

If you anchor the cat at the end of the string, i.e. use ^(dog).+?(cat)?$, you would get the capture, though:

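import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The trailing $ forces .+? to keep expanding until (cat)?$ can match.
Pattern p = Pattern.compile("^(dog).+?(cat)?$");
for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
    Matcher m = p.matcher(s);
    if (m.find()) {
        // group(2) is null when the optional (cat) group did not participate
        System.out.println(m.group(1) + " " + m.group(2));
    }
}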

This prints (demo 1)

dog cat
dog cat
dog null

Do you happen to know how to deal with it in case there's something after cat?

You can deal with it by constructing a trickier expression that matches anything except cat, like this:

^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?

Now the cat can appear anywhere in the string without an anchor (demo 2).
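To check how this behaves when text follows cat, here is a minimal sketch. The class name CatAnywhereDemo, the helper secondGroup, and the sample strings are made up for illustration. It also tries a variant that is my assumption rather than part of the answer: a negative lookahead, (?:(?!cat).)+, which consumes a character only where cat does not begin. The two patterns agree on ordinary input, but differ on input such as "dog, ccat", where c[^a] swallows the c that starts cat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CatAnywhereDemo {
    // The alternation-based pattern from the answer.
    static final Pattern ALTERNATION =
            Pattern.compile("^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?");

    // Assumed alternative (not from the answer): step forward one character
    // at a time, but only at positions where "cat" does not start.
    static final Pattern LOOKAHEAD =
            Pattern.compile("^(dog)(?:(?!cat).)+(cat)?");

    // Returns group 2 as text ("null" if the optional group did not participate).
    static String secondGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? String.valueOf(m.group(2)) : "no match";
    }

    public static void main(String[] args) {
        String[] inputs = {"dog, cat and more", "dog, dog, dog", "dog, ccat"};
        for (String s : inputs) {
            System.out.println(s + " -> alternation: " + secondGroup(ALTERNATION, s)
                    + ", lookahead: " + secondGroup(LOOKAHEAD, s));
        }
    }
}
```

Both patterns capture cat in "dog, cat and more" and leave group 2 null for "dog, dog, dog". On "dog, ccat" the alternation leaves group 2 null, while the lookahead variant still captures cat.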