After hours of searching I decided to ask this question. Why doesn't the regular expression ^(dog).+?(cat)? work the way I think it should (i.e. capture the first dog, and the cat if there is one)? What am I missing here? My test strings:
dog, cat
dog, dog, cat
dog, dog, dog
The reason you do not get the optional cat after the reluctant quantifier .+? is that the group is both optional and unanchored, so the engine is not forced to make that match. The reluctant .+? stops after a single character, (cat)? is then free to match the empty string at that position, and the overall match succeeds without ever reaching the cat.
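To see this concretely, here is a minimal sketch that prints both groups and the whole match for the unanchored pattern. The class name ReluctantDemo and the describe helper are made up for illustration; only the pattern and the input strings come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns "<group 1> <group 2> [<whole match>]" for the unanchored pattern from the question.
    static String describe(String input) {
        Matcher m = Pattern.compile("^(dog).+?(cat)?").matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group(1) + " " + m.group(2) + " [" + m.group() + "]";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
            System.out.println(describe(s));
        }
    }
}
```

For all three inputs this prints dog null [dog,]: the whole match ends right after the comma, so the engine never gets as far as the cat.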
If you anchor the cat at the end of the string, i.e. use ^(dog).+?(cat)?$, you do get a match, because .+? now has to keep expanding until the rest of the pattern can reach the end of the input:
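    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The trailing $ forces .+? to expand until (cat)? can end the match at the end of the input.
    Pattern p = Pattern.compile("^(dog).+?(cat)?$");
    for (String s : new String[] {"dog, cat", "dog, dog, cat", "dog, dog, dog"}) {
        Matcher m = p.matcher(s);
        if (m.find()) {
            // group(2) is null when the optional (cat) group did not participate in the match
            System.out.println(m.group(1) + " " + m.group(2));
        }
    }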
This prints (demo 1)
dog cat
dog cat
dog null
Do you happen to know how to deal with it in case there's something after cat?
You can deal with it by constructing a trickier expression that matches anything except cat, like this:
^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?
Now the cat can appear anywhere in the string without an anchor (demo 2).
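Here is a small sketch of that pattern run against inputs with text after the cat. The class name UnanchoredCatDemo, the groups helper and the extra input strings are made up for illustration. The second pattern, which uses a negative lookahead, is not part of the answer above; it is an assumed alternative, included only because the character-class alternation can step over a cat that directly follows a c (as in "ccat").

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnanchoredCatDemo {
    // Pattern from the answer: consume pieces that cannot start "cat", then an optional cat.
    static final Pattern ALTERNATION = Pattern.compile("^(dog)(?:[^c]|c[^a]|ca[^t])+(cat)?");

    // Assumed alternative, not from the answer: consume one character at a time,
    // but only while the text at the current position does not start with "cat".
    static final Pattern LOOKAHEAD = Pattern.compile("^(dog)(?:(?!cat).)+(cat)?");

    // Returns "<group 1> <group 2>", or "no match" if the pattern does not match.
    static String groups(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) + " " + m.group(2) : "no match";
    }

    public static void main(String[] args) {
        String[] inputs = {"dog, cat and more", "dog, dog, cat, bird", "dog, dog, dog", "dog, ccat"};
        for (String s : inputs) {
            System.out.println(groups(ALTERNATION, s) + " | " + groups(LOOKAHEAD, s));
        }
    }
}
```

Both patterns give dog cat for the first two inputs and dog null for the third. They differ only on "dog, ccat", where the alternation consumes "cc" as one c[^a] step and reports dog null, while the lookahead variant reports dog cat.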