Java regex to exclude specific strings from a larger one

nvrs · Feb 3, 2010 · Viewed 12.2k times

I have been banging my head against this for some time now: I want to capture all character sequences matching [a-z]+[0-9]?, excluding keywords such as sin, cos, and tan. Having done my regex homework, I believe the following regex should work:

(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?

As you can see, I am using a negative lookahead along with alternation. The \b after the closing parenthesis of the non-capturing group is critical to avoid matching the "in" of "sin", etc. The regex makes sense, and in fact I have tried it in RegexBuddy with Java as the target implementation and got the expected result, but it doesn't work with Java's Matcher and Pattern objects! Any thoughts?

cheers

Answer

bobince · Feb 3, 2010

The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which it can't be if the next character is a-z.

Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.
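To make the point about cost concrete, here is a minimal sketch. It assumes a simplified stand-in pattern (not the exact regex from the question) whose lookahead has no trailing \b, and the class and method names are purely illustrative. A lookahead written this way rejects any word that merely starts with a keyword:

```java
import java.util.regex.Pattern;

class PrefixLookaheadDemo {
    // Simplified, hypothetical pattern: the lookahead (?!sin|cos|tan) has no
    // trailing \b, so it rejects every word that BEGINS with a keyword.
    // Backslashes are doubled because "\b" in a Java string literal is a
    // backspace character, not the regex word boundary.
    private static final Pattern PREFIX_EXCLUDING =
            Pattern.compile("\\b(?!sin|cos|tan)[a-z]+[0-9]?");

    // Returns true if the pattern matches anywhere in the input.
    static boolean matchesAnywhere(String input) {
        return PREFIX_EXCLUDING.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesAnywhere("sin"));   // false: keyword rejected
        System.out.println(matchesAnywhere("cost"));  // false: rejected only because it starts with "cos"
        System.out.println(matchesAnywhere("code"));  // true: does not start with a keyword
    }
}
```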

I suggest:

\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b
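As a rough sketch of how the suggested pattern might be used with Pattern and Matcher (the class name, method name, and sample input are assumptions made for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordExcludingRegex {
    // The suggested regex, written as a Java string literal. Every regex
    // backslash is doubled: "\b" alone would be a backspace character.
    private static final Pattern IDENTIFIER =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");

    // Collects every whole-word match that is not exactly sin, cos, or tan.
    static List<String> findIdentifiers(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "cost" is kept because the lookahead only rejects whole keywords.
        System.out.println(findIdentifiers("sin x1 cost tan y foo2"));
        // prints [x1, cost, y, foo2]
    }
}
```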

Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
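The match-then-filter alternative could look something like the sketch below. The keyword set, class name, and method name are illustrative assumptions; the idea is only that the regex stays simple and the exclusion list lives in ordinary code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchThenFilter {
    // Plain whole-word pattern with no lookahead (backslashes doubled for Java).
    private static final Pattern WORD = Pattern.compile("\\b[a-z]+[0-9]?\\b");

    // Hypothetical keyword list; extend as needed.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("sin", "cos", "tan"));

    // Matches every candidate word, then drops the ones in the keyword set.
    static List<String> findIdentifiers(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            String token = m.group();
            if (!KEYWORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findIdentifiers("sin x1 cost tan y foo2"));
        // prints [x1, cost, y, foo2]
    }
}
```

Adding a keyword here means adding one string to the set rather than editing the regex.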