Using Java regex, how to check if a string contains any of the words in a set?

user193116 · Mar 1, 2012 · Viewed 96.5k times

I have a set of words, say: apple, orange, pear, banana, kiwi.

I want to check whether a sentence contains any of the words listed above and, if it does, find which word matched. How can I accomplish this with a regex?

I am currently calling String.indexOf() for each word in my set. I assume this is less efficient than regex matching?

Answer

Dave Webb · Mar 1, 2012

TL;DR: For simple substring searches contains() is best, but for matching only whole words regular expressions are probably better.

The best way to see which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regexp code.
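
As a minimal sketch of that simplification, the two checks below give the same answer for a substring search. The class and method names (ContainsSketch, viaIndexOf, viaContains) are made up for illustration and are not part of the test code further down.

```java
class ContainsSketch {

    // The indexOf() form: -1 means the word was not found.
    static boolean viaIndexOf(String sentence, String word) {
        return sentence.indexOf(word) != -1;
    }

    // The contains() form: same result, with the intent stated directly.
    static boolean viaContains(String sentence, String word) {
        return sentence.contains(word);
    }

    public static void main(String[] args) {
        String sentence = "An apple is nice";
        System.out.println(viaIndexOf(sentence, "apple"));  // true
        System.out.println(viaContains(sentence, "apple")); // true
        System.out.println(viaContains(sentence, "kiwi"));  // false
    }
}
```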

To search for any of several words, the regular expression looks like this:

apple|orange|pear|banana|kiwi

The | works as an OR in Regular Expressions.
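
If the words live in a collection rather than in a hard-coded string, the same alternation can be built at runtime. The sketch below is one possible way to do it, not the code used for the timings in this answer; anyOf and firstMatch are hypothetical helper names, and wrapping each word in Pattern.quote() is an added precaution in case a word contains regex metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationSketch {

    // Builds an alternation such as apple|orange|... from a word list.
    // Pattern.quote() makes each word match literally.
    static Pattern anyOf(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the leftmost match in the sentence, or null if no word occurs.
    static String firstMatch(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Pattern p = anyOf(Arrays.asList("apple", "orange", "pear", "banana", "kiwi"));
        System.out.println(firstMatch(p, "An apple is nice"));    // apple
        System.out.println(firstMatch(p, "The quick brown fox")); // null
    }
}
```

Compiling the pattern once and reusing it, as here, avoids paying the compilation cost on every search.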

My very simple test code looks like this:

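import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {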

   private static String containsWord(Set<String> words,String sentence) {
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}

The results I got were as follows:

Contains took 5962ms
Regular Expression took 63475ms

Obviously timings will vary depending on the number of words being searched for and the Strings being searched, but in this test contains() was roughly 10 times faster than regular expressions (5962ms versus 63475ms) for a simple search like this.

By using Regular Expressions to search for Strings inside another String you're using a sledgehammer to crack a nut, so I guess we shouldn't be surprised that it's slower. Save Regular Expressions for when the patterns you want to find are more complex.

One case where you may want to use Regular Expressions is if indexOf() and contains() won't do the job because you only want to match whole words and not just substrings, e.g. you want to match pear but not spears. Regular Expressions handle this case well as they have the concept of word boundaries.

In this case we'd change our pattern to:

\b(apple|orange|pear|banana|kiwi)\b

The \b matches only at the beginning or end of a word, and the parentheses group the alternatives together so that both boundaries apply to every word.

Note that when defining this pattern in Java code you need to escape each backslash with another backslash:

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
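
To illustrate the difference word boundaries make, here is a small sketch that collects every whole-word match in a sentence. The class name WholeWordSketch and the method findAll are invented for this example; only the pattern itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSketch {

    private static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Collects every whole-word match, in order of appearance.
    static List<String> findAll(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = WHOLE_WORD.matcher(sentence);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the word inside the parentheses
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("A pear and a kiwi"));   // [pear, kiwi]
        System.out.println(findAll("He sharpened spears")); // []
    }
}
```

Note that with this pattern the plural "pears" does not match either, because there is no word boundary between "pear" and the trailing "s".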