what is the fastest substring search method in Java

Joey picture Joey · Aug 20, 2013 · Viewed 21.8k times · Source

I need to implement a way to search substring (needles) in a list of string (haystack) using Java.

More specifically, my app has a list of user profiles. If I type some letters, for example, "Ja", and then search, then all the users whose name contains "ja" should show up. For instance, the result could be "Jack", "Jackson", "Jason", "Dijafu".

In Java, as I know, there are 3 build-in method to see search substring in a string.

  1. string.contains()

  2. string.indexOf()

  3. regular expression. it is something like string.matches("ja"))

My question is: What are the runtimes of each method above? which one is the fastest or most efficient or most popular way to check if the list of string contains a given substring.

I know there exists some algorithms that do the same thing, such as Boyer–Moore string search algorithm, Knuth–Morris–Pratt algorithm and so on. I do not want to use them because I just have a small list of strings, and I think using them is kind of overkill for me right now. Also I have to type a lot of extra coding for such a non-build-in algorithm. If you think my thoughts is not correct, please feel free to correct me.

Answer

CoronA picture CoronA · Aug 15, 2016

The accepted answer is not correct and not complete.

  • indexOf() does a naive string search using backtracking on mismatches. This is quite fast on small patterns/texts but shows very poor performance on large texts
  • contains("ja") should be comparable to indexOf (because it delegates to it)
  • matches("ja") will not deliver the correct result, because it searches for an exact match (only the string "ja" will match exactly)
  • Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("jack"); m.find(); would be the correct way to find texts with regular expressions. In practice (using large texts) it will be the most efficient way using only the java api. This is because a constant pattern (like "ja") will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast)