When would you use KMP over Boyer-Moore?

Eric · Apr 18, 2013 · Viewed 13.6k times

I am currently learning about pattern-matching algorithms and have come across these two. My general understanding is:

KMP

  • Compares the text left-to-right
  • Uses a failure array to shift intelligently (see the sketch after this list)
  • Takes O(m) time, where m is the length of the pattern, to compute the failure array
  • Takes O(m) space
  • Takes O(n) time to search a text of length n
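
A minimal Python sketch of those two pieces (the failure array and the O(n) search), assuming the standard textbook formulation of KMP; the names failure_array and kmp_search and the example strings are my own, not from any source:

    def failure_array(pattern):
        # fail[i] = length of the longest proper prefix of pattern[:i+1]
        # that is also a suffix of it; built in O(m).
        fail = [0] * len(pattern)
        k = 0  # length of the border matched so far
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]          # fall back to the next shorter border
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        return fail

    def kmp_search(text, pattern):
        # Return start indices of all occurrences; O(n) after O(m) preprocessing.
        fail = failure_array(pattern)
        hits, k = [], 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]          # shift the pattern using the failure array
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                hits.append(i - k + 1)
                k = fail[k - 1]          # keep going to find overlapping matches
        return hits

    print(kmp_search("abababca", "abab"))  # [0, 2]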

BM

  • Compares the pattern from its last character, right-to-left
  • Uses bad-character jumps and good-suffix jumps (the bad-character rule is sketched after this list)
  • Takes O(m + size of alphabet) time to compute the tables
  • Takes O(m + size of alphabet) space
  • Takes O(n) time to search in the worst case, but usually does much better in practice
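
The bad-character rule is the easier half to show in code. Below is a sketch of a Horspool-style search, i.e. a simplified Boyer-Moore that uses only the bad-character jump and skips the good-suffix rule; the function names and the example are my own:

    def bad_character_table(pattern):
        # For each character, how far the window may shift when that character
        # sits under the pattern's last position; unseen characters get the
        # full pattern length. Built with a dict here instead of a full
        # alphabet-sized array.
        m = len(pattern)
        shift = {}
        for i, ch in enumerate(pattern[:-1]):  # the last character keeps the default
            shift[ch] = m - 1 - i
        return shift

    def horspool_search(text, pattern):
        # Compare right-to-left inside each window, then jump using the table.
        m, n = len(pattern), len(text)
        shift = bad_character_table(pattern)
        hits, i = [], 0
        while i <= n - m:
            j = m - 1
            while j >= 0 and text[i + j] == pattern[j]:  # start from the last character
                j -= 1
            if j < 0:
                hits.append(i)
            i += shift.get(text[i + m - 1], m)           # bad-character jump
        return hits

    print(horspool_search("here is a simple example", "example"))  # [17]

Full Boyer-Moore adds the good-suffix table on top of this; the O(m) part of the preprocessing bound above comes from that table, and the alphabet-sized part comes from storing a bad-character shift for every character.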

I came across the following true-or-false question, which prompted this one:

The Knuth-Morris-Pratt (KMP) algorithm is a good choice if we want to search for the same pattern repeatedly in many different texts.

So I believe the answer is true, simply because every time you run the algorithm on a different text, KMP's preprocessing is only O(m), whereas for BM it is O(m + size of alphabet). However, I am not sure my assumption is correct that a new table has to be recomputed every time the algorithm is rerun. Say the texts are always over the English alphabet: I would only need to compute the table once and could just reuse it. So, at the end of the day, does the answer to this question depend on all the texts being drawn from the same alphabet, or is there some other factor that may affect it?

Answer

tmyklebu · Apr 18, 2013

In theory, both algorithms will have "similar" performance; in the searching phase, KMP will do about 2n character comparisons and Boyer-Moore about 3n in the worst case. In neither case do you need to repeat the preprocessing when you get a new text.
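
To make the "no repeated preprocessing" point concrete, here is a sketch in which the KMP table is built from the pattern exactly once and then reused across several texts; it repeats the failure-array helper from the earlier sketch so it runs on its own, and the texts and counts are made-up examples:

    def failure_array(pattern):
        # Same O(m) helper as in the earlier sketch; depends only on the pattern.
        fail, k = [0] * len(pattern), 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        return fail

    def kmp_count(text, pattern, fail):
        # Count occurrences using an already-built failure array: O(n) per text.
        count, k = 0, 0
        for ch in text:
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                count += 1
                k = fail[k - 1]
        return count

    pattern = "aba"
    fail = failure_array(pattern)              # preprocessing happens exactly once
    for text in ["ababa", "abcabab", "xyz"]:   # many different texts, same table
        print(kmp_count(text, pattern, fail))  # 2, 1, 0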

But the real answer is that you shouldn't use either one in practice.

The linear auxiliary storage needed by both algorithms leads to considerably...rougher performance on modern architectures because of all of the extra memory accesses.

However, the ideas behind Boyer-Moore and KMP underpin most fast string matching algorithms. Something like KMP's "failure function" idea is used by every practically effective string matching algorithm I know of; it turns out that you can compute a suboptimal "failure function" for a pattern on-the-fly that still gives you linear-time matching while needing only constant additional space. Boyer-Moore is faster than linear in the "average case" of matching a fixed pattern against random noise, and this is borne out in many practical situations.
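
As a rough illustration of the "faster than linear on random noise" claim, the snippet below counts how many text characters a Horspool-style scan (bad-character rule only, as in the earlier sketch) actually inspects on random lowercase text; the alphabet, lengths, and pattern are arbitrary choices of mine:

    import random
    import string

    def horspool_inspections(text, pattern):
        # Count text characters examined by a bad-character-rule-only scan.
        m, n = len(pattern), len(text)
        shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
        looked, i = 0, 0
        while i <= n - m:
            j = m - 1
            while j >= 0:
                looked += 1
                if text[i + j] != pattern[j]:
                    break
                j -= 1
            i += shift.get(text[i + m - 1], m)
        return looked

    random.seed(0)
    text = "".join(random.choice(string.ascii_lowercase) for _ in range(100_000))
    print(horspool_inspections(text, "qwertyuiop"))  # far fewer than 100,000 inspections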