Efficient way to search a stream for a string

Alex Spurling picture Alex Spurling · May 10, 2009 · Viewed 33k times · Source

Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.

Naively, I might try to do something like this (in Java):

public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    while((numCharsRead = reader.read(buffer)) > 0) {
        if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
            return true;
    }
    return false;
}

Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:

Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"

How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?

Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.

Answer

Norman Ramsey picture Norman Ramsey · May 11, 2009

There are three good solutions here:

  1. If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterminstic finite-state machine. Your state will be a list of indices into the string you are searching, and your logic looks something like this (pseudocode):

    String needle;
    n = needle.length();
    
    for every input character c do
      add index 0 to the list
      for every index i in the list do
        if c == needle[i] then
          if i + 1 == n then
            return true
          else
            replace i in the list with i + 1
          end
        else
          remove i from the list
        end
      end
    end
    

    This will find the string if it exists and you will never need a buffer.

  2. Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.

  3. If you want something faster, create a buffer whose size is at least twice n, and user Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.