How to find best fuzzy match for a string in a large string database

guillermooo picture guillermooo · Nov 21, 2008 · Viewed 14.7k times · Source

I have a database of strings (arbitrary length) which holds more than one million items (potentially more).

I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second.

My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their length.

However, as I will need to perform this operation very often, I'm thinking about building an index of the db strings to keep in memory and query the index, not the db directly.

Any ideas on how to approach this problem differently or how to build the in-memory index?

Answer

zaratustra picture zaratustra · Nov 21, 2008

This paper seems to describe exactly what you want.

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.