Fast Hamming distance scoring

Sardar picture Sardar · Jun 23, 2010 · Viewed 7.5k times · Source

There is a database with N fixed length strings. There is a query string of the same length. The problem is to fetch first k strings from the database that have the smallest Hamming distance to q.

N is small (about 400), strings are long, fixed in length. Database doesn't change, so we can pre-compute indexes. Queries vary strongly, caching and/or pre-computation is not an option. There are lots of them per second. We need always k results, even if k-1 results have match 0 (sorting on Hamming distance and take first k, so locality sensitive hashing and similar approaches won't do). kd-tree and similar space partitioning will probably perform worser than linear search (strings can be very long). BK-tree is currently best choice, but it is still slow and complicated than it needs to be.

It feels like there is an algorithm, which will build an index, which will discard most of the entries in very few steps, leaving k <= t << N entries to compute real Hamming distance.

People suggesting fuzzy string matching based on Levenstein distance - thanks, but the problem is much simpler. Generalized distance metric based approaches (like BK-trees) are good, but maybe there something utilizing the facts described above (small DB/long fixed size strings, simple Hamming distance)

Links, keywords, papers, ideas? =)

Answer

tbischel picture tbischel · Jun 23, 2010

This seems like a task where a Vantage Point (VP tree) might work... since the hamming distance should satisfy the triangle inequality theorem, you should be able to apply it... its also good for identifying k-closest. I've seen it in image indexing database setups... you might check section 5 of this paper as an example of what I'm talking about (albeit in a different field).