How to structure an index for type ahead for extremely large dataset using Lucene or similar?

autocomplete lucene indexing typeahead

Peter · May 4, 2010 · Viewed 7.3k times · Source

I have a dataset of 200million+ records and am looking to build a dedicated backend to power a type ahead solution. Lucene is of interest given its popularity and license type, but I'm open to other open source suggestions as well. I am looking for advice, tales from the trenches, or even better direct instruction on what I will need as far as amount of hardware and structure of software. Requirements:

Must have:

The ability to do starts with substring matching (I type in 'st' and it should match 'Stephen')
The ability to return results very quickly, I'd say 500ms is an upper bound.

Nice to have:

The ability to feed relevance information into the indexing process, so that, for example, more popular terms would be returned ahead of others and not just alphabetical, aka Google style.
In-word substring matching, so for example ('st' would match 'bestseller')

Note:

This index will purely be used for type ahead, and does not need to serve standard search queries.
I am not worried about getting advice on how to set up the front end or AJAX, as long as the index can be queried as a service or directly via Java code.

Up votes for any useful information that allows me to get closer to an enterprise level type ahead solution

Answer

If each record is relatively small (less than a few words) you can try a Trie data structure:

http://en.wikipedia.org/wiki/Trie

It is built for lightening fast prefix matching and it is relatively space efficient. I've used this data structure for the exact auto-complete functionality you're looking for, and I know of others that have done this for high volume production websites. In my experience you can expect response times of tens of milliseconds for a single query.

You can implement a Trie yourself pretty easily, or there are implementations you can download. See

Where do I find a standard Trie based map implementation in Java?

Depending on what implementation you use, it should be relatively simple to tag each indexed record with a relevance score, which you can then use to sort by when you get a list of records from a query.

How to structure an index for type ahead for extremely large dataset using Lucene or similar?

Answer

Related questions