How to structure an index for type ahead for extremely large dataset using Lucene or similar?

Peter picture Peter · May 4, 2010 · Viewed 7.3k times · Source

I have a dataset of 200million+ records and am looking to build a dedicated backend to power a type ahead solution. Lucene is of interest given its popularity and license type, but I'm open to other open source suggestions as well. I am looking for advice, tales from the trenches, or even better direct instruction on what I will need as far as amount of hardware and structure of software. Requirements:

Must have:

  • The ability to do starts with substring matching (I type in 'st' and it should match 'Stephen')
  • The ability to return results very quickly, I'd say 500ms is an upper bound.

Nice to have:

  • The ability to feed relevance information into the indexing process, so that, for example, more popular terms would be returned ahead of others and not just alphabetical, aka Google style.
  • In-word substring matching, so for example ('st' would match 'bestseller')

Note:

  • This index will purely be used for type ahead, and does not need to serve standard search queries.
  • I am not worried about getting advice on how to set up the front end or AJAX, as long as the index can be queried as a service or directly via Java code.

Up votes for any useful information that allows me to get closer to an enterprise level type ahead solution

Answer

bajafresh4life picture bajafresh4life · May 5, 2010

If each record is relatively small (less than a few words) you can try a Trie data structure:

http://en.wikipedia.org/wiki/Trie

It is built for lightening fast prefix matching and it is relatively space efficient. I've used this data structure for the exact auto-complete functionality you're looking for, and I know of others that have done this for high volume production websites. In my experience you can expect response times of tens of milliseconds for a single query.

You can implement a Trie yourself pretty easily, or there are implementations you can download. See

Where do I find a standard Trie based map implementation in Java?

Depending on what implementation you use, it should be relatively simple to tag each indexed record with a relevance score, which you can then use to sort by when you get a list of records from a query.