I have a dataset of 200 million+ records and am looking to build a dedicated backend to power a type-ahead solution. Lucene is of interest given its popularity and license type, but I'm open to other open-source suggestions as well. I am looking for advice, tales from the trenches, or, even better, direct instruction on what I will need in terms of hardware and software structure. Requirements:
Must have:
Nice to have:
Note:
Upvotes for any useful information that gets me closer to an enterprise-level type-ahead solution.
If each record is relatively small (less than a few words), you can try a Trie data structure:
http://en.wikipedia.org/wiki/Trie
It is built for lightning-fast prefix matching and it is relatively space efficient. I've used this data structure for the exact auto-complete functionality you're looking for, and I know of others who have done this for high-volume production websites. In my experience you can expect response times of tens of milliseconds for a single query.
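In case a concrete shape helps, here is a minimal Java sketch of the idea: a trie you insert records into, plus a prefix lookup that collects every record sharing the typed prefix. The class and method names (Trie, insert, startsWith) are my own for illustration, not from any particular library, and a production version would cap result counts and compress nodes.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal trie for type-ahead prefix lookups (illustrative sketch only).
    class Trie {
        private static class Node {
            Map<Character, Node> children = new HashMap<>();
            String word; // non-null only where an inserted record ends
        }

        private final Node root = new Node();

        void insert(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.word = word;
        }

        // Collect every inserted record that starts with the given prefix.
        List<String> startsWith(String prefix) {
            Node node = root;
            for (char c : prefix.toCharArray()) {
                node = node.children.get(c);
                if (node == null) {
                    return List.of(); // no record shares this prefix
                }
            }
            List<String> results = new ArrayList<>();
            collect(node, results);
            return results;
        }

        private void collect(Node node, List<String> results) {
            if (node.word != null) {
                results.add(node.word);
            }
            for (Node child : node.children.values()) {
                collect(child, results);
            }
        }

        public static void main(String[] args) {
            Trie trie = new Trie();
            trie.insert("new york");
            trie.insert("new jersey");
            trie.insert("nevada");
            // Prints the two records starting with "new" (traversal order not guaranteed).
            System.out.println(trie.startsWith("new"));
        }
    }

The walk down the trie costs only as many steps as the prefix has characters, which is where the fast response times come from.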
You can implement a Trie yourself pretty easily, or there are implementations you can download. See
Where do I find a standard Trie based map implementation in Java?
Depending on what implementation you use, it should be relatively simple to tag each indexed record with a relevance score, which you can then sort by when you get a list of records back from a query.
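A short sketch of that last step, assuming you attach a precomputed score (popularity, click count, etc.) to each record; the Suggestion type and field names here are hypothetical, not from any library:

    import java.util.Comparator;
    import java.util.List;

    // Hypothetical record type: a suggestion paired with a precomputed relevance score.
    record Suggestion(String text, double relevance) {}

    public class RankSuggestions {
        // Order trie matches by descending relevance and keep only the top N.
        static List<Suggestion> topN(List<Suggestion> matches, int n) {
            return matches.stream()
                    .sorted(Comparator.comparingDouble(Suggestion::relevance).reversed())
                    .limit(n)
                    .toList();
        }

        public static void main(String[] args) {
            List<Suggestion> matches = List.of(
                    new Suggestion("new york", 9.5),
                    new Suggestion("newark", 4.2),
                    new Suggestion("new jersey", 7.1));
            // Prints "new york" then "new jersey", the two highest-scoring matches.
            System.out.println(topN(matches, 2));
        }
    }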