Algorithms/theory behind predictive autocomplete?

chiborg picture chiborg · Jul 12, 2012 · Viewed 11.6k times · Source

Simple word autocomplete just displays a list of words that match the characters that were already typed. But I would like to order the words in the autocomplete list according to the probability of the words occuring, depending on the words that were typed before, relying on a statistical model of a text corpus. What algorithms and data structures do I need for this? Can you give me links for good tutorials?

Answer

Fred Foo picture Fred Foo · Jul 12, 2012

You don't need probability for autocompletion. Instead, build a prefix tree (aka a trie) with the words in the corpus as keys and their frequencies as values. When you encounter a partial string, walk the trie as far as you can, then generate all the suffixes from the point you've reached and sort them by frequency.

When a user enters a previously unseen string, just add it to the trie with frequency one; when a user enters a string that you had seen (perhaps by selecting it from the candidate list), increment its frequency.

[Note that you can't do the simple increment with a probability model; in the worst case, you'd have to recompute all the probabilities in the model.]

If you want to delve deeper into this kind of algorithms, I highly suggest you read (the first chapters of) Speech and Language Processing by Jurafsky and Martin. It treats discrete probability for language processing in quite some detail.