I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.
What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.
I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:
import nltk
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
prob_sum = 0
for i in kneser_ney.samples():
    prob_sum += kneser_ney.prob(i)
print(prob_sum)
The output is "41.51696428571428". Depending on the corpus size, this value grows without bound, which makes whatever prob() returns anything but a probability distribution.
Looking at the NLTK code I would say that the implementation is questionable. Maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint, please? Otherwise: do you know of any working Python implementation? I don't really want to implement it myself.
I think you are misunderstanding what Kneser-Ney is computing.
From Wikipedia:
The normalizing constant λ_{w_{i-1}} has value chosen carefully to make the sum of conditional probabilities p_KN(w_i | w_{i-1}) equal to one.
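For concreteness, the interpolated bigram form of the estimate (in roughly the same notation, with δ as the discount and c(·) as a count) looks like this:

p_KN(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - δ, 0) / Σ_{w'} c(w_{i-1} w')  +  λ_{w_{i-1}} · p_KN(w_i)

λ_{w_{i-1}} is exactly the factor chosen so that this sums to one over all w_i for a fixed w_{i-1}.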
Of course we're talking about bigrams here, but the same principle is true for higher-order models. Basically, what this quote means is that, for a fixed context w_{i-1} (or more context for higher-order models), the probabilities of all w_i must add up to one. What you are doing when you add up the probabilities of all samples is summing over multiple contexts, which is why you end up with a "probability" greater than 1. If you keep the context fixed, as in the following code sample, you end up with a number <= 1.
import nltk
from nltk.util import ngrams
from nltk.corpus import gutenberg

# Build padded word trigrams over every sentence in the Gutenberg corpus.
gut_ngrams = (ngram for sent in gutenberg.sents()
              for ngram in ngrams(sent, 3,
                                  pad_left=True, pad_right=True,
                                  left_pad_symbol="BOS", right_pad_symbol="EOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum the probabilities of every trigram with the fixed context ("I", "confess").
prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
The output, based on the NLTK Gutenberg corpus subset, is as follows.
(u'I', u'confess', u'.--'):0.00657894736842
(u'I', u'confess', u'what'):0.00657894736842
(u'I', u'confess', u'myself'):0.00657894736842
(u'I', u'confess', u'also'):0.00657894736842
(u'I', u'confess', u'there'):0.00657894736842
(u'I', u'confess', u',"'):0.0328947368421
(u'I', u'confess', u'that'):0.164473684211
(u'I', u'confess', u'"--'):0.00657894736842
(u'I', u'confess', u'it'):0.0328947368421
(u'I', u'confess', u';'):0.00657894736842
(u'I', u'confess', u','):0.269736842105
(u'I', u'confess', u'I'):0.164473684211
(u'I', u'confess', u'unto'):0.00657894736842
(u'I', u'confess', u'is'):0.00657894736842
0.723684210526
The reason this sum (.72) is less than 1 is that the probabilities are calculated only over trigrams appearing in the corpus whose first word is "I" and whose second word is "confess". The remaining .28 is reserved for the w_i which do not follow "I" and "confess" in the corpus. That is the whole point of smoothing: to reallocate some probability mass from the ngrams that appear in the corpus to those that don't, so that you don't end up with a bunch of zero-probability ngrams.
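If you want to convince yourself that the distribution is well formed, a quick sanity check (just a sketch against the kneser_ney object built above, not part of the original question) is to group the samples by their two-word context and confirm that each per-context sum stays at or below 1:

from collections import defaultdict

# Accumulate the probability mass assigned to each (w1, w2) context.
context_sums = defaultdict(float)
for trigram in kneser_ney.samples():
    context_sums[trigram[:2]] += kneser_ney.prob(trigram)

# Every per-context sum should be <= 1; the shortfall for a given context
# is the mass reserved for continuations that never occur after it.
print(max(context_sums.values()))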
Also, doesn't the line
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
compute character trigrams? I think the text needs to be tokenized into words to compute word trigrams.
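A minimal sketch of how one might do that, assuming NLTK's default word tokenizer is acceptable (it may require downloading the punkt model first):

import nltk

text = ("What a piece of work is man! how noble in reason! how infinite in faculty! "
        "in form and moving how express and admirable! in action how like an angel! "
        "in apprehension how like a god! the beauty of the world, the paragon of animals!")

# Tokenize into words first, then build trigrams over the token list.
# nltk.word_tokenize may need: nltk.download('punkt')
tokens = nltk.word_tokenize(text)
ngrams = list(nltk.trigrams(tokens))
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)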