I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.
What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.
I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:
import nltk
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
prob_sum = 0
for i in kneser_ney.samples():
    prob_sum += kneser_ney.prob(i)
print(prob_sum)
The output is "41.51696428571428". Depending on the corpus size, this value grows without bound, which makes whatever prob() returns anything but a probability distribution.
Looking at the NLTK code I would say that the implementation is questionable. Maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint, please? Otherwise: do you know of any working Python implementation? I don't really want to implement it myself.
I think you are misunderstanding what Kneser-Ney is computing.
From Wikipedia:
The normalizing constant λ_{w_{i-1}} has value chosen carefully to make the sum of conditional probabilities p_KN(w_i | w_{i-1}) equal to one.
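For concreteness, the interpolated bigram form of the estimate (in roughly the same notation, with δ as the discount and c(·) as a count) looks like this:

p_KN(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - δ, 0) / Σ_{w'} c(w_{i-1} w')  +  λ_{w_{i-1}} · p_KN(w_i)

λ_{w_{i-1}} is exactly the factor chosen so that this sums to one over all w_i for a fixed w_{i-1}.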
Of course we're talking about bigrams here, but the same principle is true for higher-order models. Basically, what this quote means is that, for a fixed context w_{i-1} (or more context for higher-order models), the probabilities of all w_i must add up to one. What you are doing when you add up the probabilities of all samples is summing over multiple contexts, which is why you end up with a "probability" greater than 1. If you keep the context fixed, as in the following code sample, you end up with a number <= 1.
import nltk
from nltk.util import ngrams
from nltk.corpus import gutenberg

# Build padded word trigrams over every sentence in the Gutenberg corpus.
gut_ngrams = (ngram for sent in gutenberg.sents()
              for ngram in ngrams(sent, 3,
                                  pad_left=True, pad_right=True,
                                  left_pad_symbol="BOS", right_pad_symbol="EOS"))
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# Sum the probabilities of every trigram with the fixed context ("I", "confess").
prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "I" and i[1] == "confess":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)
The output, based on the NLTK Gutenberg corpus subset, is as follows.
(u'I', u'confess', u'.--'):0.00657894736842
(u'I', u'confess', u'what'):0.00657894736842
(u'I', u'confess', u'myself'):0.00657894736842
(u'I', u'confess', u'also'):0.00657894736842
(u'I', u'confess', u'there'):0.00657894736842
(u'I', u'confess', u',"'):0.0328947368421
(u'I', u'confess', u'that'):0.164473684211
(u'I', u'confess', u'"--'):0.00657894736842
(u'I', u'confess', u'it'):0.0328947368421
(u'I', u'confess', u';'):0.00657894736842
(u'I', u'confess', u','):0.269736842105
(u'I', u'confess', u'I'):0.164473684211
(u'I', u'confess', u'unto'):0.00657894736842
(u'I', u'confess', u'is'):0.00657894736842
0.723684210526
The reason this sum (.72) is less than 1 is that the probabilities are calculated only over trigrams appearing in the corpus whose first word is "I" and whose second word is "confess". The remaining .28 is reserved for the w_i which do not follow "I" and "confess" in the corpus. That is the whole point of smoothing: to reallocate some probability mass from the ngrams that appear in the corpus to those that don't, so that you don't end up with a bunch of zero-probability ngrams.
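If you want to convince yourself that the distribution is well formed, a quick sanity check (just a sketch against the kneser_ney object built above, not part of the original question) is to group the samples by their two-word context and confirm that each per-context sum stays at or below 1:

from collections import defaultdict

# Accumulate the probability mass assigned to each (w1, w2) context.
context_sums = defaultdict(float)
for trigram in kneser_ney.samples():
    context_sums[trigram[:2]] += kneser_ney.prob(trigram)

# Every per-context sum should be <= 1; the shortfall for a given context
# is the mass reserved for continuations that never occur after it.
print(max(context_sums.values()))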
Also, doesn't the line
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
compute character trigrams? I think the text needs to be tokenized into words to compute word trigrams.
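A minimal sketch of how one might do that, assuming NLTK's default word tokenizer is acceptable (it may require downloading the punkt model first):

import nltk

text = ("What a piece of work is man! how noble in reason! how infinite in faculty! "
        "in form and moving how express and admirable! in action how like an angel! "
        "in apprehension how like a god! the beauty of the world, the paragon of animals!")

# Tokenize into words first, then build trigrams over the token list.
# nltk.word_tokenize may need: nltk.download('punkt')
tokens = nltk.word_tokenize(text)
ngrams = list(nltk.trigrams(tokens))
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)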