Hello people, I'm implementing the Stupid Backoff smoothing technique (page 2, equation 5) for a project I'm working on, and I have a question about its implementation. It's a smoothing algorithm used in NLP; Good-Turing is, I guess, the most well-known similar algorithm.
A brief description of the algorithm: when estimating the probability of a word appearing in a sentence, it first looks for the word's context at the n-gram level. If that n-gram has not been observed, it recurses to the (n-1)-gram and multiplies its score by 0.4. The recursion stops at unigrams.
So if I want to find the probability of "day" in the context "a sunny day", it would first check whether the trigram "a sunny day" exists in the corpus. If not, it would try the same with the bigram "sunny day", and finally it would just take the frequency of "day" divided by the corpus size (the total number of words in the training data).
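For concreteness, the kind of count tables I have in mind looks roughly like this (a minimal sketch; the function and variable names are just illustrative, not from the paper):

```python
from collections import Counter

def build_counts(tokens, max_n=3):
    """Count every n-gram up to max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "it was a sunny morning and a fine day".split()
counts = build_counts(tokens)
corpus_size = len(tokens)  # total number of words in the training data

print(counts[("a", "sunny", "day")])   # trigram count (0 here)
print(counts[("sunny", "day")])        # bigram count (0 here)
print(counts[("day",)] / corpus_size)  # unigram frequency / corpus size
```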
My question is: do I multiply the score by 0.4 every time I reduce the size of the n-gram?
So in the above example, if we can find neither the trigram nor the bigram, would the final score be:
0.4 * 0.4 * frequency(day) / corpus_size?
Or do I multiply only once at the final level, so that no matter how many backoffs I have to make, the final score is multiplied by 0.4 just once?
Basically, I read equation 5 just as you describe it in your math above.
So for "a sunny day" where no instance was observed, you would calculate S("day" | "a sunny"). Not finding the trigram "a sunny day" you would take case two in equation 5, and estimate S("day" | "a sunny") as alpha * S("day" | "sunny").
If, again, you recorded no occurrences of "sunny day", you would approximate S("day" | "sunny") as alpha * S("day"), where S("day") is the terminal case f("day") / N (N being the total number of observed unigram tokens, i.e. your corpus size).
By setting alpha to 0.4, you get exactly your first reading above: 0.4 * 0.4 * frequency("day") / corpus_size.
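If it helps to see it in code, here is a minimal recursive sketch of that reading of equation 5 in Python (the Counter-based count setup and the names are just illustrative, not from the paper):

```python
from collections import Counter

def stupid_backoff(word, context, counts, corpus_size, alpha=0.4):
    """S(word | context) as in equation 5: relative frequency if the full
    n-gram was observed, otherwise alpha times the score with a context
    that is one word shorter. Terminal case: f(word) / N."""
    if not context:                           # unigram: f(w) / N
        return counts[(word,)] / corpus_size
    ngram = tuple(context) + (word,)
    if counts[ngram] > 0:                     # n-gram observed: f(ngram) / f(context)
        return counts[ngram] / counts[tuple(context)]
    # Not observed: back off, multiplying by alpha at *every* step.
    return alpha * stupid_backoff(word, context[1:], counts, corpus_size, alpha)

# Toy corpus containing neither the trigram "a sunny day" nor the bigram "sunny day":
tokens = "a sunny morning and a fine day".split()
counts = Counter(tuple(tokens[i:i + n])
                 for n in range(1, 4)
                 for i in range(len(tokens) - n + 1))
corpus_size = len(tokens)

# Backs off twice, so this prints 0.4 * 0.4 * f("day") / N
print(stupid_backoff("day", ["a", "sunny"], counts, corpus_size))
```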
Hope this helps.
-bms20