My colleagues, Jian Zhu, Rick Jin and Charlie Yan, taught me something interesting about NLP. Here follows part of what I learned from them.

An n-gram model can be easily computed by counting the number of transitions from history h to a word w. This method is known as *maximum likelihood estimation* (MLE).

However, unless we have sufficiently large training corpus which makes the ML estimate very confident, our model may be embarrassing if a queries hw has never appeared in the training corpus, the case usually denoted by C(hw)=0.

C(hw)=0 is embarrassing because “unseen” does not mean “impossible”: That we do not see a transition from h to w in a corpus does not mean that this transition will not happen in any other corpora.

An often-used solution to this embarrassing is “smoothing” over the function of w, P(w|h) for every h, and make those w’s with P(w|h)=0 (i.e., C(hw)=0) a value little larger than 0. From where these little values come from? An answer is to reduce those P(w|h) of w’s with P(w|h)>0. This step is known as *discounting*. The collected value can then be distributed to those w’s with P(w|h)=0. This whole discounting-based smoothing operation is like to tax rich people and distribute the money to poor people (those with 0 cents).

Discounting is not the only strategy of smoothing. A very simple and well known method is Laplacian smoothing. It does not tax any rich people, instead, it gives every people (rich and poor) a cent.

There are many ways to distribute the tax from rich people to poor people. Backoff is one of them. If C(hw)=0, the collected tax is given to P(w|h) by taking C(h’w) into consideration. Precisely, is computed by in the following recursive rule:

where is with the oldest word removed, is discounted probability (computed from discounted counts), and is the backoff weight (bow).

In practice, we usually compute in an offline process, but compute as an online procedure.

However, before we can compute online, we need to know the backoff weight , which should had been computed offline. The derivation of is based on the fact that

Without losing generality, let us assume that the vocabulary includes 4 words: , where for a given history , , , but and . In this example, we have

Since the left column sums to 1, we have

However, the equation is not good enough, because factors in the denominator, and , are still unknown. Another clue from above two-column equations is that the sum of the right column is also 1. More than that, we have

This helps us write the above head-stretching denominator as

This gives us

Write it in a general form, we get

Note that the sum-condition in the denominator is identical as that in the nominator. The difference between the nominator and denominator lies only in v.s. .