Backoff in N-Gram Language Models

My colleagues, Jian Zhu, Rick Jin and Charlie Yan, taught me something interesting about NLP. Here follows part of what I learned from them.

An n-gram model can be estimated simply by counting transitions from a history h to a word w and normalizing: P(w|h) = C(hw)/C(h). This method is known as maximum likelihood estimation (MLE).
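
To make this concrete, here is a minimal sketch of MLE by counting. The toy corpus, the bigram order, and all names are made up for illustration:

```python
from collections import Counter

# Toy corpus; in practice this would be a large collection of sentences.
corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigram_counts = Counter()   # C(hw): count of word w following history h
history_counts = Counter()  # C(h): count of history h

for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for h, w in zip(tokens, tokens[1:]):
        bigram_counts[(h, w)] += 1
        history_counts[h] += 1

def p_mle(w, h):
    """Maximum likelihood estimate: P(w|h) = C(hw) / C(h)."""
    if history_counts[h] == 0:
        return 0.0
    return bigram_counts[(h, w)] / history_counts[h]

print(p_mle("cat", "the"))  # 2/3
print(p_mle("sat", "the"))  # 0.0 -- the unseen case discussed below
```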

However, unless we have a sufficiently large training corpus that makes the ML estimates confident, our model gets into an embarrassing situation whenever a queried n-gram hw has never appeared in the training corpus, the case usually denoted by C(hw)=0.

C(hw)=0 is embarrassing because “unseen” does not mean “impossible”: that we do not see a transition from h to w in our corpus does not mean that this transition cannot appear in other corpora.

An often-used solution to this embarrassment is “smoothing”: for every h, we smooth the function P(w|h) over w, giving those w’s with P(w|h)=0 (i.e., C(hw)=0) a value slightly larger than 0. Where do these small values come from? One answer is to reduce the P(w|h) of those w’s with P(w|h)>0. This step is known as discounting. The collected value can then be distributed to the w’s with P(w|h)=0. The whole discounting-based smoothing operation is like taxing rich people and distributing the money to the poor (those with 0 cents).
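
One common way to collect the “tax” is absolute discounting, which subtracts a small constant from every seen count. This is only one possible scheme, not necessarily the one a given toolkit uses; the sketch below reuses the counters from the previous snippet:

```python
D = 0.5  # discount (the "tax") subtracted from every seen count; 0 < D < 1

def p_star(w, h):
    """Discounted probability P*(w|h) for seen n-grams, assuming absolute discounting."""
    c = bigram_counts[(h, w)]
    if c == 0 or history_counts[h] == 0:
        return 0.0
    return (c - D) / history_counts[h]

def reserved_mass(h):
    """Probability mass collected from the "rich" words of history h,
    available for redistribution to the unseen ("poor") words."""
    seen = [w for (hh, w) in bigram_counts if hh == h]
    return 1.0 - sum(p_star(w, h) for w in seen)
```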

Discounting is not the only smoothing strategy. A very simple and well-known method is Laplacian smoothing. It does not tax the rich; instead, it gives everyone (rich and poor) one extra cent.
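
A minimal sketch of add-one (Laplacian) smoothing, reusing the counters above; the vocabulary here is just the set of words observed in the toy corpus, whereas real systems fix the vocabulary beforehand:

```python
vocab = {w for (_, w) in bigram_counts}  # toy vocabulary derived from the corpus

def p_laplace(w, h):
    """Add-one smoothing: every count, seen or unseen, gets one extra 'cent'."""
    return (bigram_counts[(h, w)] + 1) / (history_counts[h] + len(vocab))

print(p_laplace("sat", "the"))  # > 0 even though C("the sat") = 0
```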

There are many ways to distribute the tax from the rich to the poor. Backoff is one of them. If C(hw)=0, the collected tax is given to P(w|h) by taking C(h^- w), the count under a shortened history h^-, into consideration. Precisely, P(w|h) is computed by the following recursive rule:

P_{\text{backoff}}(w | h) = \begin{cases} P^*(w | h) & C(hw)>0 \\ \alpha(h) P_{\text{backoff}}(w|h^-) & \text{otherwise} \end{cases}

where h^- is h with the oldest word removed, P^* is the discounted probability (computed from discounted counts), and \alpha(h) is the backoff weight (bow).

In practice, we usually compute P^*(w|h) in an offline process, but compute P_{\text{backoff}}(w|h) as an online procedure.
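
A minimal sketch of the online lookup, assuming the discounted probabilities P^* and the backoff weights \alpha have already been produced by the offline step. The table layout and the numbers are invented purely for illustration:

```python
def p_backoff(w, h, p_star_table, alpha_table):
    """Recursive backoff lookup.

    p_star_table maps (h, w) -> P*(w|h) for n-grams with C(hw) > 0.
    alpha_table  maps h -> backoff weight alpha(h), computed offline.
    h is a tuple of words (oldest to newest); () denotes the unigram level.
    """
    if (h, w) in p_star_table:           # C(hw) > 0: use the discounted probability
        return p_star_table[(h, w)]
    if not h:                            # ran out of history: nothing to back off to
        return 0.0                       # (real toolkits reserve mass for unknown words)
    h_minus = h[1:]                      # drop the oldest word of the history
    return alpha_table.get(h, 1.0) * p_backoff(w, h_minus, p_star_table, alpha_table)

# Offline artifacts with invented numbers (a bigram model over history tuples):
p_star_table = {(("the",), "cat"): 0.5, ((), "cat"): 0.3, ((), "sat"): 0.2}
alpha_table = {("the",): 0.7}

print(p_backoff("sat", ("the",), p_star_table, alpha_table))  # 0.7 * 0.2 ≈ 0.14
```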

However, before we can compute P_{\text{backoff}}(w|h) online, we need to know the backoff weight \alpha(h), which should have been computed offline. The derivation of \alpha(h) is based on the fact that

\sum_w P_{\text{backoff}}(w|h) \equiv 1

Without loss of generality, let us assume that the vocabulary consists of 4 words, V=\{a, b, c, d\}, and that for a given history h, C(ha)>0 and C(hb)>0, but C(hc)=0 and C(hd)=0. In this example, we have

P_{\text{backoff}}(a|h) = P^*(a|h)
P_{\text{backoff}}(b|h) = P^*(b|h)
P_{\text{backoff}}(c|h) = \alpha(h) P_{\text{backoff}}(c|h^-)
P_{\text{backoff}}(d|h) = \alpha(h) P_{\text{backoff}}(d|h^-)

Since the left column sums to 1, we have
\alpha(h) = \frac{1- \left[ P^*(a|h) + P^*(b|h) \right]}{ P_{\text{backoff}}(c|h^-) + P_{\text{backoff}}(d|h^-)}

However, this equation is not good enough yet, because the two terms in the denominator, P_{\text{backoff}}(c|h^-) and P_{\text{backoff}}(d|h^-), are still unknown. The clue is that the lower-order distribution must also sum to 1:

P_{\text{backoff}}(a|h^-) + P_{\text{backoff}}(b|h^-) + P_{\text{backoff}}(c|h^-) + P_{\text{backoff}}(d|h^-) = 1

Moreover, C(ha)>0 implies C(h^- a)>0 (an n-gram can only be seen if its shorter suffix has been seen), so P_{\text{backoff}}(a|h^-) = P^*(a|h^-), and likewise for b. This helps us write the head-scratching denominator as

P_{\text{backoff}}(c|h^-) + P_{\text{backoff}}(d|h^-) = 1 - \left[ P^*(a|h^-) + P^*(b|h^-) \right]

This gives us

\alpha(h) = \frac{1- \left[ P^*(a|h) + P^*(b|h) \right]}{ 1 - \left[ P^*(a|h^-) + P^*(b|h^-) \right]}

Writing it in a general form, we get

\alpha(h) = \frac{1- \sum_{w: C(hw)>0} P^*(w|h)}{1- \sum_{w: C(hw)>0} P^*(w|h^-)}
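
The general formula translates directly into code. Here is a minimal sketch, assuming we already have the discounted probabilities for the words seen after h; the numbers are invented just to check that \sum_w P_{\text{backoff}}(w|h) comes out as 1:

```python
def backoff_weight(seen_words, p_star_h, p_star_h_minus):
    """alpha(h) = (1 - sum of P*(w|h)) / (1 - sum of P*(w|h^-)), over w with C(hw) > 0."""
    num = 1.0 - sum(p_star_h[w] for w in seen_words)
    den = 1.0 - sum(p_star_h_minus[w] for w in seen_words)
    return num / den

# The four-word example V = {a, b, c, d}: only a and b are seen after h.
p_star_h = {"a": 0.5, "b": 0.3}                        # discounted P*(.|h), invented
p_bo_hm  = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}    # P_backoff(.|h^-), invented; sums to 1

# For a and b, P_backoff(.|h^-) equals P*(.|h^-), so p_bo_hm doubles as the denominator table.
alpha = backoff_weight(["a", "b"], p_star_h, p_bo_hm)

total = p_star_h["a"] + p_star_h["b"] + alpha * (p_bo_hm["c"] + p_bo_hm["d"])
print(round(total, 10))  # 1.0: P_backoff(.|h) is properly normalized
```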

Note that the summation condition in the denominator is identical to that in the numerator. The difference between the numerator and the denominator lies only in h vs. h^-.