I like Peter Norvig’s post on spelling correction (http://norvig.com/spell-correct.html) very much. The proposed algorithm aims to maximize the posterior probability:

c* = argmax_c P(c|w) = argmax_c P(w|c) P(c)

where P(w|c) is defined as

P(w|c) = L_i,  where i = edit_distance(w, c)    (1)

and developers should set L_0 > L_1 > L_2 > … .
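To make (1) concrete, here is a minimal sketch in Python; the level values L = [1.0, 0.1, 0.01] and the toy prior are illustrative assumptions, not estimates from data.

```python
# A sketch of scoring with equation (1): P(w|c) = L_i, i = edit_distance(w, c),
# combined with a prior P(c). All numbers here are illustrative.

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

L = [1.0, 0.1, 0.01]  # L_0 > L_1 > L_2, as required; hypothetical values

def score(w, c, prior):
    """P(w|c) * P(c), with P(w|c) taken as 0 beyond edit distance 2."""
    i = edit_distance(w, c)
    return L[i] * prior.get(c, 0.0) if i < len(L) else 0.0

prior = {"spelling": 0.7, "spilling": 0.3}  # toy P(c)
best = max(prior, key=lambda c: score("speling", c, prior))  # -> "spelling"
```

With these numbers, "spelling" wins because it is one edit away (level 0.1) while "spilling" needs two edits (level 0.01).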

In his post and book chapter (http://norvig.com/ngrams/), Peter also analyzed some bad cases, which can be solved by employing a more detailed parameterization of P(w|c) that takes into consideration the 4 sub-cases (inserts, deletes, transposes, and replaces) under the case edit_distance(w,c) = 1. Similarly, there are 16 sub-cases given edit_distance(w,c) = 2, and so on:

         { L_0                      edit_distance(w, c) = 0
P(w|c) = { L_{1,j},  j in O         edit_distance(w, c) = 1    (2)
         { L_{2,j},  j in O^2       edit_distance(w, c) = 2
         { ...

where the set O = { inserts, deletes, transposes, replaces }.

This parameterization can incorporate more human knowledge than (1), provided developers know how to set the L_{i,j}’s, which, however, is not trivial.
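To illustrate how the sub-cases in (2) could be detected, here is a rough sketch for edit_distance(w, c) = 1; the per-type levels in L1 are made-up placeholders for the L_{1,j}’s a developer would otherwise have to choose by hand.

```python
# A sketch of the sub-case parameterization in (2): classify which single
# edit turns the intended word c into the typed word w, then look up a
# per-type level L_{1,j}. Level values below are illustrative assumptions.

def edit_type(w, c):
    """Return the single edit turning c into w, or None if it is not one edit."""
    if len(w) == len(c) + 1:        # w has one extra letter
        if any(w[:k] + w[k + 1:] == c for k in range(len(w))):
            return "inserts"
    elif len(w) == len(c) - 1:      # w dropped one letter of c
        if any(c[:k] + c[k + 1:] == w for k in range(len(c))):
            return "deletes"
    elif len(w) == len(c):
        diff = [k for k in range(len(w)) if w[k] != c[k]]
        if len(diff) == 1:
            return "replaces"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and w[diff[0]] == c[diff[1]] and w[diff[1]] == c[diff[0]]):
            return "transposes"
    return None

L1 = {"inserts": 0.08, "deletes": 0.12,      # hypothetical L_{1,j} values,
      "transposes": 0.15, "replaces": 0.10}  # one per sub-case j in O

def likelihood(w, c):
    """P(w|c) under (2), restricted to edit distance 0 or 1."""
    if w == c:
        return 1.0                  # L_0
    t = edit_type(w, c)
    return L1.get(t, 0.0) if t else 0.0
```

For example, "teh" typed for "the" is classified as a transpose, while "acress" typed for "across" is a replace.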

A solution is to gather this knowledge from data rather than from developers. This can be achieved by training a click-through rate (CTR) prediction model, as is done in search engines. Consider that many spell correctors, e.g., those in Microsoft Word or ispell for Emacs, give users multiple choices and let them choose the best correction. This is similar to users clicking on search results.

By introducing the action of clicking, denoted by r, to indicate whether w and c are related, we factorize the causality between w and c as follows:

(i)  w <---------- c       Bayesian network representing P(w|c)

(ii) w ---> r <--- c       Bayesian network representing P(r|w,c)

Given this factorization, we reformulate the objective:

P(c|w) = sum_r P(r,c|w) ∝ sum_r P(r|c,w) P(c)

Intuitively, if users do not choose c given w, c and w are not likely related, so only the r = 1 term matters. This allows us to simplify the above equation to

P(c|w) ∝ P(r=1|c,w) P(c)    (3)

By considering each sub-case in (2) a “feature” or “reason” t that affects whether users click c given w, we have the following factorization

w ---> t <--- c
       |
       v
       r

and the objective function becomes

P(c|w) ∝ P(r=1|c,w) P(c) = P(c) sum_t [ P(r=1|t) P(t|c,w) ]    (4)

Factors in this equation can be estimated as follows:

1. P(c) can be estimated from a training corpus, as Peter Norvig did in his post.

2. P(t|c,w) can be computed using Peter’s candidate-generation algorithm.

3. P(r=1|t) can be estimated from the impression and click logs generated by spelling correctors.
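Putting steps 1–3 together, a toy ranker for (4) might look like the following; the prior, the click rates, and the assumption that the candidate generator tags each c with a single edit type t (making P(t|c,w) a point mass) are all simplifications.

```python
# A toy end-to-end ranker for equation (4). It assumes the three estimated
# factors from steps 1-3: a prior P(c) from a corpus, a candidate generator
# that tags each candidate c with its edit type t, and per-type click rates
# P(r=1|t) as if read from hypothetical impression/click logs.

P_c = {"the": 0.05, "they": 0.01, "three": 0.002}  # toy corpus prior
P_click = {"inserts": 0.30, "deletes": 0.45,       # toy P(r=1|t); made-up
           "transposes": 0.55, "replaces": 0.40}   # stand-ins for log data

def rank(w, candidates):
    """Order candidates by P(c) * P(r=1|t), which is equation (4) when
    P(t|c, w) puts all mass on the one edit type linking w and c."""
    scored = [(P_c.get(c, 0.0) * P_click[t], c) for c, t in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

# "thre" could come from "the" by an insert or from "three" by a delete.
ranking = rank("thre", [("the", "inserts"), ("three", "deletes")])
```

Here "the" outranks "three" because its much larger prior outweighs the lower click rate of the insert sub-case.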

The parameterization in (4) can also explain click models used in Web search engines, click-through rate prediction in online advertising, and collaborative filtering in recommender systems.