Much of the data we handle in our day-to-day engineering is binomial. There are many ways to smooth such data. My colleague, Rick Jin, recently referred me to a paper, Click-Through Rate Estimation for Rare Events in Online Advertising, by Xuerui Wang and others, which proposes a novel solution:
Suppose the data are samples drawn from binomial distributions. If we put a beta prior on the binomial parameter, the conjugacy between the beta and the binomial lets us integrate the binomial parameter out of the likelihood. The log-likelihood then becomes a function of the beta parameters only, not of the binomial parameter. Maximizing it yields an optimal choice of beta parameters. Because the beta distribution is the two-dimensional case of the Dirichlet distribution, the learned beta parameters serve as the optimal Dirichlet smoothing factors for the binomial data.
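To make the idea concrete, here is a minimal sketch of it. It assumes each item i has c_i successes out of n_i trials and that all items share one beta prior with parameters alpha and beta. After integrating out the binomial parameter, the log-likelihood is, up to a constant that does not depend on alpha or beta, the sum over i of log B(c_i + alpha, n_i - c_i + beta) - log B(alpha, beta), where B is the beta function. The function names, the made-up data, and the crude derivative-free search are my own illustration, not necessarily what the paper does.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_likelihood(alpha, beta, data):
    """Log-likelihood of (successes, trials) pairs with the binomial
    parameter integrated out; the binomial coefficients are dropped
    because they do not depend on alpha or beta."""
    total = 0.0
    for c, n in data:
        total += log_beta(c + alpha, n - c + beta) - log_beta(alpha, beta)
    return total


def fit_beta_prior(data, alpha=1.0, beta=1.0, step=1.0, tol=1e-6, max_iter=2000):
    """Crude pattern search over (log alpha, log beta).
    An illustrative optimizer only, chosen to avoid extra dependencies."""
    x = [math.log(alpha), math.log(beta)]
    best = log_marginal_likelihood(alpha, beta, data)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in (0, 1):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                if abs(trial[i]) > 20:  # keep exp() in a safe range
                    continue
                value = log_marginal_likelihood(
                    math.exp(trial[0]), math.exp(trial[1]), data
                )
                if value > best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2  # no move helped, so refine the step size
    return math.exp(x[0]), math.exp(x[1])


def smoothed_rate(c, n, alpha, beta):
    """Posterior mean of the binomial parameter under a Beta(alpha, beta) prior."""
    return (c + alpha) / (n + alpha + beta)


# Made-up (clicks, impressions) pairs, purely for illustration.
data = [(1, 100), (5, 200), (0, 50), (30, 300),
        (2, 1000), (12, 150), (40, 400), (3, 500)]
alpha_hat, beta_hat = fit_beta_prior(data)
for c, n in data:
    print(c, n, c / n, smoothed_rate(c, n, alpha_hat, beta_hat))
```

The smoothed rate pulls each raw ratio toward the prior mean alpha / (alpha + beta), and the pull is stronger for items with few trials. Note that if the data are not more dispersed than a plain binomial, the maximum may not exist at finite alpha and beta, which is why the search above is bounded.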
This looks like a perfect solution! Is it optimal in general? I do not think so. The word “optimal” in the paragraph above holds only under the assumption that the beta prior is the correct choice. In fact, as the authors note, other distributions can also be used as the prior over the binomial parameter. The beta is chosen because its mathematical property (conjugacy with the binomial) leads to efficient computation, NOT because it optimally describes the process that generated the data.
Anyway, I think this method is worth trying. Maybe the beta does describe the generation process pretty well.