Regularizations for General Linear Regression

September 28, 2010

The lasso [Tibshirani, 1996] refers to L1-regularization of linear models:

\Omega_\lambda(\vec{w}) = \lambda \sum_{i=1}^D |w_i|

In group lasso [Yuan and Lin 2006], the D predictors are divided into G groups, and the g-th group has D_g predictors. The regularization term becomes:

\Omega_\lambda(\vec{w}) = \lambda \sum_{g=1}^G \sqrt{D_g} ||\vec{w}_g||_2,

where ||\cdot||_2 denotes the L2-norm: ||\vec{x}||_2 =  \sqrt{\sum x_i^2}.  So, when all D_g‘s are 1, group lasso becomes lasso.

A further extension to group lasso is sparse group lasso:

\Omega_{\lambda,r}(\vec{w}) =\lambda \sum_{l=1}^G \sqrt{D_g}||\vec{w}_g||_2 + r_g |\vec{w}_g|

But in practice, we may not need this very complex regularization.


Is Word Segmentation Parallelizable?

September 12, 2010

I know little about NLP, I read nothing about word segmentation algorithm; however, given my knowledge on hidden markov model and high-order markov model, I though word segmentation is, given

  1. a sequence of characters S,
  2. a set of words (each word is a sequence of characters), and
  3. a language model (first-order transitions between words) P(w_j|w_i)

to infer n (n\geq{}0) segments in S, and get \{w_1,\ldots,w_n\}.

If the optimal segmentation is the one maximizing O=P(w_1,\ldots,w_n), then it can be solved using dynamic programming strategy.  Use the following notations:

  • M: the maximal length of a word,
  • L(S): the length of string S,
  • sos: the special word denoting the start of a sentence,
  • eos: the special word denoting the end of a sentence.

the dynamic programming solution is as the following recursive process:

  1. Start: \max_{1\leq l \leq M} P(S_{0,l-1} | sos) O(S_{l,L(S)} | S_{0,l-1})
  2. Recursion: O(S|w) = \max_{1\leq l \leq M} P(S_{0,l-1}|w) O(S_{l,L(S)} | S_{0,l-1})
  3. Stop: if l>L(S), then S_{l,L(S)}\equiv eos.

As a tail-recursion, this recursive process can be expressed as an equivalent iteration form: starting by checking the M possible words near the end of the sentence, for each of them, check possible word segmentation at the end of rest sentence.

Alternatively, above procedure can be expressed in an equivalent form: segment the first word at the end of the sentence. This alternative expression results in an iteration starting word segmentation from the beginning of the sentence.

If each processor has the whole language model, it would be trivial to parallelize above procedure. For example, in the Start step, we need to compute M branches before taking \max. These M computations can be distributed on M processors.

But if the language model is huge and is distributed on multiple processors, it seems difficult (?) to divide-and-conquer the segmentation over these processors with an efficient sharding of both computational structural (iteration steps) and storage structural (parts of language model).

The final question is: usually the sentence S is not long, it might not be worth to distribute it over processors for distributed segmentation.


在Mac OS X上运行自由门

September 10, 2010

在GFW Blog 上有一篇文章,介绍如何在Mac OS X上运行自由门。那个方法需要安装XCode和MacPorts,然后让MacPorts调用XCode编译WINE的源码。其实实在不需要这么复杂,因为有不止一个的为Mac OS X编译好了的WINE,比如WineBottler。利用这些编译好的WINE for Mac OS X,安装就很简单了:

  1. 下载和安装WineBottler:一个for Mac OS X的WINE prebuild package。
    下载地址:http://winebottler.kronenberg.org/
  2. 下载自由门7.04版(fg704p.exe)。
  3. 下载mfc42.dll,放到fg704p.exe同一个目录。
    下载地址:http://dongtaiwang.com/loc/software/dll/mfc42.dll
  4. 在Finder里右键点击fg704p.exe,选择“用Wine打开”。

Confidence intervals for parameters of normal distribution

September 9, 2010

这是我在阅读 Lecture 5 of MIT OpenCourse on Statistics for Applications 的笔记,但是内容多于这个Lecture,因为还包括了同事李玉龙教我的一些内容。

为了估计一个正态分布 N(\mu,\sigma^2) 的参数:\mu\sigma^2,我们通常需要一些样本 X_1,\ldots,X_n\sim N(\mu,\sigma),然后用最大似然估计(MLE):

\hat{\mu}=\bar{X}\rightarrow\mu, \quad \hat{\sigma^2}=\bar{X^2}-(\bar{X})^2\rightarrow\sigma^2

然而估计都是可能有错的。而confidence interval(置信区间)是比估计更通用的一个概念——正确的估计在这个区间中的概率是 \alpha (比如\alpha=95\%)。

那么 \hat{\mu},\hat{\sigma^2} 的置信区间分别是什么呢?

定义1 :If X_i\sim N(0,1), then \sum_{i=1}^n X_i^2 \sim \chi^2(n)

联想1:If X_i\sim N(\mu,\sigma^2), then \sum_{i=1}^n X_i \sim N(\mu,\sigma^2/n)

定理1:If X_i\sim N(0,1), then
\sqrt{n}\bar{X}\sim N(0,1) and n(\bar{X^2}-(\bar{X})^2)\sim\chi^2(n-1)

推论1:If X_i\sim N(\mu,\sigma^2), then
(1) \frac{\sqrt{n}}{\sigma}(\bar{X}-\mu)\sim N(0,1)
(2) \frac{n\hat{\sigma^2}}{\sigma^2}\sim\chi^2(n-1)

上面(1)和(2)中,N(0,1)\chi^2(n) 的函数形状都是 x 轴上方隆起一个包。而置信度为 \alpha 的置信区间就是裁掉函数形状左端和右端面积分别为 (1-\alpha)/2 的两块,剩下的中间这段区间 [c_1,c_2]。其中 c_1, c_2可以通过查函数表得到。

根据(2),给定一组采样,n, \chi^2(n), \hat{\sigma^2} 都是已知的,只有我们关心的 \sigma^2 是未知的。所以根据 c_1 \leq \frac{n\hat{\sigma^2}}{\sigma^2} \leq c_2 就可以得到 \frac{n\hat{\sigma^2}}{c_2} \leq \sigma^2 \leq \frac{n\hat{\sigma^2}}{c_1}

但是在(1)中,有两个未知参数 \mu\sigma。如果我们能假设 \sigma 是已知的,那么我们可以查 N(0,1) 的置信区间函数表,用上面类似的办法得到 \mu 的置信区间。可是现实中很难假设 \sigma 已知。此时我们得借用 Student-t 分布的概念和一些数学变形技巧

定义2:If Y_0, Y_1,\ldots,Y_n\sim N(0,1), then Y_0 / \sqrt{\frac{1}{n} (Y_1^2+\ldots+Y_n^2)} \sim t(n)

因为(1)所以可以有 Y_0=\frac{\sqrt{n}}{\sigma}(\bar{X}-\mu);因为(2),所以可以有 \frac{n\hat{\sigma^2}}{\sigma^2}=Y_1+\ldots+Y_n。代入定义2,得到

(3)  \frac{\sqrt{n-1}}{\hat{\sigma}}(\bar{X}-\mu)\sim t(n-1)

这里只有 \mu 是未知的,所以可以通过查表 t(n-1) 得到 c,并且有 -c\leq\frac{\sqrt{n-1}}{\hat{\sigma}}(\bar{X}-\mu)\leq c;变形得到 \bar{X}-c\frac{\hat{\sigma}}{\sqrt{n-1}}\leq \mu \leq\bar{X}+c\frac{\hat{\sigma}}{\sqrt{n-1}}


Learning Ranking Using Probit Regression

September 6, 2010

This SIGIR 2006 poster shows that the learning of a ranking function, given pairwise preferences over a known set of document ids, can be formalized as a specialized probit regression problem.  The idea comes from this 2003 paper, which in addition explained how to solve this specialized probit regression problem using existing statistical software.  (Though, in the case of Web page ranking, I guess no exiting statistical software can be employed.)

For each query, the relevance of document i is denoted by \theta_i, and given some pairs of

y_t = \begin{cases} 1 & d_{k,1} \geq d_{k,2} \\ 0 & d_{k,1} < d_{k,2} \end{cases}

We can estimate \theta_i's by maximizing the likelihood function

L(\Theta)=\prod_{k=1}^K P(\theta_{k,1} \geq \theta_{k,2})^{y_k} \left[1-P(\theta_{k,1}\geq \theta_{k,2})\right]^{1-y_i}

If we assume that the relevance weights are distributed normally with mean \theta_i and variance \frac{1}{2}, then the cumulative density function P(\theta_{k,1}\geq\theta_{k,2}) becomes \Phi(\theta_{k,1}-\theta_{k,2}), the normal cumulative distribution, and the likelihood becomes:

L(\Theta)=\prod_{k=1}^K \Phi(\theta_{k,1}-\theta_{k,2})^{y_k} \left[1-\Phi(\theta_{k,1}-\theta_{k,2})\right]^{1-y_i}

However, my point is that this method is hard to implement in search engine, as it makes use of document ids and clicks on documents, there exist a huge number of parameters (equals to the number of query-document pairs), and relatively a much much less training data set.


Follow

Get every new post delivered to your Inbox.