## Regularizations for General Linear Regression

September 28, 2010

The lasso [Tibshirani, 1996] refers to L1-regularization of linear models:

$\Omega_\lambda(\vec{w}) = \lambda \sum_{i=1}^D |w_i|$

In group lasso [Yuan and Lin, 2006], the D predictors are divided into G groups, and the g-th group has $D_g$ predictors. The regularization term becomes:

$\Omega_\lambda(\vec{w}) = \lambda \sum_{g=1}^G \sqrt{D_g} ||\vec{w}_g||_2$,

where $||\cdot||_2$ denotes the L2-norm: $||\vec{x}||_2 = \sqrt{\sum x_i^2}$.  So, when every $D_g$ is 1, group lasso reduces to lasso.

A further extension to group lasso is sparse group lasso:

$\Omega_{\lambda,r}(\vec{w}) = \lambda \sum_{g=1}^G \sqrt{D_g}\,||\vec{w}_g||_2 + r\,||\vec{w}||_1$

But in practice, we may not need such a complex regularization.
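The three penalties above can be sketched as plain Python functions; this is only an illustration of the formulas (the function names are my own), not a fitting procedure:

```python
import math

def lasso_penalty(w, lam):
    """Lasso: lam * sum_i |w_i|."""
    return lam * sum(abs(wi) for wi in w)

def group_lasso_penalty(groups, lam):
    """Group lasso: lam * sum_g sqrt(D_g) * ||w_g||_2.

    `groups` is a list of weight sub-vectors, one per group."""
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(wi * wi for wi in g))
        for g in groups
    )

def sparse_group_lasso_penalty(groups, lam, r):
    """Sparse group lasso: the group lasso term plus an L1 term."""
    return group_lasso_penalty(groups, lam) + r * sum(
        abs(wi) for g in groups for wi in g
    )
```

With every group of size 1, `group_lasso_penalty` gives exactly the lasso penalty, matching the reduction noted above.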

## Is Word Segmentation Parallelizable?

September 12, 2010

I know little about NLP and have read nothing about word segmentation algorithms; however, given my knowledge of hidden Markov models and higher-order Markov models, I thought word segmentation is, given

1. a sequence of characters $S$,
2. a set of words (each word is a sequence of characters), and
3. a language model (first-order transitions between words) $P(w_j|w_i)$

to infer $n (n\geq{}0)$ segments in $S$, and get $\{w_1,\ldots,w_n\}$.

If the optimal segmentation is the one maximizing $O=P(w_1,\ldots,w_n)$, then it can be solved by dynamic programming.  Use the following notation:

- $M$: the maximal length of a word,
- $L(S)$: the length of string $S$,
- $sos$: the special word denoting the start of a sentence,
- $eos$: the special word denoting the end of a sentence.

The dynamic programming solution is the following recursive process:

1. Start: $O = \max_{1\leq l \leq M} P(S_{0,l-1} | sos)\, O(S_{l,L(S)} | S_{0,l-1})$
2. Recursion: $O(S|w) = \max_{1\leq l \leq M} P(S_{0,l-1}|w) O(S_{l,L(S)} | S_{0,l-1})$
3. Stop: if $l>L(S)$, then $S_{l,L(S)}\equiv eos$.

As a tail recursion, this process has an equivalent iterative form: start by checking the M possible words near the end of the sentence; for each of them, check the possible word segmentations of the rest of the sentence.

Alternatively, the above procedure can be expressed in an equivalent form that segments the first word at the end of the sentence; this results in an iteration that starts segmentation from the beginning of the sentence.

If each processor holds the whole language model, it would be trivial to parallelize the above procedure. For example, the Start step computes M branches before taking $\max$; these M computations can be distributed across M processors.

But if the language model is huge and is distributed across multiple processors, it seems difficult (?) to divide-and-conquer the segmentation over these processors with an efficient sharding of both the computational structure (iteration steps) and the storage structure (parts of the language model).

A final question: since the sentence $S$ is usually not long, it might not be worth distributing a single sentence over processors for segmentation.

## Running Freegate on Mac OS X

September 10, 2010

1. Download and install WineBottler, a prebuilt WINE package for Mac OS X.
Download: http://winebottler.kronenberg.org/
2. Download Freegate version 7.04 (fg704p.exe).
3. Download mfc42.dll and put it in the same directory as fg704p.exe.
Download: http://dongtaiwang.com/loc/software/dll/mfc42.dll
4. In Finder, right-click fg704p.exe and choose "Open with Wine".

## Confidence intervals for parameters of normal distribution

September 9, 2010

$\hat{\mu}=\bar{X}\rightarrow\mu, \quad \hat{\sigma}^2=\overline{X^2}-(\bar{X})^2\rightarrow\sigma^2$

For $X_i \sim N(0,1)$: $\sqrt{n}\,\bar{X}\sim N(0,1)$ and $n(\overline{X^2}-(\bar{X})^2)\sim\chi^2(n-1)$.  In general:

(1) $\frac{\sqrt{n}}{\sigma}(\bar{X}-\mu)\sim N(0,1)$
(2) $\frac{n\hat{\sigma^2}}{\sigma^2}\sim\chi^2(n-1)$

(3)  $\frac{\sqrt{n-1}}{\hat{\sigma}}(\bar{X}-\mu)\sim t(n-1)$
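Pivot (3) gives a confidence interval for $\mu$ directly. A minimal sketch in stdlib-only Python; since the standard library has no $t$ quantile function, the caller must supply the $t(n-1)$ critical value for the desired level (e.g. 2.776 for a 95% interval with $n=5$):

```python
import math

def t_confidence_interval(xs, t_crit):
    """Two-sided CI for mu from the pivot
    sqrt(n-1)/sigma_hat * (xbar - mu) ~ t(n-1)."""
    n = len(xs)
    xbar = sum(xs) / n
    # MLE variance estimate: mean of squares minus square of the mean
    sigma2_hat = sum(x * x for x in xs) / n - xbar * xbar
    half = t_crit * math.sqrt(sigma2_hat) / math.sqrt(n - 1)
    return xbar - half, xbar + half
```

For `xs = [1, 2, 3, 4, 5]` and `t_crit = 2.776`, the interval is centered at 3 with half-width $2.776\sqrt{2}/2 \approx 1.96$.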

## Learning Ranking Using Probit Regression

September 6, 2010

This SIGIR 2006 poster shows that the learning of a ranking function, given pairwise preferences over a known set of document ids, can be formalized as a specialized probit regression problem.  The idea comes from this 2003 paper, which in addition explained how to solve this specialized probit regression problem using existing statistical software.  (Though, in the case of Web page ranking, I guess no existing statistical software can be employed.)

For each query, the relevance of document $i$ is denoted by $\theta_i$.  Given $K$ pairs of documents $(d_{k,1}, d_{k,2})$ with preference labels

$y_k = \begin{cases} 1 & d_{k,1} \geq d_{k,2} \\ 0 & d_{k,1} < d_{k,2} \end{cases}$

we can estimate the $\theta_i$'s by maximizing the likelihood function

$L(\Theta)=\prod_{k=1}^K P(\theta_{k,1} \geq \theta_{k,2})^{y_k} \left[1-P(\theta_{k,1}\geq \theta_{k,2})\right]^{1-y_k}$

If we assume that the relevance weights are distributed normally with mean $\theta_i$ and variance $\frac{1}{2}$, then the difference of two such weights has variance 1, so $P(\theta_{k,1}\geq\theta_{k,2})$ becomes $\Phi(\theta_{k,1}-\theta_{k,2})$, where $\Phi$ is the standard normal cumulative distribution function, and the likelihood becomes:

$L(\Theta)=\prod_{k=1}^K \Phi(\theta_{k,1}-\theta_{k,2})^{y_k} \left[1-\Phi(\theta_{k,1}-\theta_{k,2})\right]^{1-y_k}$

However, my point is that this method is hard to implement in a search engine: it relies on document ids and clicks on documents, the number of parameters is huge (equal to the number of query-document pairs), and the training data set is relatively much smaller.
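To make the likelihood concrete, here is a toy sketch that maximizes the log of $L(\Theta)$ by gradient ascent, using $\Phi$ built from `math.erf`. This is my own illustration under hypothetical data, not the solver the papers describe:

```python
import math

def phi(x):      # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):      # standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def fit_probit_ranking(n_docs, pairs, lr=0.1, steps=500):
    """Gradient ascent on the pairwise probit log-likelihood.

    `pairs` is a list of (i, j, y): y = 1 iff document i was preferred
    over document j."""
    theta = [0.0] * n_docs
    for _ in range(steps):
        grad = [0.0] * n_docs
        for i, j, y in pairs:
            d = theta[i] - theta[j]
            # d/dd of y*log(Phi(d)) + (1-y)*log(1-Phi(d))
            g = y * phi(d) / Phi(d) - (1 - y) * phi(d) / (1 - Phi(d))
            grad[i] += g
            grad[j] -= g
        theta = [t + lr * g for t, g in zip(theta, grad)]
        theta = [t - theta[0] for t in theta]  # pin theta_0 = 0 (identifiability)
    return theta

# Toy preferences: doc 1 beats doc 0, doc 2 beats doc 1.
theta = fit_probit_ranking(3, [(1, 0, 1), (2, 1, 1)])
```

The recovered scores order the documents as $\theta_0 < \theta_1 < \theta_2$, consistent with the preferences. Note that with only one-sided preferences the likelihood has no finite maximizer (the differences can grow without bound), which is exactly why the parameter-count concern above matters in practice.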