## Graphical SVN Diff

July 30, 2010
    sudo apt-get install meld
    svn diff --diff-cmd meld mrml_mappers_and_reducers.cc
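
To make this setting permanent (an optional tweak, not in the original note), you can point `diff-cmd` at meld in the `[helpers]` section of `~/.subversion/config`:

    # ~/.subversion/config
    [helpers]
    diff-cmd = meld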


## Backoff in N-Gram Language Models

July 28, 2010

My colleagues Jian Zhu, Rick Jin, and Charlie Yan taught me something interesting about NLP. What follows is part of what I learned from them.

An n-gram model can be computed easily by counting the transitions from a history $h$ to a word $w$ and normalizing the counts. This method is known as maximum likelihood estimation (MLE).

However, unless we have a training corpus large enough to make the ML estimate confident, our model runs into an embarrassing situation whenever a query $hw$ has never appeared in the training corpus, a case usually denoted by $C(hw)=0$.

$C(hw)=0$ is embarrassing because “unseen” does not mean “impossible”: that we do not see a transition from $h$ to $w$ in one corpus does not mean the transition will not happen in other corpora.

An often-used solution to this embarrassment is “smoothing”: for every $h$, we adjust the function $P(w|h)$ over $w$ so that those $w$’s with $P(w|h)=0$ (i.e., $C(hw)=0$) get a value a little larger than 0. Where do these little values come from? One answer is to reduce the $P(w|h)$ of those $w$’s with $P(w|h)>0$. This step is known as discounting. The collected mass can then be distributed to the $w$’s with $P(w|h)=0$. This whole discounting-based smoothing operation is like taxing the rich and distributing the money to the poor (those with 0 cents).
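
As a concrete example (absolute discounting, a standard scheme that the original note does not spell out), one subtracts a fixed $0 < d < 1$ from every seen count:

$P^*(w | h) = \frac{\max(C(hw) - d,\; 0)}{C(h)}$

so each seen $w$ pays a tax of $d/C(h)$, and the collected mass is what gets redistributed to the unseen words.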

Discounting is not the only smoothing strategy. A very simple and well-known method is Laplacian smoothing. It does not tax any rich people; instead, it gives everyone, rich and poor alike, a cent.
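
In its add-one form (standard textbook notation, added here for reference), Laplacian smoothing is

$P_{\text{add-1}}(w | h) = \frac{C(hw) + 1}{C(h) + |V|}$

where $|V|$ is the vocabulary size; even a $w$ with $C(hw)=0$ now receives probability $1/(C(h) + |V|)$.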

There are many ways to distribute the tax collected from the rich to the poor, and backoff is one of them. If $C(hw)=0$, the collected tax is given to $P(w|h)$ by taking the lower-order count $C(h^-w)$ into consideration. Precisely, $P_{\text{backoff}}(w|h)$ is computed by the following recursive rule:

$P_{\text{backoff}}(w | h) = \begin{cases} P^*(w | h) & c(hw)>0 \\ \alpha(h) P_{\text{backoff}}(w|h^-) & \text{otherwise} \end{cases}$

where $h^-$ is $h$ with the oldest word removed, $P^*$ is the discounted probability (computed from discounted counts), and $\alpha(h)$ is the backoff weight (bow).

In practice, we usually compute $P^*(w|h)$ in an offline process, but compute $P_{\text{backoff}}(w|h)$ as an online procedure.

However, before we can compute $P_{\text{backoff}}(w|h)$ online, we need to know the backoff weight $\alpha(h)$, which should have been computed offline. The derivation of $\alpha(h)$ is based on the fact that

$\sum_w P_{\text{backoff}}(w|h) \equiv 1$

Without loss of generality, let us assume that the vocabulary contains 4 words: $V=\{a, b, c, d\}$, where for a given history $h$, $C(ha)>0$ and $C(hb)>0$, but $C(hc)=0$ and $C(hd)=0$. In this example, we have

$P_{\text{backoff}}(a|h) = P^*(a|h)$
$P_{\text{backoff}}(b|h) = P^*(b|h)$
$P_{\text{backoff}}(c|h) = \alpha(h) P_{\text{backoff}}(c|h^-)$
$P_{\text{backoff}}(d|h) = \alpha(h) P_{\text{backoff}}(d|h^-)$

Since the left column sums to 1, and the last two right-hand sides are $\alpha(h)$ times lower-order probabilities, we have

$\alpha(h) = \frac{1- \left[ P^*(a|h) + P^*(b|h) \right]}{ P_{\text{backoff}}(c|h^-) + P_{\text{backoff}}(d|h^-)}$

However, this equation is not yet computable, because the terms in the denominator, $P_{\text{backoff}}(c|h^-)$ and $P_{\text{backoff}}(d|h^-)$, are still unknown. Another clue from the two-column equations above is that the lower-order distribution also sums to 1 over the whole vocabulary:

$P_{\text{backoff}}(a|h^-) + P_{\text{backoff}}(b|h^-) + P_{\text{backoff}}(c|h^-) + P_{\text{backoff}}(d|h^-) = 1$

This helps us rewrite the head-scratching denominator as

$P_{\text{backoff}}(c|h^-) + P_{\text{backoff}}(d|h^-) = 1 - \left[ P^*(a|h^-) + P^*(b|h^-) \right]$

where we also use the fact that $C(ha)>0$ implies $C(h^-a)>0$, so $P_{\text{backoff}}(a|h^-) = P^*(a|h^-)$, and likewise for $b$.

This gives us

$\alpha(h) = \frac{1- \left[ P^*(a|h) + P^*(b|h) \right]}{ 1 - \left[ P^*(a|h^-) + P^*(b|h^-) \right]}$

Writing it in a general form, we get

$\alpha(h) = \frac{1- \sum_{w: C(hw)>0} P^*(w|h)}{1- \sum_{w: C(hw)>0} P^*(w|h^-)}$

Note that the summation condition in the denominator is identical to that in the numerator; the numerator and denominator differ only in $h$ vs. $h^-$.
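
To make the online procedure concrete, here is a minimal C++ sketch (my illustration, not from the original derivation). It assumes the offline pass has stored $P^*$ in a map keyed by space-joined n-grams and $\alpha$ in a map keyed by histories:

    #include <map>
    #include <string>

    // Hypothetical tables filled by the offline pass:
    //   prob maps "h w" (space-joined n-gram) to P*(w|h), stored only
    //   when C(hw) > 0; bow maps "h" to the backoff weight alpha(h).
    std::map<std::string, double> prob;
    std::map<std::string, double> bow;

    // Drops the oldest word: "w1 w2 ... wk" -> "w2 ... wk".
    std::string DropOldestWord(const std::string& h) {
      std::string::size_type space = h.find(' ');
      return space == std::string::npos ? "" : h.substr(space + 1);
    }

    // P_backoff(w|h), implementing the recursive rule above.
    double PBackoff(const std::string& h, const std::string& w) {
      std::map<std::string, double>::const_iterator it =
          prob.find(h.empty() ? w : h + " " + w);
      if (it != prob.end()) {
        return it->second;  // C(hw) > 0: return the discounted P*(w|h).
      }
      if (h.empty()) {
        return 0.0;  // No more history to back off to: unseen unigram.
      }
      std::map<std::string, double>::const_iterator b = bow.find(h);
      double alpha = (b == bow.end()) ? 1.0 : b->second;
      return alpha * PBackoff(DropOldestWord(h), w);  // alpha(h) P_backoff(w|h^-)
    }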

## A Bug On Sparse Vector

July 21, 2010

I would like to thank Xing Zhou, my great workmate, who identified a bug in my code when we worked together on a machine learning tool. I know it is embarrassing, but I think writing it down here helps keep me from making the same mistake again. Here follows my code:

    SparseRealVector::const_iterator i_x = x_.begin();
    SparseRealVector::const_iterator i_new_x = new_x_.begin();

    while (i_x != x_.end() && i_new_x != new_x_.end()) {
      if (i_x->first < i_new_x->first) {
        ++i_x;
      } else if (i_new_x->first < i_x->first) {
        ++i_new_x;
      } else {                // i_x->first == i_new_x->first
        if (i_x->second * i_new_x->second < 0) {
          new_x_.set(i_new_x->first, 0);
        }
        ++i_x;
        ++i_new_x;
      }
    }


A little background about the code: we use the STL map container to hold the sparse vector, and SparseRealVector::const_iterator is just std::map::const_iterator. Unlike a plain map, when we set an element to 0, SparseRealVector::set removes that element. The bug lies in the last “else” branch. In the case of (i_x->second * i_new_x->second < 0), we execute new_x_.set(i_new_x->first, 0), which erases the element pointed to by i_new_x and leaves i_new_x a dangling iterator. In practice, i_new_x ends up pointing at the element next to the deleted one, so the following ++i_new_x moves one element further and skips an element of new_x_. So the correct code should be:

    if (i_x->second * i_new_x->second < 0) {
      // Save the next position before erasing: map iterators are
      // bidirectional, so "i_new_x + 1" would not even compile; copy
      // the iterator and pre-increment the copy instead.
      SparseRealVector::const_iterator next = i_new_x;
      ++next;
      new_x_.set(i_new_x->first, 0);
      i_new_x = next;
    } else {
      ++i_new_x;
    }
    ++i_x;

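The same pitfall exists with a bare std::map. Here is a tiny standalone demo of the save-the-next-iterator idiom (my illustration, not part of the original tool):

    #include <iostream>
    #include <map>

    int main() {
      std::map<int, double> m;
      m[1] = 1.0;
      m[2] = -2.0;
      m[3] = 3.0;

      // Erase negative entries while iterating: advance a copy of the
      // iterator first, then erase through the old one.
      for (std::map<int, double>::iterator it = m.begin(); it != m.end(); ) {
        std::map<int, double>::iterator next = it;
        ++next;
        if (it->second < 0) {
          m.erase(it);  // invalidates only the iterator to the erased element
        }
        it = next;
      }

      // Prints "1 -> 1" and "3 -> 3": element 2 is gone, nothing skipped.
      for (std::map<int, double>::iterator it = m.begin(); it != m.end(); ++it) {
        std::cout << it->first << " -> " << it->second << "\n";
      }
      return 0;
    }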

## Encoding and Decoding of The Varint32 Type Defined by Google Protocol Buffers

July 20, 2010

In an earlier post, we talked about saving multiple protocol messages in a file (or in a network communication stream). The point is that we need to record the size of each message just before the message itself. An easy solution is to format the file as { message_size, message_body }*. The C++ code in that earlier post shows how to save message_size as a plain int. However, this is far from a perfect solution, for two reasons:

1. The byte order (endianness) of an int differs across platforms. It is true that most computers nowadays use Intel CPUs with little-endian byte order, but we do not want to make an Intel CPU a requirement.
2. Most messages are small and do not need 4 bytes to store their lengths. In fact, the Google Protocol Buffers documentation suggests that a message should not be longer than 1 MB, and the Google Protocol Buffers source code defaults the maximum message size to 64 MB.

If we only want to solve the first problem, we can adopt the network byte order using the following old and familiar network API:

    #include <sys/types.h>
    #include <netinet/in.h>

    uint32_t htonl(uint32_t hostlong);
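
For example (a sketch I am adding, with hypothetical helper names), the writer converts the length to network byte order and the reader converts it back:

    #include <netinet/in.h>
    #include <stdint.h>
    #include <istream>
    #include <ostream>

    // Hypothetical helpers, not from the original post.
    void WriteSizePrefix(std::ostream* os, uint32_t size) {
      uint32_t be = htonl(size);  // host order -> network (big-endian) order
      os->write(reinterpret_cast<const char*>(&be), sizeof(be));
    }

    bool ReadSizePrefix(std::istream* is, uint32_t* size) {
      uint32_t be;
      if (!is->read(reinterpret_cast<char*>(&be), sizeof(be))) return false;
      *size = ntohl(be);  // network order -> host order
      return true;
    }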


If we want to solve both problems, a solution is to use the varint32 encoding provided by Google Protocol Buffers; this is the same variable-length encoding used on the wire for int32/uint32 fields declared in a .proto file. In the rest of this post, we introduce how to encode a C/C++ int32 value as a varint32 and how to decode it. As usual, let us start from a code snippet:

    #include <google/protobuf/io/coded_stream.h>
    #include "../base/common.hh"

    using namespace std;
    using google::protobuf::uint8;
    using google::protobuf::uint32;
    using google::protobuf::io::CodedOutputStream;

    inline const uint8* ReadVarint32FromArray(const uint8* buffer, uint32* value) {
      static const int kMaxVarintBytes = 10;
      static const int kMaxVarint32Bytes = 5;

      // Fast path:  We have enough bytes left in the buffer to guarantee that
      // this read won't cross the end, so we can skip the checks.
      const uint8* ptr = buffer;
      uint32 b;
      uint32 result;

      b = *(ptr++); result  = (b & 0x7F)      ; if (!(b & 0x80)) goto done;
      b = *(ptr++); result |= (b & 0x7F) <<  7; if (!(b & 0x80)) goto done;
      b = *(ptr++); result |= (b & 0x7F) << 14; if (!(b & 0x80)) goto done;
      b = *(ptr++); result |= (b & 0x7F) << 21; if (!(b & 0x80)) goto done;
      b = *(ptr++); result |=  b         << 28; if (!(b & 0x80)) goto done;

      // If the input is larger than 32 bits, we still need to read it all
      // and discard the high-order bits.
      for (int i = 0; i < kMaxVarintBytes - kMaxVarint32Bytes; i++) {
        b = *(ptr++); if (!(b & 0x80)) goto done;
      }

      // We have overrun the maximum size of a varint (10 bytes).  Assume
      // the data is corrupt.
      return NULL;

     done:
      *value = result;
      return ptr;
    }

    int main() {
      static uint8 buffer[5];  // a varint32 takes up to 5 bytes, not 4
      for (uint32 i = 0; i < 0xffffffff; ++i) {
        CodedOutputStream::WriteVarint32ToArray(i, buffer);
        uint32 value;
        ReadVarint32FromArray(buffer, &value);  // decode what we just encoded
        CHECK_EQ(i, value);
      }
      return 0;
    }


If you compile and run this program, it outputs nothing. That is right, because the program says something only if its encoding and decoding subroutines work inconsistently.

For each uint32 value (from 0 up to 0xffffffff), the program encodes it as a varint32, decodes the varint32 back, and checks the decoded value against the original. Clearly, CodedOutputStream::WriteVarint32ToArray does the encoding work, while the decoding is done by a manually written function, ReadVarint32FromArray. In fact, I did not write this function; I took it from the Google Protocol Buffers source file coded_stream.cc.
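
As a concrete example (the same one the Protocol Buffers documentation uses): the value 300 is 0b100101100; split from the low end into the 7-bit groups 0101100 and 0000010, and emitted low group first with the continuation bit set on every byte except the last, it becomes the two bytes 0xAC 0x02. A quick check (my snippet, reusing the conventions from above):

    #include <cstdio>
    #include <google/protobuf/io/coded_stream.h>

    using google::protobuf::uint8;
    using google::protobuf::io::CodedOutputStream;

    int main() {
      uint8 buffer[5];  // a varint32 takes at most 5 bytes
      uint8* end = CodedOutputStream::WriteVarint32ToArray(300, buffer);
      for (const uint8* p = buffer; p != end; ++p) {
        printf("%02X ", *p);  // prints: AC 02
      }
      printf("\n");
      return 0;
    }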

P.S. Google Protocol Buffer document explains how to read and write varint32 types.

## Inference of Latent Class Models

July 18, 2010

In Infinite Latent Feature Models and the Indian Buffet Process, the authors point out that the Gibbs sampling inference in a latent class model depends on the following full conditional distribution of latent class $c_i$ (Equation 17):

$P(c_i = k | C_{-i}, X) \propto p(X|C) P(c_i=k | C_{-i})$

The left-hand side of this equation is a general form. For LDA, it is $P(z_i=k | Z_{-i}, W)$.

I am interested in the right-hand side, because it looks nontrivial to compute $p(X|C)$. Do we really need to compute it? I do not think so, because $z_i$ is independent of $W_{-i}$ given $Z_{-i}$.

An example of the above statement is the derivation of LDA’s full conditional distribution, where we did the following expansion:

$P(z_i = k | Z_{-i}, W) = \frac{P(z_i, Z_{-i}, w_i, W_{-i})}{P(Z_{-i}, w_i, W_{-i})} \propto \frac{P(z_i, Z_{-i}, w_i, W_{-i})}{P(Z_{-i}, W_{-i})}$

In the right-most part of the expression, $w_i$ is dropped from the denominator: the omitted factor $P(w_i | Z_{-i}, W_{-i})$ does not depend on $z_i$, so removing it changes only the normalization constant (hence the $\propto$ above).

Continuing the derivation, we have

$\frac{P(Z, W)}{P(Z_{-i}, W_{-i})} = P(z_i, w_i | Z_{-i}, W_{-i}) = p(w_i|z_i) p(z_i | Z_{-i})$

This leads to the LDA Gibbs sampling rule appearing in the literature.  So Eqn. 17 can be rewritten as

$P(c_i = k | C_{-i}, X) \propto p(x_i | c_i) P(c_i=k | C_{-i})$
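
For reference (standard notation from the LDA literature, with symmetric Dirichlet priors $\alpha$ and $\beta$; not part of the original post), the resulting Gibbs sampling rule for LDA is

$P(z_i = k | Z_{-i}, W) \propto \frac{n^{-i}_{k,w_i} + \beta}{n^{-i}_{k} + V\beta} \left( n^{-i}_{d_i,k} + \alpha \right)$

where $n^{-i}_{k,w_i}$ is the number of times word $w_i$ is assigned to topic $k$, $n^{-i}_{k}$ is the total number of words assigned to topic $k$, $n^{-i}_{d_i,k}$ is the number of words in document $d_i$ assigned to topic $k$, $V$ is the vocabulary size, and all counts exclude position $i$.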

## Storing Multiple Protocol Messages in One File

July 7, 2010

The Protocol Buffers documentation explains the problem: “If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)”

The demo below reads messages back with

    bool Message::ParseFromIstream(std::istream* input);

and shows what goes wrong when multiple messages are written back-to-back without their sizes:

    #include <fstream>
    #include <iostream>
    #include "learn-pb.pb.h"

    using namespace std;

    int main() {
      static int kCount = 5;

      KeyValuePair kv;
      kv.set_key("hello");
      kv.set_value("word");

      ofstream os("/tmp/a");
      os.write(reinterpret_cast<const char*>(&kCount), sizeof(kCount));
      for (int i = 0; i < kCount; ++i) {
        kv.SerializeToOstream(&os);  // messages are concatenated, no sizes
      }
      os.close();

      ifstream is("/tmp/a");
      int count;
      is.read(reinterpret_cast<char*>(&count), sizeof(count));  // read the count back
      for (int i = 0; i < count; ++i) {
        cout << "Reading the " << i << "th message from file ... \n";
        kv.Clear();
        if (!kv.ParseFromIstream(&is)) {
          cerr << "Failed in parsing protocol message.\n";
        }
        if (kv.key() != "hello") {
          cerr << "kv.key() = " << kv.key() << ", but real key is hello.\n";
        }
        if (kv.value() != "word") {
          cerr << "kv.value() = " << kv.value() << ", but real value is word.\n";
        }
      }
      is.close();

      return 0;
    }


where the message KeyValuePair is defined in learn-pb.proto as:

    message KeyValuePair {
      optional string key = 1;
      optional string value = 2;
    }

Running learn-pb.exe gives:


    $ ./learn-pb.exe
    Reading the 0th message from file ...
    Reading the 1th message from file ...
    kv.key() = , but real key is hello.
    kv.value() = , but real value is word.
    Reading the 2th message from file ...
    kv.key() = , but real key is hello.
    kv.value() = , but real value is word.
    Reading the 3th message from file ...
    kv.key() = , but real key is hello.
    kv.value() = , but real value is word.
    Reading the 4th message from file ...
    kv.key() = , but real key is hello.
    kv.value() = , but real value is word.
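
The output shows what went wrong: the first ParseFromIstream call consumed the whole rest of the stream (with optional fields, the last occurrence wins, so the first read still sees "hello"/"word"), and the remaining four calls parsed an empty stream into an empty message. Below is a minimal sketch of the size-prefixed format the documentation suggests, built on CodedOutputStream/CodedInputStream (my illustration; the helper names are mine):

    #include <google/protobuf/io/coded_stream.h>
    #include "learn-pb.pb.h"

    using google::protobuf::uint32;
    using google::protobuf::io::CodedInputStream;
    using google::protobuf::io::CodedOutputStream;

    // Writes one { varint32 size, message body } record.
    bool WriteDelimited(const KeyValuePair& msg, CodedOutputStream* out) {
      out->WriteVarint32(msg.ByteSize());
      return msg.SerializeToCodedStream(out);
    }

    // Reads one record back; returns false on a clean end-of-stream.
    bool ReadDelimited(CodedInputStream* in, KeyValuePair* msg) {
      uint32 size;
      if (!in->ReadVarint32(&size)) return false;
      // Cap this parse at `size` bytes so it cannot run into the next record.
      CodedInputStream::Limit limit = in->PushLimit(size);
      bool ok = msg->ParseFromCodedStream(in) && in->ConsumedEntireMessage();
      in->PopLimit(limit);
      return ok;
    }

A CodedOutputStream can be constructed over an OstreamOutputStream wrapping the ofstream, and a CodedInputStream over an IstreamInputStream wrapping the ifstream (both from google/protobuf/io/zero_copy_stream_impl.h).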