## Pearson Correlation Coefficient, Covariance Matrix and Linear Dependency

August 29, 2010

After reading this post, I think you will find it easy to explain this picture from Wikipedia:

Several sets of (x, y) points, with the correlation coefficient of x and y for each set.

From Wikipedia,

In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by $\rho$) is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.

The definition of Pearson coefficient is

$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y}$

where $cov(X,Y)$ is the covariance between X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations.

For your reference, standard deviation is the square root of variance, and the estimate of variance from n samples $\{x_i\}$ is:

$var(X) \doteq \frac{1}{n} \sum_{i=1}^n (x_i - \mu_X)^2$

The estimate of covariance between X and Y is

$cov(X,Y) \doteq \frac{1}{n} \sum_{i=1}^n (x_i - \mu_X) (y_i - \mu_Y)$

The estimate of Pearson coefficient is

$\rho_{X,Y} \doteq \frac{\sum_{i=1}^n (x_i - \mu_X) (y_i - \mu_Y)}{\sqrt{ \sum_{i=1}^n (x_i-\mu_X)^2} \sqrt{\sum_{i=1}^n (y_i - \mu_Y)^2}}$

The denominator is simply a non-negative normalization factor; only the numerator (the covariance) measures the correlation between X and Y. So let us have a closer look at covariance.
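The estimates above translate directly into code. Here is a minimal C++ sketch (the function name is my own, not from any particular library) that computes the sample Pearson coefficient:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Estimate the Pearson coefficient of two equally-sized samples,
// following the formula above.  Returns 0 if either sample has zero
// variance (the coefficient is undefined in that case).
double PearsonCoefficient(const std::vector<double>& x,
                          const std::vector<double>& y) {
  const std::size_t n = x.size();
  assert(n == y.size() && n > 0);
  double mean_x = 0, mean_y = 0;
  for (std::size_t i = 0; i < n; ++i) { mean_x += x[i]; mean_y += y[i]; }
  mean_x /= n;
  mean_y /= n;
  double cov = 0, var_x = 0, var_y = 0;   // un-normalized; the 1/n cancels
  for (std::size_t i = 0; i < n; ++i) {
    cov   += (x[i] - mean_x) * (y[i] - mean_y);
    var_x += (x[i] - mean_x) * (x[i] - mean_x);
    var_y += (y[i] - mean_y) * (y[i] - mean_y);
  }
  if (var_x == 0 || var_y == 0) return 0;
  return cov / (std::sqrt(var_x) * std::sqrt(var_y));
}
```

For a perfectly linear sample such as $y_i = 2x_i$ it returns 1, and for a sample distributed symmetrically around its mean it returns 0.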

Consider a sample of four symmetric bivariate points $\{d_i = \langle x_i, y_i \rangle\}_{i=1}^4$: (We consider bivariate samples here only because they can be shown on the 2D screen.)

It is not hard to verify that the estimate of covariance from this sample is 0. Note that if a point lies in the 1st or 3rd quadrant (relative to the mean), then $(x_i-\mu_X)(y_i-\mu_Y)$ is positive; otherwise it is negative. So generally, given a set of points distributed symmetrically around their mean, we have covariance $cov(X,Y) \rightarrow 0$ and thus $\rho_{X,Y}\rightarrow 0$.

To make $cov(X,Y)$ or $\rho_{X,Y}$ larger than 0, we need more points in the 1st and 3rd quadrants, for example:

Similarly, to make $cov(X,Y)$ or $\rho_{X,Y}$ less than 0, we need more points in the 2nd and 4th quadrants. However large (or small) the value of $cov(X,Y)$ is, the denominator of $\rho_{X,Y}$ normalizes the result into [-1,1].

The covariance matrix of X and Y is

$C(X,Y) = \begin{bmatrix} var(X), & cov(X,Y) \\ cov(X,Y), & var(Y) \end{bmatrix}$

which contains all the factors used to compute $\rho_{X,Y}$.

As an experiment, let us construct a covariance matrix with an extremely large $cov(X,Y)$ given $var(X)=var(Y)=1$:

$C(X,Y) = \begin{bmatrix} 1, & 1 \\ 1, & 1 \end{bmatrix}$

then we sample 1000 points from the Gaussian distribution $N(d; [0,0], C(X,Y))$ and plot the result:

You see, all samples lie on the same line.

If we relax $cov(X,Y)$ to be 0.9, the result is:

If we use negative $cov(X,Y)$, say -0.9,  then we have

$C(X,Y) = \begin{bmatrix} 1, & -0.9 \\ -0.9, & 1 \end{bmatrix}$

and

All the plots above have a slope of either 1 or -1. How can we get other slope values? The answer is to change the bounding box defined by $var(X)$ and $var(Y)$. For example:

$C(X,Y) = \begin{bmatrix} 2, & \sqrt{2} \\ \sqrt{2}, & 1 \end{bmatrix}$

Note that the covariance matrix $C(X,Y)$ must be positive-definite, which means that for any non-zero vector z, $z^T C z>0$. In the bivariate case, with z=[a,b], the constraint is equivalent to

$a^2 var(X) + 2ab\, cov(X,Y) + b^2 var(Y) > 0$

If $cov(X,Y)=\pm\sqrt{var(X)}\sqrt{var(Y)}$, the left-hand side of the above inequality can be written as a perfect square and is thus only $\geq 0$. To guarantee $> 0$ for every non-zero z, we need
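To spell out the square form: substituting $cov(X,Y)=\pm\sigma_X\sigma_Y$ (recall that $var(X)=\sigma_X^2$) gives

```latex
a^2\,var(X) \pm 2ab\,\sigma_X\sigma_Y + b^2\,var(Y)
  = a^2\sigma_X^2 \pm 2ab\,\sigma_X\sigma_Y + b^2\sigma_Y^2
  = (a\,\sigma_X \pm b\,\sigma_Y)^2 \geq 0,
```

which equals 0 for the non-zero vector $z = [\sigma_Y, \mp\sigma_X]$, so in this boundary case $C$ is only positive semi-definite.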

$|cov(X,Y)| < \sqrt{var(X) var(Y)} = \sigma_X\sigma_Y$

This constrains the Pearson coefficient to [-1,1]. When $\rho_{X,Y}$ is 1 or -1, all sample points of X and Y lie on a line, and we say X and Y are linearly dependent.

Plots in this post were drawn using the following MATLAB command:

M = sample_gaussian([0,0], [2,sqrt(2);sqrt(2),1], 1000);
cla; scatter(M(:,1), M(:,2)); axis tight


where function sample_gaussian can be found in my previous post.
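For readers without the MATLAB function at hand, here is a self-contained C++ sketch of the same idea (my own illustrative code, not the sample_gaussian from the previous post): to sample from $N(0, C)$, factor $C = LL^T$ (Cholesky) and transform standard normal draws $z$ into $Lz$. For simplicity it hard-codes $C = \begin{bmatrix}1 & \rho\\ \rho & 1\end{bmatrix}$, whose Cholesky factor is $L = \begin{bmatrix}1 & 0\\ \rho & \sqrt{1-\rho^2}\end{bmatrix}$; the general 2x2 case only needs the full Cholesky factorization.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Draw n points from the bivariate Gaussian N(0, C), C = [[1, rho], [rho, 1]],
// by transforming i.i.d. standard normal draws with the Cholesky factor of C:
// if z ~ N(0, I) and C = L L^T, then L z ~ N(0, C).
std::vector<std::pair<double, double> > SampleGaussian(double rho, int n,
                                                       unsigned int seed) {
  const double l21 = rho;                         // second row of L
  const double l22 = std::sqrt(1.0 - rho * rho);
  std::mt19937 gen(seed);
  std::normal_distribution<double> normal(0.0, 1.0);
  std::vector<std::pair<double, double> > points;
  for (int i = 0; i < n; ++i) {
    double z1 = normal(gen), z2 = normal(gen);
    // x = L z, i.e. x1 = z1 and x2 = rho*z1 + sqrt(1 - rho^2)*z2.
    points.push_back(std::make_pair(z1, l21 * z1 + l22 * z2));
  }
  return points;
}
```

With rho close to 1 or -1 the sampled points cluster tightly around a line, reproducing the plots above.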

## When and Why Statistical Learning Works

August 28, 2010

This video explains briefly how Google machine translation works (by using statistical learning). It says

A computer can learn foreign language … by referring to vocabulary and a set of rules. But … When you try to capture all of these exceptions, and exceptions to the exceptions, in a computer program, the translation quality begins to break down.

This explains when and why statistical learning might be helpful, compared to human summarized rules, from the perspective of machine translation.

## Mpd Timeout Problem

August 23, 2010

When we run some MPI programs started with mpiexec, the MPI system complains:

mpiexec_XiAn-172-26-3-109 (mpiexec 392): no msg recvd from mpd when expecting ack of request

My colleague Rick Jin found the following explanation using Google:

This error may be caused by the following reason: our environment has no high-performance network. I increased the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py file and solved this issue.

In mpiexec.py, the variable MPIEXEC_RECV_TIMEOUT is set to 20 by default. We can also change the value using the command line parameter -recvtimeout. For example:

mpiexec -recvtimeout 100


Many thanks to Rick!

## Using CEDET and ECB with Emacs

August 21, 2010

(I tested the following procedure about installing and customizing CEDET for Emacs with GNU Emacs 23 under Ubuntu and Snow Leopard.)

Installation

• Install CEDET
1. Download the CEDET package
2. Unpack it to ~/.emacs.d/cedet (or any other directory)
3. Compile the package using the command
emacs -Q -nw -l cedet-build.el -f cedet-build -f save-buffers-kill-terminal
• Install external tag systems (optionally required by CEDET)
1. GNU Global, and/or
2. Exuberant Ctags
• Install ECB
1. Download the ECB package
2. Unpack it into ~/.emacs.d/ecb (or any other directory)

Customization

An excellent tutorial about customizing CEDET is here. The author also published his .emacs file, which contains a few flaws. Mine, based on his, is as follows:

;; CEDET

(global-ede-mode 'nil)                  ; do NOT use project manager

(require 'semantic-ia)          ; names completion and display of tags
(require 'semantic-gcc)         ; auto locate system include files

(require 'semanticdb)
(global-semanticdb-minor-mode 1)

(defun my-cedet-hook ()
  (local-set-key [(control return)] 'semantic-ia-complete-symbol)
  (local-set-key "\C-c>" 'semantic-complete-analyze-inline)
  (local-set-key "\C-c=" 'semantic-decoration-include-visit)
  (local-set-key "\C-cj" 'semantic-ia-fast-jump)
  (local-set-key "\C-cq" 'semantic-ia-show-doc)
  (local-set-key "\C-cs" 'semantic-ia-show-summary)
  (local-set-key "\C-cp" 'semantic-analyze-proto-impl-toggle)
  (local-set-key "\C-c+" 'semantic-tag-folding-show-block)
  (local-set-key "\C-c-" 'semantic-tag-folding-fold-block)
  (local-set-key "\C-c\C-c+" 'semantic-tag-folding-show-all)
  (local-set-key "\C-c\C-c-" 'semantic-tag-folding-fold-all))

(global-semantic-tag-folding-mode 1)

(require 'eassist)

(defun alexott/c-mode-cedet-hook ()
  (local-set-key "\C-ct" 'eassist-switch-h-cpp)
  (local-set-key "\C-xt" 'eassist-switch-h-cpp)
  (local-set-key "\C-ce" 'eassist-list-methods)
  (local-set-key "\C-c\C-r" 'semantic-symref))

;; gnu global support
(require 'semanticdb-global)
(semanticdb-enable-gnu-global-databases 'c-mode)
(semanticdb-enable-gnu-global-databases 'c++-mode)

;; ctags
(require 'semanticdb-ectag)

(global-semantic-idle-tag-highlight-mode 1)


Using CEDET

1. control return: whatever symbol you are typing, this hot key automatically completes it for you.
2. C-c?: another way to complete the symbol you are typing
3. C-c>: when you have typed . or -> after an object name, use this key to show possible public member functions or data members.
4. C-cs: show a summary of the symbol under the cursor
5. C-cq: show the documentation of the symbol under the cursor
6. C-c=: visit the header file under the cursor
7. C-cp: toggle between the implementation and the prototype of the symbol under the cursor
8. C-ce: when the cursor is in the scope of a class or one of its member functions, list all methods of the class
9. C-cC-r: show references to the symbol under the cursor
10. C-cC-c-: fold all
11. C-cC-c+: unfold all
12. C-c-: fold the block under the cursor
13. C-c+: unfold the block under the cursor

Using ECB

Once CEDET is installed and customized, we can use ECB. Add the following lines into your .emacs file:

(add-to-list 'load-path "~/.emacs.d/ecb-2.40")
(require 'ecb)


In Emacs, we can use M-x ecb-activate to turn on ECB mode. This page shows how to byte-compile ECB to make it run faster.

## Add HTTP Services into C++ Programs Using PION-NET

August 21, 2010

For the past four years, I have been working on parallel machine learning systems, where a learning job consists of multiple processes running in parallel on many computers. These processes may run for a long time (from dozens of minutes up to several days), so I often want to monitor their status, including statistics (e.g., CPU and memory consumption) and progress (how much data has been processed). Logging can expose status, but usually not in a comprehensive way. A more elegant solution is to make each process spawn a thread running a Web service, which exposes the status as HTML.

After a weeks-long survey, I decided to use pion-net, a library built upon Boost::Asio. A blog post here explains some advantages of Boost::Asio. Building and installing pion-net is straightforward: just ./configure && make && make install. If you want to use your own builds of Boost and OpenSSL, use ./configure --with-openssl=openssl-prefix-dir --with-boost=boost-prefix-dir.

In the source directory /pion-net-3.0.15/net/utils/ of pion-net, there are examples showing how to use pion::net::TCPServer and pion::net::WebServer. Here is an additional example for pion::net::HTTPServer:

// Copyright (C) 2010  Yi Wang (yi.wang.2005@gmail.com)
// Demonstrate how to add an HTTP service into a Boost/C++ program.

#include <unistd.h>                      // for sleep()

#include <iostream>
#include <pion/net/HTTPServer.hpp>
#include <pion/net/HTTPTypes.hpp>
#include <pion/net/HTTPRequest.hpp>
#include <pion/net/HTTPResponseWriter.hpp>

using namespace std;
using namespace pion;
using namespace pion::net;

void handleRootURI(HTTPRequestPtr& http_request, TCPConnectionPtr& tcp_conn) {
  static const std::string kHTMLStart("<html><body>\n");
  static const std::string kHTMLEnd("</body></html>\n");

  HTTPResponseWriterPtr
      writer(HTTPResponseWriter::create(tcp_conn,
                                        *http_request,
                                        boost::bind(&TCPConnection::finish,
                                                    tcp_conn)));
  HTTPResponse& r = writer->getResponse();
  r.setStatusCode(HTTPTypes::RESPONSE_CODE_OK);
  r.setStatusMessage(HTTPTypes::RESPONSE_MESSAGE_OK);

  HTTPTypes::QueryParams& params = http_request->getQueryParams();

  writer->writeNoCopy(kHTMLStart);
  writer->write(http_request->getResource());

  if (params.size() > 0) {
    writer->write(" has the following parameters: <br>");
    for (HTTPTypes::QueryParams::const_iterator i = params.begin();
         i != params.end(); ++i) {
      writer->write(i->first);
      writer->write("=");
      writer->write(i->second);
      writer->write("<br>");
    }
  } else {
    writer->write(" has no parameter.");
  }
  writer->writeNoCopy(kHTMLEnd);
  writer->send();
}

int main(int argc, char* argv[]) {
  static const unsigned int DEFAULT_PORT = 8080;
  try {
    HTTPServerPtr hello_server(new HTTPServer(DEFAULT_PORT));
    hello_server->addResource("/", &handleRootURI);  // route / to our handler
    hello_server->start();      // Spawn a thread to run the HTTP
                                // service, so the main thread can
                                // focus on the main work now.
    std::cout << "Hello, the server is running now ...\n";
    sleep(10);
  } catch (std::exception& e) {
    std::cerr << "Failed running the HTTP service due to " << e.what();
  }
  return 0;
}


The following Makefile lists libraries required to build this program:

pion_hello_service : pion_hello_service.cc
	g++ \
	  -o pion_hello_service \
	  -I/Users/wangyi/3rd-party/pion-net-3.0.15/include \
	  -I/Users/wangyi/3rd-party/boost-1.43.0/include \
	  -I/Users/wangyi/3rd-party/openssl-0.9.8o/include \
	  pion_hello_service.cc \
	  -L/Users/wangyi/3rd-party/pion-net-3.0.15/lib \
	  -L/Users/wangyi/3rd-party/boost-1.43.0/lib \
	  -L/Users/wangyi/3rd-party/openssl-0.9.8o/lib \
	  -lpion-common -lpion-net \
	  -lboost_date_time -lboost_signals -lboost_iostreams \
	  -lssl -lcrypto -lz -lbz2 -ldl


## Parallel LDA Gibbs Sampling Using OpenMP and MPI

August 18, 2010

I created an open source project, ompi-lda, on Google Code.  This project is inspired by another LDA project which I initialized at Google and a recent parallel programming effort using OpenMP by xlvector.

## No Large Local Variables in Functions

August 13, 2010

I had been driven crazy by the following code snippet:

void AFunction() {
  char buffer[kBufferSize];
  ...
}


I built this program with GCC successfully under Cygwin and SUSE Linux. However, on either platform, at run time, whenever AFunction was invoked, the program crashed without any message. Even the core dump did not contain any useful information. This disaster ended only after I changed buffer into a static variable:

void AFunction() {
  static char buffer[kBufferSize];
  ...
}


This core dump is due to stack overflow (buffer requires more memory than the stack provides). The following text explains how to change the default stack size on different platforms:
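Besides making the buffer static (which is not reentrant) or enlarging the stack, a third fix is to move the buffer to the heap. A minimal sketch, with a buffer size chosen here purely for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kBufferSize = 64 * 1024 * 1024;  // 64 MB, beyond typical stack limits

std::size_t AFunction() {
  // std::vector stores its elements on the heap, so the stack frame
  // stays tiny no matter how large kBufferSize is.
  std::vector<char> buffer(kBufferSize);
  buffer[0] = 'x';  // ... use the buffer ...
  return buffer.size();
}
```

Unlike the static-variable fix, this version is safe to call from multiple threads, at the price of one heap allocation per call.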

In general, under Unix-like platforms, the stack size is controlled by an environment variable, not by the program itself, so you cannot pass any flags to compilers like gcc to set the stack size. Under Windows, the stack size information is contained in the executable file. It can be set during compilation in Visual C++, but this is not available in gcc. Alternatively, Microsoft provides a program "editbin.exe" which can change the executable file directly. Here are more details:

SunOS/Solaris:
==============
> limit # shows the current stack size
> unlimit # changes the stack size to unlimited
> setenv STACKSIZE 32768 # limits the stack size to 32M bytes

Linux:
======
> ulimit -a # shows the current stack size
> ulimit -s 32768 # sets the stack size to 32M bytes

Windows (during compilation):
=============================
1. Select "Project->Setting".
2. Set "Category" to "Output".
3. Type your preferred stack size in the "Reserve:" field under "Stack
allocations", e.g., 32768 in decimal or 0x20000 in hexadecimal.

Windows (to modify the executable file):
=======================================
There are two programs included in Microsoft Visual Studio, “dumpbin.exe”
and “editbin.exe”. Run “dumpbin /headers executable_file”, and you can see
the “size of stack reserve” information in “optional header values”. Run
“editbin /STACK:size” to change the default stack size.

## SGI Octane III — Personal Super-Computer

August 9, 2010

If you search for "personal super-computer" on Google, you get two things: NVIDIA's Tesla GPU card system and SGI's Octane III. I took a note on the former. Here is a note on the latter — a super-computer with (at most) 120 cores in a pedestal box. The data sheet can be downloaded from here. From the data sheet we can see that the Octane III consists of trays. On each tray there are two sockets, and each socket can hold one Intel 5500 or 5600 CPU, where the 5500 has 4 cores and the 5600 has 6 cores. Each tray has its own memory, disk adapter, and network adapter. In short, each tray is a node. One of these nodes is called the "head node", and it has a GPU card installed.

The whole system looks like a small rack with many computers installed; just a small copy of the mature design used by Google and other organizations to build their distributed computing environments. But anyway, it is an interesting idea to make such a small rack that can fit in an office or on a desk. According to a post on gizmotastic, the price of the basic Octane III begins at $7,995 and tops out at $53,000, which would be affordable for many labs and companies.

Here are photos of Octane III. (On Wikipedia, you can find photos of Octane II and Octane I)

## Test an Integer Value is a Power of 2

August 8, 2010

The following code checks whether an integer value is a power of 2:

inline bool IsPowerOf2(int value) {
  return (value & (value - 1)) == 0;
}


I got this trick from the fast C++ trie code by Jeff Koftinoff. Jeff uses the following code to perform the above check at compile time:

template <bool test_condition>
struct trie_map_compile_time_test {
  trie_map_compile_time_test() {
    switch (false) {
      case false:
      case test_condition:
        break;
    }
  }
};

template <unsigned int VALUE>
struct trie_map_test_for_power_of_two
    : private trie_map_compile_time_test<((VALUE & (VALUE - 1)) == 0)> {
};


If VALUE is not a power-of-2, trie_map_compile_time_test makes the compiler complain that two switch labels are both “false”.
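On compilers supporting C++11 (which postdates this post), the same compile-time check can be written without the switch trick, using constexpr and static_assert; a sketch:

```cpp
#include <cassert>

// Same bit trick as above, evaluated at compile time when possible.
// Note: like the original, this treats 0 as a power of two.
constexpr bool IsPowerOf2(unsigned int value) {
  return (value & (value - 1)) == 0;
}

// The build breaks with a readable message if the condition is false.
static_assert(IsPowerOf2(1024), "1024 must be a power of two");
// static_assert(IsPowerOf2(1000), "...");  // would fail to compile
```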

## A Survey of Trie Implementations in C++

August 7, 2010

I did a survey on implementations of trie in C++.  However, the result is a little disappointing.

There are two kinds of trie: the regular trie (aka prefix tree), and the Patricia trie (aka radix trie). The difference is that, in contrast with a regular trie, the edges of a Patricia trie are labeled with sequences of characters rather than with single characters.
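To make the difference concrete, below is a minimal sketch of a regular trie (my own illustrative code, not one of the surveyed implementations). Every edge carries exactly one character; a Patricia trie would merge chains of single-child nodes so that each edge carries a whole substring.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

class Trie {
 public:
  void Insert(const std::string& key) {
    Node* node = &root_;
    for (std::size_t i = 0; i < key.size(); ++i)
      node = &node->children[key[i]];  // operator[] creates missing children
    node->is_key = true;
  }

  bool Contains(const std::string& key) const {
    const Node* node = &root_;
    for (std::size_t i = 0; i < key.size(); ++i) {
      std::map<char, Node>::const_iterator it = node->children.find(key[i]);
      if (it == node->children.end()) return false;
      node = &it->second;
    }
    return node->is_key;  // reached the node; is it marked as a stored key?
  }

 private:
  struct Node {
    Node() : is_key(false) {}
    bool is_key;
    std::map<char, Node> children;  // one outgoing edge per character
  };
  Node root_;
};
```

Inserting "car" and "card" shares the path c-a-r; only the final r node and the d node are marked as keys.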

Using Wikipedia and Google, I found the following C++ implementations of tries, most of them are Patricia tries.

1. (DENIED) Patricia Trie Template Class on CodeProject
Unfortunately, the code is in an ugly style. (I do have a bias against the code style of Microsoft, and coincidentally, the author's profile shows that he is a Microsoft employee.) Worse than that, after hours of refactoring, I found the code does not support using std::string as the key! And even though it supports using numerical types (e.g., int) as the key, the same set of keys won't result in tries with the same topology on platforms with different endianness.
2. (DENIED) Practical Algorithm Template Library (PATL) – C++ library on PATRICIA trie on Google Code
The code is in pretty good style, but it relies on Visual C++ syntax and Windows types (__int64). The test code and demos rely on Windows libraries.
3. (TESTING) Trie in SRILM
My colleagues suggested that I take a look at the trie implementation (a C++ template) in SRILM, a well-known language model training package. I successfully extracted the trie code from srilm/dstruct/src and made it compilable. But the API is so incomprehensible that I have no idea how to use it.
4. (DENIED) Trie in GAWK
I am interested in how GAWK implements its associative array. But after checking array.c, I don't think I can find a reusable trie there, or even a trie at all.
5. (NO ITERATOR) C++ Trie Library by Julien Lemoine
This is an implementation of a regular trie, not a Patricia trie. I successfully refactored this implementation into Google code style and built it using GCC. However, it lacks an iterator.
6. (DENIED) Trie in XORP
By Googling trie and iterator, I was brought to a documentation page of XORP, a networking package in C++. There is a trie implementation there, but it is tailored specifically for IP routing and is not reusable.
7. (GOOD) The Fast Lookup Trie
This is an implementation of a regular trie. There is an iterator defined. Good code style, an STL-compatible API, and it is claimed to be extremely fast. But the documentation admits that the implementation is hungry for memory; this needs to be checked. Two other notable points: (1) trie_map::begin() cannot return a const_iterator, no idea why; (2) the iterator traverses in a different order from that of STL map.
8. (DENIED) GNU STL extension: pb_ds::trie
This is well written, but not much faster than std::map<std::string, int>. A good property of pb_ds::trie::iterator is that it traverses in the same order as std::map::iterator does.

In order to test the running time of std::map and implementations 5, 7, and 8 above, I wrote the following code:

#include <sys/time.h>

#include <iostream>
#include <map>
#include <sstream>
#include <string>

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>

#include "../base/common.hh"
#include "../libtrie/trie.hh"              // libtrie
#include "../jdktrie/trie_map.hh"          // jdktrie

// Return current time in millisecs, reset every 24 hours.
inline size_t GetTickCount() {
  struct timeval t;
  gettimeofday(&t, NULL);
  t.tv_sec %= 86400;  // one day has 24*60*60 = 86400 seconds
  return (t.tv_sec * 1000 + t.tv_usec / 1000);
}

template <class Accumulator>
size_t AccumulateStringCount(Accumulator& accum) {
  size_t start_time = 0;
  size_t accum_time = 0;
  for (int j = 0; j < 100; ++j) {
    for (int i = 0; i < 10000; ++i) {
      std::ostringstream o;
      o << i;
      start_time = GetTickCount();
      ++accum[o.str()];
      accum_time += GetTickCount() - start_time;
    }
  }
  return accum_time;
}

void testPerformance() {
  std::map<std::string, int> map_accum;
  std::cout << "stl::map costs : "
            << AccumulateStringCount(map_accum) << " ticks.\n";
  pb_ds::trie<std::string, int> stl_trie_accum;
  std::cout << "pb_ds::trie costs : "
            << AccumulateStringCount(stl_trie_accum) << " ticks.\n";
  trie::Trie<int> libtrie_accum(-1);
  std::cout << "libtrie costs : "
            << AccumulateStringCount(libtrie_accum) << " ticks.\n";
  trie_map<std::string, int> jdk_trie_accum;
  std::cout << "jdktrie costs : "
            << AccumulateStringCount(jdk_trie_accum) << " ticks.\n";
}

int main(int argc, char** argv) {
  testPerformance();
  return 0;
}


Running this program on my iMac 100 times, the average number of ticks cost by the four methods is listed below:

| STL map  | STL trie | libtrie  | jdktrie  |
|----------|----------|----------|----------|
| 923.2900 | 885.3200 | 527.8900 | 366.5100 |

I am going to add the SRILM trie to this table. And I need to compare the memory consumption of these methods.