Learning to Smooth Binomial Data

October 28, 2010

A large portion of the data in our daily engineering work is binomial. There are many ways to smooth such data. My colleague, Rick Jin, recently referred me to a paper, Click-Through Rate Estimation for Rare Events in Online Advertising, by Xuerui and others, which provides a novel solution:

Given that the data are samples drawn from a binomial distribution, if we put a beta prior on the binomial parameter, then by the conjugacy between the beta and the binomial we can integrate the binomial parameter out of the likelihood. The log-likelihood therefore becomes a function of the beta parameters rather than the binomial parameter. Maximizing this log-likelihood yields an optimal choice of the beta parameters, and by the definition of the beta distribution, the learned beta parameters are the optimal Dirichlet (beta) smoothing factors for the binomial data.
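To make this concrete, here is a toy sketch of the idea (my own illustration, not the paper's implementation: the data, the helper names LogBeta/LogLikelihood, and the crude grid search are all made up, and the paper uses a more efficient update). Marginalizing the binomial parameter under a Beta(alpha, beta) prior gives each observation a beta-binomial likelihood, and we pick the (alpha, beta) that maximize the total log-likelihood; the smoothed rate of an item is then (clicks + alpha) / (views + alpha + beta).

#include <cmath>
#include <cstdio>
#include <vector>

// log B(a, b), the log of the beta function.
double LogBeta(double a, double b) {
  return lgamma(a) + lgamma(b) - lgamma(a + b);
}

// Beta-binomial log-likelihood of the data under a Beta(alpha, beta) prior.
// The binomial coefficients are omitted because they do not depend on
// (alpha, beta).
double LogLikelihood(const std::vector<int>& clicks,
                     const std::vector<int>& views,
                     double alpha, double beta) {
  double ll = 0.0;
  for (size_t i = 0; i < clicks.size(); ++i)
    ll += LogBeta(clicks[i] + alpha, views[i] - clicks[i] + beta)
        - LogBeta(alpha, beta);
  return ll;
}

int main() {
  // Toy data: (clicks, views) pairs.
  std::vector<int> clicks;  clicks.push_back(2);   clicks.push_back(0);
  std::vector<int> views;   views.push_back(100);  views.push_back(50);

  // Crude grid search over (alpha, beta); good enough to illustrate the idea.
  double best_a = 1.0, best_b = 1.0, best_ll = -1e300;
  for (double a = 0.1; a <= 10.0; a += 0.1) {
    for (double b = 0.5; b <= 200.0; b += 0.5) {
      double ll = LogLikelihood(clicks, views, a, b);
      if (ll > best_ll) { best_ll = ll; best_a = a; best_b = b; }
    }
  }

  // The smoothed rate of item i is (clicks[i] + alpha) / (views[i] + alpha + beta).
  printf("alpha = %g, beta = %g, smoothed rate of item 0 = %g\n",
         best_a, best_b,
         (clicks[0] + best_a) / (views[0] + best_a + best_b));
  return 0;
}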

This looks like a perfect solution! Is it generally optimal? I do not think so. The word “optimal” in the paragraph above holds only under the assumption that the beta prior is the correct choice. In fact, as the authors note, other distributions could also be used as the prior over the binomial parameter. The beta is chosen because its mathematical property (conjugacy with the binomial) leads to efficient computation, NOT because it optimally describes the process that generates the data.

Anyway, I think this method is worth trying. Maybe the beta does describe the generation process pretty well.


Linear Dependence and Statistical Dependence

October 23, 2010

When we are talking about feature selection, we often refer to methods like MI (mutual information), IG (information gain), Pearson correlation, etc. Generally, these methods can be categorized into two classes: measurements of linear dependence and measurements of statistical dependence. (When we talk about dependence here, we mean that between a feature and a response or between two features.)

Statistical Dependence

MI and IG measure the statistical dependence between random variables X and Y. In the specific application of feature selection:

  1. IG measures the KL-divergence between the two probability densities P(X) and P(Y).  If the two functions have similar shapes, the divergence is small. This implicitly requires that X and Y have the same support (or, say, the same domain).
  2. MI measures the KL-divergence between P(X,Y) and P(X)P(Y).  If X and Y are statistically independent of each other, then P(X,Y)=P(X)P(Y) and the divergence is zero. MI does not require X and Y to have the same support.  It is limited by the need for the joint distribution P(X,Y), which is hard to estimate from data if X and/or Y are continuous variables.

In both MI and IG, the computation (of KL-divergence) is carried out on the distributions P(X), P(Y) and P(X,Y), not on the data points directly.  So statistical dependence provides a general view of the relationship between distributions.
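As a concrete illustration, here is a minimal sketch (mine, with made-up data and helper names) that estimates MI between two discrete variables by building the empirical P(X), P(Y) and P(X,Y) from paired samples and summing P(x,y) log( P(x,y) / (P(x)P(y)) ):

#include <cmath>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Estimate mutual information I(X;Y) from paired samples of two
// discrete variables, using empirical probabilities.
double MutualInformation(const std::vector<int>& x, const std::vector<int>& y) {
  std::map<int, double> px, py;
  std::map<std::pair<int, int>, double> pxy;
  const double n = x.size();
  for (size_t i = 0; i < x.size(); ++i) {
    px[x[i]] += 1.0 / n;
    py[y[i]] += 1.0 / n;
    pxy[std::make_pair(x[i], y[i])] += 1.0 / n;
  }
  double mi = 0.0;
  for (std::map<std::pair<int, int>, double>::const_iterator it = pxy.begin();
       it != pxy.end(); ++it) {
    double p_xy = it->second;
    double p_x = px[it->first.first];
    double p_y = py[it->first.second];
    mi += p_xy * log(p_xy / (p_x * p_y));  // KL( P(X,Y) || P(X)P(Y) )
  }
  return mi;  // in nats; divide by log(2) for bits
}

int main() {
  // Toy data: y is a noisy copy of x, so MI should be well above zero.
  int xs[] = {0, 0, 1, 1, 0, 1, 0, 1};
  int ys[] = {0, 0, 1, 1, 0, 1, 1, 1};
  std::vector<int> x(xs, xs + 8), y(ys, ys + 8);
  printf("estimated MI = %g nats\n", MutualInformation(x, y));
  return 0;
}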

Linear Dependence

Unlike statistical dependence methods, the covariance matrix and Pearson correlation compute the relationship between X and Y from their samples directly.  More details can be found in this previous post. This gives linear dependence methods the ability to check whether samples of X-Y pairs lie on (or close to) a line.
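For comparison, a minimal sketch (again mine, with made-up data) of the Pearson correlation computed directly from X-Y sample pairs as cov(X,Y) / (std(X) std(Y)):

#include <cmath>
#include <cstdio>
#include <vector>

// Pearson correlation of paired samples: cov(X,Y) / (std(X) * std(Y)).
double PearsonCorrelation(const std::vector<double>& x,
                          const std::vector<double>& y) {
  const double n = x.size();
  double mean_x = 0.0, mean_y = 0.0;
  for (size_t i = 0; i < x.size(); ++i) {
    mean_x += x[i] / n;
    mean_y += y[i] / n;
  }
  double cov = 0.0, var_x = 0.0, var_y = 0.0;
  for (size_t i = 0; i < x.size(); ++i) {
    cov   += (x[i] - mean_x) * (y[i] - mean_y);
    var_x += (x[i] - mean_x) * (x[i] - mean_x);
    var_y += (y[i] - mean_y) * (y[i] - mean_y);
  }
  return cov / sqrt(var_x * var_y);
}

int main() {
  // Toy data: y is roughly 2x plus noise, so the correlation is close to 1.
  double xs[] = {1, 2, 3, 4, 5};
  double ys[] = {2.1, 3.9, 6.2, 8.0, 9.8};
  std::vector<double> x(xs, xs + 5), y(ys, ys + 5);
  printf("Pearson correlation = %g\n", PearsonCorrelation(x, y));
  return 0;
}

If the sample pairs lie close to a line, the absolute value of the result is close to 1; if they are scattered, it is close to 0.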


Covariance and Correlation

October 23, 2010

Covariance and correlation are important concepts in learning from high-dimensional data. This Wikipedia article summarizes their commonality and (minor) difference. I have a previous post discussing more details about the relationship between Pearson correlation and covariance matrix.


Debugging MPI Programs Using GDB

October 16, 2010

This technique comes from Professor Norm Matloff’s homepage: http://heather.cs.ucdavis.edu/~matloff/pardebug.html. Here is a digest of it:

  • Compile with the -g option so that the symbol table is retained for gdb.
  • Insert the following code near the top of main so that each process waits for gdb to attach at runtime (a complete minimal example, with the flag declared volatile so an optimizing compiler cannot remove the empty loop, follows this list):
    int DebugWait = 1;
    while (DebugWait) ;
    
  • Launch the parallel program using mpiexec, and make sure you know which machines the processes are running on.
  • Log in to one of those machines, use the following shell commands to find the process id, and attach gdb to that process:
    ps aux | grep program_name
    gdb program_name process_id
    
  • Use the following gdb commands to end the waiting loop and let the process continue executing:
    set DebugWait = 0
    continue
    
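Putting the pieces together, below is a minimal MPI program (my own sketch, not from Matloff's page) with the waiting loop built in. The flag is declared volatile so that an optimizing compiler cannot drop the empty loop, and each process prints its host name and pid so you know where to attach:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

// Each process spins here until a debugger attaches and sets DebugWait = 0.
volatile int DebugWait = 1;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank;
  char hostname[256];
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  gethostname(hostname, sizeof(hostname));
  // Report where each process runs, so we know which machine to log in to.
  printf("rank %d running on %s with pid %d\n", rank, hostname, (int)getpid());
  fflush(stdout);

  while (DebugWait) ;  // in gdb: set DebugWait = 0, then continue

  // ... the real work of the program goes here ...

  MPI_Finalize();
  return 0;
}

Compile it with mpicc -g (or mpicxx -g for C++) and launch it with mpiexec as usual.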

Memory Footprinting Using “top”

October 14, 2010

A method that 王流斌 taught me:

One way to monitor the memory usage of a process:
top -b -n 30 -d 1 | grep 18241 > mem_stat.txt &
30: the number of samples
1: the sampling interval (in seconds)
18241: the process id


Search Programs and Packages in Cygwin

October 14, 2010

A good explanation of how to search for programs (packages) in the Cygwin distribution, from this post:

If you’re looking for a specific program, first try searching the Cygwin
packages at . For programs with short names
(like “top”) it’s usually a good idea to prepend “bin/”, and limit the
suffixes, e.g., search for “bin/top(.exe| |$)” (the search page supports
arbitrary Perl regular expressions).

For example, the “top” program is in the “procps” package.


How to Use Berkeley DB

October 10, 2010

Download and Build

  1. Download the source code tarball from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
  2. Unpack the tarball and change directory into db-5.1.19/build_unix.
  3. ../dist/configure --prefix=... --enable-cxx --enable-stl && make && make install

References

  1. Getting Started with Berkeley DB (C++ Edition)

    http://download.oracle.com/docs/cd/E17076_02/html/gsg/CXX/index.html

    This is a must-read with code snippets for newcomers.

  2. Berkeley DB Programmer’s Reference Guide

    http://download.oracle.com/docs/cd/E17076_02/html/programmer_reference/index.html

    This comes with more details. Particularly, Chapter 7 introduces DBSTL with examples.

  3. Berkeley DB C++ API Reference

    http://download.oracle.com/docs/cd/E17076_02/html/api_reference/CXX/frame_main.html

  4. Berkeley DB C++ Standard Template Library API Reference

    http://download.oracle.com/docs/cd/E17076_02/html/api_reference/STL/frame_main.html

Code Snippets

Utilities

You can use db_dump provided by Berkeley DB to dump the database file generated by the following examples.
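For example, after running one of the examples below, you can dump the contents of access.db with keys and data rendered as printable characters (the -p option of db_dump):

db_dump -p access.db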

Example Using the C API

#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <iomanip>
#include <string>

#define HAVE_CXX_STDHEADERS
#include <db_cxx.h>

using namespace std;

const char* kDatabaseName = "access.db";

int main() {
  DB *dbp;
  DBT key, data;
  int ret, t_ret;

  if ((ret = db_create(&dbp, NULL, 0)) != 0) {
    cerr << "db_create:" << db_strerror(ret) << endl;
    exit (1);
  }

  if ((ret = dbp->open(dbp, NULL, kDatabaseName, "",
                       DB_BTREE, DB_CREATE, 0664)) != 0) {
    dbp->err(dbp, ret, "%s", kDatabaseName);
    goto err;
  }

  memset(&key, 0, sizeof(key));
  memset(&data, 0, sizeof(data));
  key.data = (char*)"fruit";
  key.size = sizeof("fruit");
  data.data = (char*)"apple";
  data.size = sizeof("apple");

  if ((ret = dbp->put(dbp, NULL, &key, &data, 0)) == 0)
    cout << "db: " << (char*)key.data << ": key stored.\n";
  else {
    dbp->err(dbp, ret, "DB->put");
    goto err;
  }

  if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0)
    cout << "db: " << (char*)key.data
         << ": key retrieved: data was " << (char *)data.data << endl;
  else {
    dbp->err(dbp, ret, "DB->get");
    goto err;
  }

  if ((ret = dbp->del(dbp, NULL, &key, 0)) == 0)
    cout << "db: " << (char*)key.data << " key was deleted.\n";
  else {
    dbp->err(dbp, ret, "DB->del");
    goto err;
  }

  if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0)
    cout << "db: " << (char*)key.data << ": key retrieved: data was "
         << (char *)data.data << endl;
  else
    dbp->err(dbp, ret, "DB->get");

err:
  if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
    ret = t_ret;

  return ret;
}

Example Using the C++ API

#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <iomanip>
#include <string>

#include <db_cxx.h>

using namespace std;

const char* kDatabaseName = "access.db";

int main() {

  string fruit("fruit");
  string apple("apple");
  string orange("orange");

  DbEnv env(0);
  Db* pdb;

  try {
    env.set_error_stream(&cerr);
    env.open("/Users/wangyi/db", DB_CREATE | DB_INIT_MPOOL, 0);

    pdb = new Db(&env, 0);
    // If you want to support duplicate records and have duplicate
    // records sorted by data, you need to call:
    //   pdb->set_flags(DB_DUPSORT);
    // Note that only Btree-typed databases support sorted duplicate
    // records.

    // If the database does not exist, create it.  If it exists, clear
    // its content after opening.
    pdb->open(NULL, "access.db", NULL, DB_BTREE, DB_CREATE | DB_TRUNCATE, 0);

    Dbt key(const_cast<char*>(fruit.data()), fruit.size());
    Dbt value(const_cast<char*>(apple.data()), apple.size()+1);
    pdb->put(NULL, &key, &value, 0);

    Dbt value_orange(const_cast<char*>(orange.data()), orange.size()+1);
    pdb->put(NULL, &key, &value_orange, 0);

    // You need to set ulen and flags=DB_DBT_USERMEM to prevent Dbt
    // from allocating its own memory and make it use the memory you provide.
    char buffer[1024];
    Dbt data;
    data.set_data(buffer);
    data.set_ulen(1024);
    data.set_flags(DB_DBT_USERMEM);
    if (pdb->get(NULL, &key, &data, 0) == DB_NOTFOUND) {
      cerr << "Not found" << endl;
    } else {
      cout << "Found: " << buffer << endl;
    }

    if (pdb != NULL) {
      pdb->close(0);
      delete pdb;
      // You have to close and delete an existing handle, then create
      // a new one, before you can use it to remove a database (file).
      pdb = new Db(NULL, 0);
      pdb->remove("/Users/wangyi/db/access.db", NULL, 0);
      delete pdb;
    }
    env.close(0);
  } catch (DbException& e) {
    cerr << "DbException: " << e.what() << endl;
    return -1;
  } catch (std::exception& e) {
    cerr << e.what() << endl;
    return -1;
  }

  return 0;
}

Example Using DBSTL

#include <iostream>
#include <iomanip>
#include <string>

#include <db_cxx.h>
#include <dbstl_map.h>

using namespace std;

const char* kDatabaseName = "access.db";

int main() {

  string fruit("fruit");
  string apple("apple");
  string orange("orange");

  DbEnv env(DB_CXX_NO_EXCEPTIONS);
  Db* pdb;

  try {
    env.set_error_stream(&cerr);
    env.open("/Users/wangyi/db", DB_CREATE | DB_INIT_MPOOL, 0);

    pdb = new Db(&env, DB_CXX_NO_EXCEPTIONS);
    pdb->open(NULL, "access.db", NULL, DB_BTREE, DB_CREATE, 0);

    // In order to use db_map with an explicitly specified database,
    // - the database and its environment must be constructed with
    //   DB_CXX_NO_EXCEPTIONS flag set.
    // - do not specify DB_TRUNCATE flag in Db::open.
    typedef dbstl::db_map<string, string> HugeMap;
    HugeMap huge_map(pdb, &env);

    for (char c = 0; c < 26; ++c) {
      string key, value;
      key += (static_cast<char>(c + 'a'));
      value += (static_cast<char>('z' - c));
      huge_map[key] = value;
    }

    for (HugeMap::iterator i = huge_map.begin(); i != huge_map.end(); ++i) {
      cout << i->first << " : " << i->second << "\n";
    }

    if (pdb != NULL) {
      pdb->close(0);
      delete pdb;
    }
    env.close(0);

  } catch (DbException& e) {
    cerr << "DbException: " << e.what() << endl;
    return -1;
  } catch (std::exception& e) {
    cerr << e.what() << endl;
    return -1;
  }

  return 0;
}

Build Programs

If your program uses only the C API (like the first example program), link against libdb. If it uses the C++ API, link against libdb_cxx. If it uses the DBSTL API, link against libdb_stl.
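As a concrete illustration, assuming Berkeley DB was installed under /usr/local/BerkeleyDB.5.1 (a hypothetical prefix; substitute whatever --prefix you used), the C++ example above might be built like this:

g++ example.cc -o example \
    -I/usr/local/BerkeleyDB.5.1/include \
    -L/usr/local/BerkeleyDB.5.1/lib -ldb_cxx

The DBSTL example additionally needs -ldb_stl, and at runtime the lib directory may need to be on LD_LIBRARY_PATH.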


The Probability Distribution of Gaussian Parameters

October 6, 2010

In my previous post, I noted the general process for deriving confidence intervals of Gaussian parameters. This post specifically notes the proof of the joint probability distribution of the Gaussian parameters on p.29 of MIT Statistics lecture notes no.5.

An orthogonal transformation is a square matrix whose rows are basis vectors of length 1.  Since these basis vectors are perpendicular to each other, multiplying an orthogonal transformation with a vector x projects x onto each basis vector.

This just re-represents x in another axis system (defined by the rows of the transformation); since the new and the old axis systems share the same origin, an orthogonal transformation does not change the length of x.  This property is used in (5.0.2).

Imagine that we have a set of points (vectors) sampled from a Gaussian distribution whose covariance matrix is a multiple of the identity, σ²I (as is the case for a vector of i.i.d. Gaussian samples).  Since such a distribution has spherical contours, an orthogonal transformation of the contours (or, equivalently, of the sample points) does not change their spherical shape. In other words, the transformed random variable (or the transformed Gaussian distribution) is still Gaussian with covariance σ²I, and its coordinates are independent of each other.  This property is used in the last paragraph of the proof.
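In symbols (my own restatement of the two properties used above; Q denotes the orthogonal matrix, x a fixed vector, and X the Gaussian random vector, notation not taken from the notes):

\[ \|Qx\|^2 = x^\top Q^\top Q\, x = x^\top x = \|x\|^2 \]

\[ X \sim \mathcal{N}(\mu, \sigma^2 I) \;\Rightarrow\; QX \sim \mathcal{N}(Q\mu, \sigma^2 QQ^\top) = \mathcal{N}(Q\mu, \sigma^2 I) \]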

