# Learning to Smooth Binomial Data

A large corpus of data in our daily engineering are binomial. There are many ways to smooth such data. My colleague, Rick Jin, recently referred me to a paper: Click-Through Rate Estimation forRare Events in Online Advertising, by Xuerui and others, where a novel solution is provided:

Given data are samples drawn from a binomial distribution, if we put a beta prior distribution upon the binomial parameter, using the conjugacy between beta and binomial, we can integrate out the binomial parameter in the log-likelihood function. Therefore, the log-likelihood becomes a function on beta parameters but not binomial parameters. Maximizing the log-likelihood leads to an optimal choice of beta parameters. According to the definition of beta distribution, the learned beta parameters are the optimal Dirichlet smoothing factor of the binomial data.

This looks a perfect solution! Is it generally optimal? I do not think so. The word “optimal” in above paragraph exists only in the assumption that the beta prior is the correct choice. In fact, as the authors noted, other distributions can also be used as the prior over the binomial parameter. The choice of beta is due to its math property (conjugacy with binomial) leads to efficient computation — but NOT due to it optimally describes the generation process of data.

Anyway, I think this method is worth of trying. Maybe, beta does describes the generation process pretty well.

# Linear Dependence and Statistical Dependence

When we are talking about feature selection, we often refer to methods like MI (mutual information), IG (information gain), Pearson correlation, etc. Generally, these methods can be categorized into two classes: measurements of linear dependence and measurements of statistical dependence. (When we talk about dependence here, we mean that between a feature and a response or between two features.)

Statistical Dependence

MI and IG measures statistical dependence between random variables X and Y. In the special application of feature selection:

1. IG measures the KL-divergence between two probability densities P(X) and P(Y).  If these two functions have shapes look like each other, the divergence is small. This implies that X and Y have the same support (or, say, have the same domain).
2. MI measures the KL-divergence between P(X,Y) and P(X)P(Y).  If X and Y are statistical independent with each other, P(X,Y)=P(X)P(Y), and the divergence is small. MI does not require X and Y have the same support.  It is limited by the requirement of P(X,Y), which is hard to get from data if X and/or Y are continuous variables.

In MI and IG, the computation (of KL-divergence) is conducted on distributions, P(X), P(Y) and P(X,Y), but not on data points directly.  So statistical dependence provides a general view on relationships between distributions.

Linear Dependence

Differ from statistical dependence methods, covariance matrix and Pearson correlation compute relationship between X and Y by using their samples directly.  More details can be found in this previous post. This gives linear dependence methods the ability to check whether samples of X-Y pairs distribute on (or close to) a line.

# Covariance and Correlation

Covariance and correlation are important concepts in learning from high-dimensional data. This Wikipedia article summarizes their commonality and (minor) difference. I have a previous post discussing more details about the relationship between Pearson correlation and covariance matrix.

# Debugging MPI Programs Using GDB

This skill comes from Professor Norm Matloff’s homepage: http://heather.cs.ucdavis.edu/~matloff/pardebug.html. Here I make a digest of it:

• Use the -g option, to retain the symbol table for gdb, during compiling.
• Insert the following code to make it wait for gdb attachment during runtime:
```int DebugWait = 1;
while (DebugWait) ;
```
• Lunch the parallel program using mpiexec, make sure you know those computers on which processes are running.
• Login to a computer, use the following bash commands to get a process id, and start gdb to attach to this process
```ps aux | grep program_name
gdb program_name process_id
```
• Use the following gdb commands to break the waiting and continue executing the process
```set DebugWait = 0
continue
```

# Memory Footprinting Using “top”

top -b -n 30 -d 1 | grep 18241 > mem_stat.txt &
30：采样次数
1：间隔时间
18241：进程号

# Search Programs and Packages in Cygwin

A good explanation on searching programs (packages) in Cygwin distribution from this post:

If you’re looking for a specific program, first try searching the Cygwin
packages at . For programs with short names
(like “top”) it’s usually a good idea to prepend “bin/”, and limit the
suffixes, e.g., search for “bin/top(.exe| |\$)” (the search page supports
arbitrary Perl regular expressions).

For example, the “top” program is in the “procps” package.

# How to Use Berkeley DB

2. Unpack the tarball and change directory into db-5.1.19/build_unix.
3. configure –prefix=… –enable-cxx –enable-stl && make && make install

References

1. Getting Started with Berkeley DB (C++ Edition)
This is a must-read one with code snippets for newcomers.
2. Berkeley DB Programmer’s Reference Guide
This comes with more details. Particularly, Chapter 7 introduces DBSTL with examples.
3. Berkeley DB C++ API Reference
4. Berkeley DB C++ Standard Template Library API Reference

Code Snippets

Utilities

You can use db_dump provided by Berkeley DB to dump the database file generated by the following examples.

Example in C

```#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <iomanip>
#include <string>

#include <db_cxx.h>

using namespace std;

const char* kDatabaseName = "access.db";

int main() {
DB *dbp;
DBT key, data;
int ret, t_ret;

if ((ret = db_create(&dbp, NULL, 0)) != 0) {
cerr << "db_create:" << db_strerror(ret) << endl;
exit (1);
}

if ((ret = dbp->open(dbp, NULL, kDatabaseName, "",
DB_BTREE, DB_CREATE, 0664)) != 0) {
dbp->err(dbp, ret, "%s", kDatabaseName);
goto err;
}

memset(&key, 0, sizeof(key));
memset(&data, 0, sizeof(data));
key.data = (char*)"fruit";
key.size = sizeof("fruit");
data.data = (char*)"apple";
data.size = sizeof("apple");

if ((ret = dbp->put(dbp, NULL, &key, &data, 0)) == 0)
cout << "db: " << (char*)key.data << ": key stored.\n";
else {
dbp->err(dbp, ret, "DB->put");
goto err;
}

if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0)
cout << "db: " << (char*)key.data
<< ": key retrieved: data was " << (char *)data.data << endl;
else {
dbp->err(dbp, ret, "DB->get");
goto err;
}

if ((ret = dbp->del(dbp, NULL, &key, 0)) == 0)
cout << "db: " << (char*)key.data << " key was deleted.\n";
else {
dbp->err(dbp, ret, "DB->del");
goto err;
}

if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0)
cout << "db: " << (char*)key.data << ": key retrieved: data was "
<< (char *)data.data << endl;
else
dbp->err(dbp, ret, "DB->get");

err:
if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
ret = t_ret;

return ret;
}
```

Examples in C++

```#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <iomanip>
#include <string>

#include <db_cxx.h>

using namespace std;

const char* kDatabaseName = "access.db";

int main() {

string fruit("fruit");
string apple("apple");
string orange("orange");

DbEnv env(0);
Db* pdb;

try {
env.set_error_stream(&cerr);
env.open("/Users/wangyi/db", DB_CREATE | DB_INIT_MPOOL, 0);

pdb = new Db(&env, 0);
// If you want to support duplicated records and make duplicated
// records sorted by data, you need to call:
//   pdb->set_flags(DB_DUPSORT);
// Note that only Btree-typed database supports sorted duplicated
// records

// If the database does not exist, create it.  If it exists, clear
// its content after openning.
pdb->open(NULL, "access.db", NULL, DB_BTREE, DB_CREATE | DB_TRUNCATE, 0);

Dbt key(const_cast<char*>(fruit.data()), fruit.size());
Dbt value(const_cast<char*>(apple.data()), apple.size()+1);
pdb->put(NULL, &key, &value, 0);

Dbt value_orange(const_cast<char*>(orange.data()), orange.size()+1);
pdb->put(NULL, &key, &value_orange, 0);

// You need to set ulen and flags=DB_DBT_USERMEM to prevent Dbt
// from allocate its own memory but use the memory provided by you.
char buffer[1024];
Dbt data;
data.set_data(buffer);
data.set_ulen(1024);
data.set_flags(DB_DBT_USERMEM);
if (pdb->get(NULL, &key, &data, 0) == DB_NOTFOUND) {
} else {
cout << "Found: " << buffer << endl;
}

if (pdb != NULL) {
pdb->close(0);
delete pdb;
// You have to close and delete an exisiting handle, then create
// a new one before you can use it to remove a database (file).
pdb = new Db(NULL, 0);
pdb->remove("/Users/wangyi/db/access.db", NULL, 0);
delete pdb;
}
env.close(0);
} catch (DbException& e) {
cerr << "DbException: " << e.what() << endl;
return -1;
} catch (std::exception& e) {
cerr << e.what() << endl;
return -1;
}

return 0;
}
```

Examples in DBSTL

```#include <iostream>
#include <iomanip>
#include <string>

#include <db_cxx.h>
#include <dbstl_map.h>

using namespace std;

const char* kDatabaseName = "access.db";

int main() {

string fruit("fruit");
string apple("apple");
string orange("orange");

DbEnv env(DB_CXX_NO_EXCEPTIONS);
Db* pdb;

try {
env.set_error_stream(&cerr);
env.open("/Users/wangyi/db", DB_CREATE | DB_INIT_MPOOL, 0);

pdb = new Db(&env, DB_CXX_NO_EXCEPTIONS);
pdb->open(NULL, "access.db", NULL, DB_BTREE, DB_CREATE, 0);

// In order to use db_map with an explicitly specified database,
// - the database and its environment must be constructed with
//   DB_CXX_NO_EXCEPTIONS flag set.
// - do not specify DB_TRUNCATE flag in Db::open.
typedef dbstl::db_map<string, string> HugeMap;
HugeMap huge_map(pdb, &env);

for (char c = 0; c < 26; ++c) {
string key, value;
key += (static_cast<char>(c + 'a'));
value += (static_cast<char>('z' - c));
huge_map[key] = value;
}

for (HugeMap::iterator i = huge_map.begin(); i != huge_map.end(); ++i) {
cout << i->first << " : " << i->second << "\n";
}

if (pdb != NULL) {
pdb->close(0);
delete pdb;
}
env.close(0);

} catch (DbException& e) {
cerr << "DbException: " << e.what() << endl;
return -1;
} catch (std::exception& e) {
cerr << e.what() << endl;
return -1;
}

return 0;
}
```

Build Programs

If your program uses only the C API (like the first example program), link against libdb. If it uses the C++ API, link against libdb_cxx. If it uses the DBSTL API, link against libdb_stl.