Unix Philosophy

Comparing Windows and Linux has long been a difficult problem for me. I have held the view that in the Unix world people write small, simple programs that work together through standard linkages such as pipes and other inter-process communication mechanisms, whereas under Windows people tend to write one huge program that does everything.

Word processing is an example. In the Windows world, we have Microsoft Word, which packs a huge number of functions into one program: editing, WYSIWYG rendering, spell checking, printing, and much more. In the Unix world, we use the TeX system, which consists of many programs, each doing one simple thing: the TeX macro language defines basic typesetting functions, LaTeX defines more, Emacs (or any other editor) handles editing, pdfLaTeX (and other converters) turns sources into PDF or other formats, and AUCTeX or LyX provides WYSIWYG editing.
Judging from the above, I think I am not doing too badly: I can at least see the separation and composition philosophy of the Unix world. However, there are many more principles that I have not been able to summarize myself. Luckily, a master has already summarized them for us, so please refer to the great book The Art of Unix Programming.

Hierarchical Classification

A 2005 paper, Learning Classifiers Using Hierarchically Structured Class Taxonomies, discussed classification into a taxonomy. My general understanding is that this problem can be solved by training a set of binary classifiers, as in multi-label classification. The paper gives more details:

Types of Classification:
  1. Traditional: classify instances into mutually exclusive class labels,
  2. Multi-label: an instance may have more than one label,
  3. Taxonomic: multi-label and labels are from a hierarchical taxonomy.
Solutions Proposed:
  1. Binarized: train a binary classifier for each label in the taxonomy. At classification time, if an instance does not belong to class C, there is no need to check it against the classifiers of C's descendants (see the sketch after this list).
  2. Split-based: I need to read more to understand this solution.
From the experimental results, the two solutions seem to have similar performance, and both differ from the bottom-up solution that I saw at Google.
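
To make the binarized approach concrete, here is a minimal Python sketch of the top-down pruning (my own illustration, not code from the paper). It assumes one already-trained binary classifier per label, each exposing a predict(x) method that returns True or False, and a taxonomy given as a dict mapping every label to its children.

# taxonomy: a hypothetical class hierarchy; each label maps to its child labels.
taxonomy = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "physics": [], "biology": [], "sports": [],
}

def classify(x, classifiers, taxonomy, label="root", predicted=None):
    """Binarized taxonomic classification: walk the taxonomy top-down and
    prune a whole subtree as soon as its root label is rejected.
    `classifiers` is an assumed dict from label to a trained binary
    classifier with a predict(x) method."""
    if predicted is None:
        predicted = set()
    if label != "root":
        if not classifiers[label].predict(x):  # rejected: skip all descendants
            return predicted
        predicted.add(label)                   # accepted: one of x's labels
    for child in taxonomy[label]:
        classify(x, classifiers, taxonomy, child, predicted)
    return predicted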

Something New about LDA and HDP

The UC Irvine team has updated their work and published a JMLR 2009 paper, Distributed Algorithms for Topic Models. For HDP, they propose a greedy approach to matching new topics. I also like the way they visualize the training process.

Diane Hu, a PhD student working on latent topic models for musical analysis, recently wrote a tutorial/survey on LDA, Latent Dirichlet Allocation for Text, Images, and Music, which introduces the basics of LDA as well as its extensions for images and music.

SSH on Mac OS X

This article shows how to start an SSH server on Mac OS X, how to set up password-less login, and how to tunnel insecure protocols over SSH.

The article also explains how to change the port number of SSH on Mac OS X. Note that since Mac OS X 10.4, sshd has been launched by launchd instead of xinetd, so changing the port number has become a little harder.

Get Across GFW Using Tor and Bridges

I rarely write in Chinese on this technical blog, but this post seems relevant only to users in China.

As a network user in China, I have to get across the GFW just to visit this technical blog of mine. On my iMac and MacBook Pro, Tor has been a very good tool for this, but it stopped working recently. I looked into the problem a bit tonight and found that it can be solved by adding bridges.

Here is what to do:

  1. Send an email to bridges@torproject.org; the content does not matter. Shortly afterwards, you will receive a reply listing several usable bridges.
  2. In Vidalia, open Settings; in the Settings dialog, switch to the Network tab; check "My ISP blocks connections to the Tor network"; then add the bridges from the email one by one through the "Add a Bridge" input box. See the following screenshot:

Generate Core Dump Files

If you want your program to generate a core dump file (from which the stack trace can be recovered) when it encounters a segmentation fault, remember to set the following shell option

ulimit -c unlimited

before running your program.

Once the core file is generated (say core), we can check the stack trace using GDB:

gdb program_file core

Then type the GDB command

where

which will show you the stack trace.

Special Notes for Cygwin

When a process (e.g., foo.exe) dumps core, a default stackdump file foo.exe.stackdump is generated. This stackdump contains (among other things) stack frames and function addresses. You can make some sense of it using the addr2line utility, but it is not as convenient as using a debugger.

This brings me to the actually useful bit of information in this post: you can instruct Cygwin to start the gdb debugger just in time when a fault occurs, or have Cygwin generate a real core dump.

To achieve this, add error_start=action to the CYGWIN environment variable. For example, to start gdb:

export CYGWIN="$CYGWIN error_start=gdb -nw %1 %2"

Or, to generate a core dump:

export CYGWIN="$CYGWIN error_start=dumper -d %1 %2"

A Step-by-Step Tutorial on Autotools

Autotools is so complicated for new users; however, I got lucky this evening and found an excellent step-by-step tutorial. By following it, I packaged my Hadoop Streaming wrapper for C++ in a few minutes! I would like to donate to the author if s/he wants. 🙂

A C++ MapReduce "Implementation" Based on Hadoop Streaming

Hadoop has two mechanisms to support using languages other than Java:

  1. Hadoop Pipes, which provides a C++ library to support writing Hadoop programs in C/C++ only, and
  2. Hadoop Streaming, which launches arbitrary executables as the map/reduce worker processes and thus supports any language.
However, in Hadoop 0.20.1, the Pipes support (the Java code in package org.apache.hadoop.mapred.pipes) has been marked deprecated, so I guess Pipes has not been fully ported in 0.20.1. Some forum posts also discuss this issue.

So I turned to Streaming with C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoking your map and reduce programs with the following shell pipeline:

cat input_file | map_program | sort | reduce_program

Of course, Hadoop runs this pipeline on a computing cluster in parallel.
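
To make this concrete, here is a minimal Python word count in the spirit of Noll's tutorial (my own sketch; the word_count_mapper and word_count_reducer executables in the job command below are the C++ counterparts playing the same roles).

#!/usr/bin/env python
# word_count_mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write("%s\t1\n" % word)

#!/usr/bin/env python
# word_count_reducer.py: stdin arrives sorted by word, so all counts of a
# word are consecutive; sum them and emit "word<TAB>count".
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_count = 0
    current_word = word
    current_count += int(count)
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))

Running cat input.txt | ./word_count_mapper.py | sort | ./word_count_reducer.py locally gives the same result that Hadoop computes in parallel, and a Streaming job wrapping such executables is submitted with a command like: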

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-file ./word_count_mapper -mapper ./word_count_mapper \
-file ./word_count_reducer -reducer ./word_count_reducer \
-input ./input/*.txt -output

Based on Hadoop Streaming, I wrote a C++ MapReduce wrapper (more precisely, it could be called a MapReduce implementation, but code built on Hadoop Streaming is so simple that I feel embarrassed to call it an "implementation"). Anyway, I find it interesting that this simple wrapper supports secondary keys, whereas org.apache.hadoop.mapreduce does not yet. 🙂
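
To illustrate what I mean by secondary keys (a Python sketch of the idea, not the wrapper's actual C++ code): the mapper emits a composite key of the form primary<TAB>secondary, the sort step then orders records by primary key and, within each primary key, by secondary key, and the reducer only needs to detect primary-key boundaries. On a real cluster, the Streaming job must additionally be configured so that partitioning uses only the primary key, which I omit here.

#!/usr/bin/env python
# Reducer sketch: input lines are "primary<TAB>secondary<TAB>value", already
# sorted by (primary, secondary), so the values of each primary key arrive in
# secondary-key order; here we simply concatenate them in that order.
import sys

def flush(key, values):
    if key is not None:
        sys.stdout.write("%s\t%s\n" % (key, ",".join(values)))

current, values = None, []
for line in sys.stdin:
    primary, secondary, value = line.rstrip("\n").split("\t", 2)
    if primary != current:
        flush(current, values)
        current, values = primary, []
    values.append(value)  # appended in secondary-key order
flush(current, values)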

I have created a Google Code project to host this simple implementation: Hadoop Streaming MapReduce, and imported the code using the following command line:

svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'

So you should be able to check out the code now.

Map-Reduce-Merge for Join Operation

In the SIGMOD 2007 paper Map-reduce-merge: simplified relational data processing on large clusters, the authors add a Merge phase to Map-Reduce that can efficiently merge data already partitioned and sorted (or hashed) by the map and reduce modules, and demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.

However, I think the newly added merge stage could be implemented as another MapReduce job: the mapper scans over the key-value pairs of all lineages and outputs them unchanged; the shuffle stage then merges values of different lineages that share the same key into a single reduce input; finally, the reducer does whatever the merger is supposed to do.
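
Here is a toy Python sketch of that idea (my own illustration, not the paper's algorithm), simulating the map, shuffle, and reduce steps of a reduce-side join in memory; the relation names and records are made up.

# Toy reduce-side join: the mapper tags each record with its lineage, the
# shuffle groups tagged records by key, and the reducer does the merge.
from itertools import product

employees = [("emp-1", "Alice"), ("emp-2", "Bob")]              # (key, name)
salaries = [("emp-1", 3000), ("emp-2", 4000), ("emp-2", 500)]   # (key, amount)

# Map: emit every record unchanged, tagged with the relation it came from.
mapped = [(k, ("employees", v)) for k, v in employees] + \
         [(k, ("salaries", v)) for k, v in salaries]

# Shuffle: group all tagged values sharing a key (done by Hadoop in practice).
groups = {}
for key, tagged in mapped:
    groups.setdefault(key, []).append(tagged)

# Reduce: the "merger" logic, here an inner join of the two lineages.
for key, tagged_values in sorted(groups.items()):
    names = [v for tag, v in tagged_values if tag == "employees"]
    amounts = [v for tag, v in tagged_values if tag == "salaries"]
    for name, amount in product(names, amounts):
        print("%s\t%s\t%s" % (key, name, amount))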
Other people have told me that there are more ways to do the merge within the MapReduce model, but the simple solution above seems to be among the most scalable. In short, if I were going to implement a parallel programming model with the objective of supporting joins of relational data, I would just implement MapReduce rather than Map-Reduce-Merge.