It had been a difficult problem to compare Windows and Linux. I had been holding the idea that in the Unix world, people write small and simple programs which work together via standard linkages like pipe and other inter-process communication mechanisms. However, under Windows, people tend to write a huge program which can do everything.
Monthly Archives: January 2010
Hierarchical Classification
A 2005 paper, Learning Classifiers Using Hierarchically Structured Class Taxonomies, discussed classification into a taxonomy. My general understand is that this problem can be solved by training a set of binary classifiers as the multi-label classification problem. More details delivered by this paper:
- Traditional: classify instances into mutually exclusive class labels,
- Multi-label: an instance may have more than one labels,
- Taxonomic: multi-label and labels are from a hierarchical taxonomy.
- Binarized: train a set of binary classifiers, each for a label in the taxonomy. In classification time, if an instance does not belongs to class C, then no need to check it with classifiers belonging to descendants of C.
- Split-based: need to read more to understand this solution.
Something New about LDA and HDP
The UC Irvin team have updated their work and published a JMLR 2009 paper: Distributed Algorithms for Topic Models. For HDP, they proposed a greedy approach to matching of new topics. I also like their ways to visualize the training process.
SSH on Mac OS X
This article shows how to start an SSH server on Mac OS X, how to set up loginless, and how to tunnelling unsecure protocols over SSH.
The article explains how to change the port number of SSH on Mac OS X. Note that from Mac OS X 10.4, the mechanism for launching sshd changed from using xinetd to launchd, so changing the port number becomes a little harder.
Get Across GFW Using Tor and Bridges
在这个技术博客上,我很少用中文写文章。但是此文似乎只有中国用户需要看到。
作为一个中国网络用户,为了访问我的这个技术博客,我不得不翻墙。在我的iMac和MacBook Pro上,Tor是个很不错的工具。可惜最近不好用了。今晚稍微研究了一下,发现可以通过加入网桥来解决这个问题。
具体的做法是:
Maximum Entropy Modeling
Generate Core Dump Files
If you want your program generates core dump files (including stack trace) when it encounters a segmentation fault, remember to set the following shell option
ulimit -c unlimited
Before running your program.
Once the core file is generated (say core), we can check the stack trace using GDB:
gdb program_file core
Then type GDB command
where
which will show you the stack trace.
Special Notes for Cygwin
When a process (i.e. foo.exe) cores, a default stackdump foo.exe.stackdump is generated. This stackdump contains (among other things) stack frames and functions addresses. You can make some sense of it by using the `addr2line’ utility, but it’s not as convenient as using a debugger.
Which takes me to the actual useful bit on information in this post. You can instruct Cygwin to start your gdb debugger just in time when an fault occurs or have Cygwin generate a real core dump.
To achieve this, add `error_start=action’ to the Cygwin environment variable. For example, to start gdb:
export CYGWIN="$CYGWIN error_start=gdb -nw %1 %2"
Or to generate core dump:
export CYGWIN="$CYGWIN error_start=dumper -d %1 %2"
A Step-by-Step Tutorial on Autotools
A C++ MapReduce "Implementation" Basing on Hadoop Streaming
Hadoop has two mechanisms to support using languages other than Java:
- Hadoop Pipes, which provides a C++ library pair to support Hadoop programs in C/C++ only, and
- Hadoop Streamining, which languages any executable files in map/reduce worker processes, and thus support any languages.
So, I would like to turn to use Streaming and C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoke your map and reduce program using the following shell command:
cat input_file | map_program | sort | reduce_program
Of couse, as you know, Hadoop runs the shell pipes on a computing cluster in parallel.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-file ./word_count_mapper -mapper ./word_count_mapper \
-file ./word_count_reducer -reducer ./word_count_reducer \
-input ./input/*.txt -output
Basing on Hadoop Streamming, I wrote a C++ MapReduce wrapper (more precisely, it should be called a MapReduce implementation, but the code is simple when built on Hadoop Streaming, that I feel embarrassed to call it an “implementation”). Anyway, I found it is interesting that this simple wrapper support secondary keys, whereas org.apache.hadoop.mapreduce does not yet. 🙂
svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'
. So you should be able to checkout the code now.
Map-Reduce-Merge for Join Operation
In this SIGMOD 2007 paper: Map-reduce-merge: simplified relational data processing on large clusters, the authors add to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules, and demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.