A C++ MapReduce "Implementation" Based on Hadoop Streaming

Hadoop has two mechanisms to support using languages other than Java:

  1. Hadoop Pipes, which provides a pair of C++ libraries to support Hadoop programs written in C/C++ only, and
  2. Hadoop Streaming, which runs arbitrary executables in the map/reduce worker processes and thus supports any language.

However, in Hadoop 0.20.1, the Pipes support (the Java code in package org.apache.hadoop.mapred.pipes) has been marked deprecated. So I guess Hadoop 0.20.1 has not been ported to fully support Pipes. Some forum posts also discuss this issue.

So I decided to use Streaming with C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoking your map and reduce programs with the following shell command:

cat input_file | map_program | sort | reduce_program

Of course, as you know, Hadoop runs these shell pipes in parallel on a computing cluster. For example, a word-count job can be submitted like this:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-file ./word_count_mapper -mapper ./word_count_mapper \
-file ./word_count_reducer -reducer ./word_count_reducer \
-input ./input/*.txt -output

Based on Hadoop Streaming, I wrote a C++ MapReduce wrapper. (More precisely, it should be called a MapReduce implementation, but the code is so simple when built on Hadoop Streaming that I feel embarrassed to call it an “implementation”.) Anyway, I find it interesting that this simple wrapper supports secondary keys, whereas org.apache.hadoop.mapreduce does not yet. 🙂

I have created a Google Code project, Hadoop Streaming MapReduce, to host this simple implementation, and imported the code using the following command line:
svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'

You should be able to check out the code now.