Hadoop has two mechanisms to support using languages other than Java:
- Hadoop Pipes, which provides a pair of C++ libraries and therefore supports Hadoop programs written in C/C++ only, and
- Hadoop Streaming, which launches arbitrary executable files in the map/reduce worker processes, and thus supports any language.
So I decided to use Streaming with C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoking your map and reduce programs with the following shell command:
cat input_file | map_program | sort | reduce_program
Of course, as you know, Hadoop runs this shell pipeline in parallel on a computing cluster. The job is submitted with a command like this:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-file ./word_count_mapper -mapper ./word_count_mapper \
-file ./word_count_reducer -reducer ./word_count_reducer \
-input ./input/*.txt -output
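To make the command above more concrete, here is a minimal sketch of what the word_count_mapper and word_count_reducer programs could look like in C++. It is not the code from my wrapper; the function names run_mapper and run_reducer are made up for illustration. It relies only on the Streaming convention that each line is a key and a value separated by a tab, and that the reducer sees its input sorted by key.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Mapper sketch: emit "word<TAB>1" for every whitespace-separated token.
// (run_mapper is a hypothetical name, not part of any Hadoop API.)
void run_mapper(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word) {
            out << word << '\t' << 1 << '\n';
        }
    }
}

// Reducer sketch: the input is sorted by key, so lines with the same key
// are adjacent and can be summed in a single pass.
// (run_reducer is a hypothetical name as well.)
void run_reducer(std::istream& in, std::ostream& out) {
    std::string line;
    std::string current;
    long total = 0;
    bool have_key = false;
    while (std::getline(in, line)) {
        std::size_t tab = line.find('\t');
        if (tab == std::string::npos) continue;  // skip lines without a tab
        std::string key = line.substr(0, tab);
        long count = std::stol(line.substr(tab + 1));
        if (have_key && key != current) {
            // The key changed: flush the total of the previous key.
            out << current << '\t' << total << '\n';
            total = 0;
        }
        current = key;
        have_key = true;
        total += count;
    }
    if (have_key) out << current << '\t' << total << '\n';
}
```

Each executable would then be a one-line main that calls run_mapper(std::cin, std::cout) or run_reducer(std::cin, std::cout), so the same two binaries work both in the local shell pipeline and under Streaming.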
Based on Hadoop Streaming, I wrote a C++ MapReduce wrapper (more precisely, it should be called a MapReduce implementation, but the code is so simple when built on Hadoop Streaming that I feel embarrassed to call it an “implementation”). Anyway, I find it interesting that this simple wrapper supports secondary keys, whereas org.apache.hadoop.mapreduce does not yet. 🙂
svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'
So you should be able to check out the code now.