I Was Wrong: Hadoop Pipes Is (In-)Compatible with Protocol Buffers

A few weeks ago, I published a post claiming that Hadoop Pipes is incompatible with Google Protocol Buffers. I was wrong. Thanks to Christopher Smith, who kindly pointed out my mistake and posted the solution to my problem: the mapper should pass the value as a std::string built with an explicit length, so it should be defined as:

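void map(HadoopPipes::MapContext& context) {
  const char valueString[] = "apple\norange\0banana\tpapaya";
  // sizeof counts the trailing terminator, so subtract 1 to get the 26 payload
  // bytes; the explicit length keeps the embedded '\0' from truncating the value.
  context.emit("", std::string(valueString, sizeof(valueString) - 1));
}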
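The difference comes down to which std::string constructor is used. Here is a minimal standalone sketch of that, independent of Hadoop Pipes; the names kValue, lengthViaCString and lengthViaExplicitSize are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// The same 26-byte payload used in the mapper, with an embedded '\0'.
static const char kValue[] = "apple\norange\0banana\tpapaya";

// The array decays to const char*, and std::string(const char*) scans for the
// first '\0', so everything after "orange" is dropped.
std::size_t lengthViaCString() {
  return std::string(kValue).size();
}

// Passing the byte count explicitly keeps the embedded '\0' and what follows.
// sizeof(kValue) includes the trailing terminator, hence the "- 1".
std::size_t lengthViaExplicitSize() {
  return std::string(kValue, sizeof(kValue) - 1).size();
}
```

With these definitions, lengthViaCString() returns 12 and lengthViaExplicitSize() returns 26, which are exactly the two numbers discussed in the original post.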

My original post is as follows:

I just found another reason that I do not like Hadoop Pipes: I cannot use a serialized Google protocol buffer as a map output key or value.

For those of you scratching your heads over weird bugs in Hadoop Pipes programs that use Google protocol buffers, please have a look at the following sample program:

#include <string>
#include <hadoop/Pipes.hh>
#include <hadoop/TemplateFactory.hh>
#include <hadoop/StringUtils.hh>

using namespace std;

class LearnMapOutputMapper: public HadoopPipes::Mapper {
public:
  LearnMapOutputMapper(HadoopPipes::TaskContext& context){}
  void map(HadoopPipes::MapContext& context) {
    context.emit("", "apple\norange\0banana\tpapaya");
  }
};

class LearnMapOutputReducer: public HadoopPipes::Reducer {
public:
  LearnMapOutputReducer(HadoopPipes::TaskContext& context){}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue()) {
      string value = context.getInputValue(); // Copy content
      context.emit(context.getInputKey(), HadoopUtils::toString(value.size()));
    }
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<LearnMapOutputMapper,
                              LearnMapOutputReducer>());
}

The reducer outputs the size of each map output value, which contains special characters: a newline, a null character and a tab. If Hadoop Pipes allowed such special characters, the reducer would output 26, the length of the string

"apple\norange\0banana\tpapaya".

Unfortunately, we see 12 in the output, which is the length of the string

"apple\norange".

This shows that map outputs in Hadoop Pipes cannot contain the null character, which may well appear in a serialized protocol buffer, as explained in the description of the protocol buffers encoding scheme at:
http://code.google.com/apis/protocolbuffers/docs/encoding.html
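To make the point about zero bytes concrete, here is a rough hand-rolled sketch of how a single fixed32 field is laid out on the wire (a tag byte followed by four little-endian bytes). It does not use the protocol buffers library, and the helper names encodeFixed32Field and lengthAfterCStringRoundTrip are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: hand-encodes one fixed32 field as a tag byte followed
// by 4 little-endian value bytes. Only meant for field numbers 1..15, whose
// tag fits in a single byte.
std::string encodeFixed32Field(std::uint32_t fieldNumber, std::uint32_t value) {
  std::string out;
  const std::uint32_t kWireTypeFixed32 = 5;
  out.push_back(static_cast<char>((fieldNumber << 3) | kWireTypeFixed32));
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<char>((value >> shift) & 0xFF));
  }
  return out;
}

// Number of bytes that survive when the data is treated as a C string.
std::size_t lengthAfterCStringRoundTrip(const std::string& bytes) {
  return std::string(bytes.c_str()).size();
}
```

Encoding field 1 with value 1 gives the five bytes 0D 01 00 00 00, and only the first two of them survive a trip through a null-terminated C string.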
I hate Hadoop Pipes, a totally incomplete but released MapReduce API.