Google Protocol Buffers: Online Parsing of .proto File and Related Data Files

This is a Chinese-to-English translation of my original post on Blogger and WordPress.

Google Protocol Buffers provides a highly efficient mechanism for data serialization and parsing. It is used in almost every Google product. The usual way of using Protocol Buffers is as follows: suppose that our program depends on a limited number of protocol messages. We define these messages in one or more .proto files and translate them into C++ (.pb.h and .pb.cc files) using protoc, the protocol buffer compiler. Then we can use these messages in our C++ programs.

However, there is a less common way to use protocol buffers: suppose we have some data files consisting of records defined by a message in a certain .proto file, and we want to parse and print the content of those data files given only the .proto file. An example is codex, a utility used by Google engineers every day. This utility can parse and print any data file given the .proto file that defines the protocol messages used in the data file. In order to parse the data file, codex must be able to parse the .proto file.

An ad hoc way to build codex is as follows:

  1. Implement the basic functions of codex (e.g., loading a data file and printing its records) in a .cc file.
  2. Invoke protoc to compile the given .proto file, which produces the related .pb.h and .pb.cc files.
  3. Build that .cc file together with the generated .pb.h and .pb.cc files, which yields a codex program specifically tailored to the given .proto file.

This solution is too ad hoc: it creates a separate codex program for every .proto file.

A general solution would be: given all the .proto files in the world, compile them using protoc and build the results into a single codex program. Obviously, this solution is intractable.

Then how can we create codex? The solution is to use some APIs that seem not to have been documented yet. Let us take a look at a code snippet that parses an arbitrary .proto file:

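#include <stdint.h>
#include <stdio.h>

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/tokenizer.h>
#include <google/protobuf/compiler/parser.h>

// The snippets in this post use string, ifstream, cout and vector
// unqualified.  LOG(FATAL) is assumed to come from a glog-style
// logging header, which is not shown here.
using namespace std;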

// Parses the given .proto file and stores the result in
// |file_desc_proto|.  A message descriptor looked up from that result
// can be used with a DynamicMessageFactory in order to create
// prototype messages and mutable messages.  For example:
//   DynamicMessageFactory factory;
//   const Message* prototype_msg = factory.GetPrototype(message_descriptor);
//   Message* mutable_msg = prototype_msg->New();
void GetMessageTypeFromProtoFile(
    const string& proto_filename,
    google::protobuf::FileDescriptorProto* file_desc_proto) {
  using namespace google::protobuf;
  using namespace google::protobuf::io;
  using namespace google::protobuf::compiler;

  FILE* proto_file = fopen(proto_filename.c_str(), "r");
  if (proto_file == NULL) {
    LOG(FATAL) << "Cannot open .proto file: " << proto_filename;
  }

  {
    // The stream reads from proto_file's descriptor, so keep it in an
    // inner scope that ends before the file is closed.
    FileInputStream proto_input_stream(fileno(proto_file));
    Tokenizer tokenizer(&proto_input_stream, NULL);
    Parser parser;
    if (!parser.Parse(&tokenizer, file_desc_proto)) {
      LOG(FATAL) << "Cannot parse .proto file: " << proto_filename;
    }
  }
  fclose(proto_file);

  // Here we work around a bug in protocol buffers: |Parser::Parse|
  // does not set the name (.proto filename) in file_desc_proto.
  if (!file_desc_proto->has_name()) {
    file_desc_proto->set_name(proto_filename);
  }
}

The input to the above function is the name of a .proto file. The output is a FileDescriptorProto instance, which contains the result of parsing the .proto file. Next, we create an instance of a protocol message defined in the .proto file. Given the message instance, we can invoke its member function ParseFromArray or ParseFromString to parse a record in the data file:

// Print the contents of a record file with the following format:
//   { <uint32 record_size> <encoded message> }
// where each encoded message is an instance of the proto message
// named |message_name|, as defined in |file_desc_proto|.
void PrintDataFile(const string& data_filename,
                   const google::protobuf::FileDescriptorProto& file_desc_proto,
                   const string& message_name) {
  const int kMaxReceiveBufferSize = 32 * 1024 * 1024;  // 32MB
  static char buffer[kMaxReceiveBufferSize];

  // Records are binary, so open the file in binary mode.
  ifstream input_stream(data_filename.c_str(), ios::in | ios::binary);
  if (!input_stream.is_open()) {
    LOG(FATAL) << "Cannot open data file: " << data_filename;
  }

  google::protobuf::DescriptorPool pool;
  const google::protobuf::FileDescriptor* file_desc =
      pool.BuildFile(file_desc_proto);
  if (file_desc == NULL) {
    LOG(FATAL) << "Cannot get file descriptor from file descriptor proto: "
               << file_desc_proto.DebugString();
  }

  const google::protobuf::Descriptor* message_desc =
      file_desc->FindMessageTypeByName(message_name);
  if (message_desc == NULL) {
    LOG(FATAL) << "Cannot get message descriptor of message: " << message_name;
  }

  google::protobuf::DynamicMessageFactory factory;
  const google::protobuf::Message* prototype_msg =
      factory.GetPrototype(message_desc);
  if (prototype_msg == NULL) {
    LOG(FATAL) << "Cannot create prototype message from message descriptor";
  }
  google::protobuf::Message* mutable_msg = prototype_msg->New();
  if (mutable_msg == NULL) {
    LOG(FATAL) << "Failed in prototype_msg->New(); to create mutable message";
  }

  uint32_t proto_msg_size;  // A 32-bit unsigned size prefixes each record.
  for (;;) {
    input_stream.read((char*)&proto_msg_size, sizeof(proto_msg_size));
    if (!input_stream) {
      break;  // End of file: no more records.
    }

    if (proto_msg_size > kMaxReceiveBufferSize) {
      LOG(FATAL) << "Failed to read a proto message with size = "
                 << proto_msg_size
                 << ", which is larger than kMaxReceiveBufferSize ("
                 << kMaxReceiveBufferSize << "). "
                 << "You can modify kMaxReceiveBufferSize defined in "
                 << __FILE__;
    }

    input_stream.read(buffer, proto_msg_size);
    if (!input_stream) {
      break;  // The last record is truncated.
    }

    if (!mutable_msg->ParseFromArray(buffer, proto_msg_size)) {
      LOG(FATAL) << "Failed to parse a record in data file: " << data_filename;
    }

    cout << mutable_msg->DebugString();
  }

  delete mutable_msg;
}

The above function requires three inputs:

  1. The name of the data file
  2. The FileDescriptorProto instance created by GetMessageTypeFromProtoFile
  3. The name of the protocol message used to define records in the data file. Note that more than one message could be defined in a .proto file, so we need to know the name of the exact one.

In the above function, given the FileDescriptorProto instance, we use DescriptorPool to get a FileDescriptor instance, from which we use FindMessageTypeByName to get the Descriptor instance describing the message we are interested in. Then we use DynamicMessageFactory to create a prototype message instance according to that Descriptor. Note that the message instance we get is a prototype instance and is immutable. We need to invoke its member function New to create a mutable instance.
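The prototype-then-New idea can be illustrated without the protobuf library. The following is only a library-free analogy: ToyMessage and ToyFactory are hypothetical stand-ins invented for this sketch, not protobuf classes, and they mimic nothing beyond the "const prototype, mutable copy via New" relationship described above.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a message type; NOT the protobuf Message class.
class ToyMessage {
 public:
  explicit ToyMessage(const std::string& type_name) : type_name_(type_name) {}

  // Creates a fresh, empty, mutable message of the same type.
  std::unique_ptr<ToyMessage> New() const {
    return std::unique_ptr<ToyMessage>(new ToyMessage(type_name_));
  }

  // Non-const, so it cannot be called through a const prototype pointer.
  void Set(const std::string& field, const std::string& value) {
    fields_[field] = value;
  }

  std::string DebugString() const {
    std::string s = type_name_ + " {";
    for (const auto& kv : fields_) {
      s += " " + kv.first + ": \"" + kv.second + "\"";
    }
    return s + " }";
  }

 private:
  std::string type_name_;
  std::map<std::string, std::string> fields_;
};

// Hypothetical stand-in for a message factory: it owns one prototype per
// type name and only ever hands it out as a pointer to const.
class ToyFactory {
 public:
  const ToyMessage* GetPrototype(const std::string& type_name) {
    auto it = prototypes_.find(type_name);
    if (it == prototypes_.end()) {
      it = prototypes_.emplace(
          type_name,
          std::unique_ptr<ToyMessage>(new ToyMessage(type_name))).first;
    }
    return it->second.get();
  }

 private:
  std::map<std::string, std::unique_ptr<ToyMessage> > prototypes_;
};
```

With this toy, factory.GetPrototype("KeyValuePair") returns a const pointer that cannot be modified, while the object returned by New() can be filled in freely and leaves the prototype untouched.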

Also notice that ParseFromArray needs to know the size of an encoded message. So, in a data file consisting of a series of records, we need to prefix each record (encoded message) with its size (as a 32-bit integer, for example).
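To make the size-prefixed layout concrete, here is a minimal sketch of writing and reading such records using only the C++ standard library. The payloads are plain byte strings standing in for encoded messages. WriteRecord, ReadRecord and ReadAllRecords are names invented for this sketch, and writing the prefix in host byte order is an assumption rather than a documented file format.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Appends one record to |out|: a 4-byte size prefix followed by the payload.
// The prefix is written in host byte order (an assumption of this sketch),
// so files are only portable between machines with the same endianness.
void WriteRecord(std::ostream& out, const std::string& payload) {
  const uint32_t size = static_cast<uint32_t>(payload.size());
  out.write(reinterpret_cast<const char*>(&size), sizeof(size));
  out.write(payload.data(), size);
}

// Reads the next record into |payload|.  Returns false at end of input,
// when a record is truncated, or when its size exceeds |max_size|.
bool ReadRecord(std::istream& in, std::string* payload,
                uint32_t max_size = 32u * 1024u * 1024u) {
  uint32_t size = 0;
  if (!in.read(reinterpret_cast<char*>(&size), sizeof(size))) return false;
  if (size > max_size) return false;
  payload->resize(size);
  if (size > 0 && !in.read(&(*payload)[0], size)) return false;
  return true;
}

// Reads records from |in| until ReadRecord fails.
std::vector<std::string> ReadAllRecords(std::istream& in) {
  std::vector<std::string> records;
  std::string payload;
  while (ReadRecord(in, &payload)) records.push_back(payload);
  return records;
}
```

For example, writing a few payloads into a std::stringstream with WriteRecord and then calling ReadAllRecords on the same stream returns the payloads in order. In a real reader, each payload would be handed to ParseFromArray or ParseFromString instead of being stored as a string.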

Here we show the main function demonstrating how to invoke GetMessageTypeFromProtoFile and PrintDataFile:

int main(int argc, char** argv) {
  string proto_filename, message_name;
  vector<string> data_filenames;
  google::protobuf::FileDescriptorProto file_desc_proto;

  // ParseCmdLine is a command-line helper whose definition the original
  // code does not include.
  ParseCmdLine(argc, argv, &proto_filename, &message_name, &data_filenames);
  GetMessageTypeFromProtoFile(proto_filename, &file_desc_proto);

  for (size_t i = 0; i < data_filenames.size(); ++i) {
    PrintDataFile(data_filenames[i], file_desc_proto, message_name);
  }
  return 0;
}
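The main function calls ParseCmdLine, whose definition is not part of this post. The following is a hypothetical sketch of what such a helper might look like. The argument layout (.proto file first, then the message name, then one or more data files) and the bool return value are assumptions made for this sketch, not the original implementation.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical command-line layout assumed by this sketch:
//   argv[1]   = .proto file
//   argv[2]   = name of the message that defines the records
//   argv[3..] = one or more data files
// Returns false (after printing a usage line) if arguments are missing.
bool ParseCmdLine(int argc, char** argv,
                  std::string* proto_filename,
                  std::string* message_name,
                  std::vector<std::string>* data_filenames) {
  if (argc < 4) {
    std::fprintf(stderr,
                 "Usage: %s <proto_file> <message_name> <data_file>...\n",
                 argc > 0 ? argv[0] : "codex");
    return false;
  }
  *proto_filename = argv[1];
  *message_name = argv[2];
  // Every remaining argument is treated as a data file name.
  data_filenames->assign(argv + 3, argv + argc);
  return true;
}
```

Because the return value can be ignored, the call in main compiles unchanged. A stricter main would exit with a non-zero status when ParseCmdLine returns false.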