Google Protocol Buffers: Online Parsing of .proto File and Related Data Files

This is a Chinese-to-English translation of my original post on Blogger and WordPress.

Google Protocol Buffers provides a highly efficient mechanism for data serialization and parsing, and it is used in almost every Google product. The usual way of using Protocol Buffers is as follows. Suppose that our program depends on a limited number of protocol messages. We define these messages in one or more .proto files and translate them into C++ (.pb.h and .pb.cc files) using protoc, the protocol buffer compiler. Then we can use these messages in our C++ programs.

However, there is a less common way to use protocol buffers. Suppose we have some data files consisting of records defined by a message in a certain .proto file, and we want to parse and print the data file content given only the .proto file. An example is codex, a utility used by Google engineers every day. It can parse and print any data file, given the .proto file that defines the protocol messages used in that data file. In order to parse the data file, codex must first be able to parse the .proto file.

An ad hoc way to build codex is as follows:

  1. Implement the basic functions of codex (e.g., loading the data file and printing its records) in a .cc file.
  2. Invoke protoc to compile the given .proto file, which produces the related .pb.h and .pb.cc files.
  3. Build that .cc file together with the .pb.h and .pb.cc files, which produces a codex program tailored to the given .proto file.

This solution is very ad hoc: it creates a separate codex program for every .proto file.

A general solution would be to take all the .proto files in the world, compile them using protoc, and build them together with the codex source file. Obviously, this is intractable.

Then how can we create codex? The solution is to use some APIs that do not seem to have been documented yet. Let us take a look at a code snippet that parses an arbitrary .proto file:

#include <google/protobuf/descriptor.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/tokenizer.h>
#include <google/protobuf/compiler/parser.h>

// Parses the given .proto file into a FileDescriptorProto.  The
// message descriptors built from it can be used with a
// DynamicMessageFactory in order to create prototype messages and
// mutable messages.  For example:
//   DynamicMessageFactory factory;
//   const Message* prototype_msg = factory.GetPrototype(message_descriptor);
//   Message* mutable_msg = prototype_msg->New();
void GetMessageTypeFromProtoFile(const string& proto_filename,
                                 FileDescriptorProto* file_desc_proto) {
  using namespace google::protobuf;
  using namespace google::protobuf::io;
  using namespace google::protobuf::compiler;

  FILE* proto_file = fopen(proto_filename.c_str(), "r");
  {
    if (proto_file == NULL) {
      LOG(FATAL) << "Cannot open .proto file: " << proto_filename;
    }

    // The stream and tokenizer live in this scope so that they are
    // destroyed before the file is closed.
    FileInputStream proto_input_stream(fileno(proto_file));
    Tokenizer tokenizer(&proto_input_stream, NULL);
    Parser parser;
    if (!parser.Parse(&tokenizer, file_desc_proto)) {
      LOG(FATAL) << "Cannot parse .proto file: " << proto_filename;
    }
  }
  fclose(proto_file);

  // Here we work around a bug in protocol buffers that
  // |Parser::Parse| does not set name (.proto filename) in
  // file_desc_proto.
  if (!file_desc_proto->has_name()) {
    file_desc_proto->set_name(proto_filename);
  }
}

The input to the above function is the name of a .proto file. The output is a FileDescriptorProto instance, which contains the result of parsing the .proto file. Next we create an instance of a protocol message defined in the .proto file. Given the message instance, we can invoke its member function ParseFromArray or ParseFromString to parse a record in the data file:

// Print contents of a record file with the following format:
//   { <uint32 record_size> <encoded message> }
// where each encoded message is parsed into the proto message named
// |message_name|, as defined in |file_desc_proto|.
void PrintDataFile(const string& data_filename,
                   const FileDescriptorProto& file_desc_proto,
                   const string& message_name) {
  const int kMaxReceiveBufferSize = 32 * 1024 * 1024;  // 32MB
  static char buffer[kMaxReceiveBufferSize];

  ifstream input_stream(data_filename.c_str());
  if (!input_stream.is_open()) {
    LOG(FATAL) << "Cannot open data file: " << data_filename;
  }

  google::protobuf::DescriptorPool pool;
  const google::protobuf::FileDescriptor* file_desc =
      pool.BuildFile(file_desc_proto);
  if (file_desc == NULL) {
    LOG(FATAL) << "Cannot get file descriptor from file descriptor proto: "
               << file_desc_proto.DebugString();
  }

  const google::protobuf::Descriptor* message_desc =
      file_desc->FindMessageTypeByName(message_name);
  if (message_desc == NULL) {
    LOG(FATAL) << "Cannot get message descriptor of message: " << message_name;
  }

  google::protobuf::DynamicMessageFactory factory;
  const google::protobuf::Message* prototype_msg =
      factory.GetPrototype(message_desc);
  if (prototype_msg == NULL) {
    LOG(FATAL) << "Cannot create prototype message from message descriptor";
  }
  google::protobuf::Message* mutable_msg = prototype_msg->New();
  if (mutable_msg == NULL) {
    LOG(FATAL) << "Failed in prototype_msg->New(); to create mutable message";
  }

  uint32 proto_msg_size;  // uint32 is the type used in record files.
  for (;;) {
    input_stream.read((char*)&proto_msg_size, sizeof(proto_msg_size));
    if (!input_stream)
      break;  // No more records.

    if (proto_msg_size > kMaxReceiveBufferSize) {
      LOG(FATAL) << "Failed to read a proto message with size = "
                 << proto_msg_size
                 << ", which is larger than kMaxReceiveBufferSize ("
                 << kMaxReceiveBufferSize << "). "
                 << "You can modify kMaxReceiveBufferSize defined in "
                 << __FILE__;
    }

    input_stream.read(buffer, proto_msg_size);
    if (!input_stream)
      break;  // Truncated record at the end of the file.

    if (!mutable_msg->ParseFromArray(buffer, proto_msg_size)) {
      LOG(FATAL) << "Failed to parse a record of size " << proto_msg_size;
    }

    cout << mutable_msg->DebugString();
  }

  delete mutable_msg;
}

The above function requires three inputs:

  1. The name of the data file
  2. The FileDescriptorProto instance created by GetMessageTypeFromProtoFile
  3. The name of the protocol message used to define records in the data file. Note that more than one message can be defined in a .proto file, so we need to know the name of the exact one.

In the above function, given the FileDescriptorProto instance, we use DescriptorPool to get a FileDescriptor instance, from which we use FindMessageTypeByName to get the Descriptor instance describing the message we are interested in. Then we use DynamicMessageFactory to create a prototype message instance according to that Descriptor. Note that the message instance we get is a prototype and is immutable. We need to invoke its member function New to create a mutable instance.

Also notice that ParseFromArray needs to know the size of an encoded message. So, in a data file consisting of a series of records, we need to prefix each record (encoded message) with its size (for example, as a uint32, which is what the code above reads).
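The size-prefixed layout can be sketched with the standard library alone. The helper names below (WriteRecord, ReadRecord, ReadAllRecords) are hypothetical and are not part of the original program; the payload stands in for an encoded message, and the size is written in host byte order, which is an assumption that matches reading it directly into a uint32 variable.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Appends one record: a 4-byte size in host byte order, then the payload.
void WriteRecord(std::ostream& out, const std::string& payload) {
  assert(payload.size() <= std::numeric_limits<uint32_t>::max());
  const uint32_t size = static_cast<uint32_t>(payload.size());
  out.write(reinterpret_cast<const char*>(&size), sizeof(size));
  out.write(payload.data(), size);
}

// Reads the next record into *payload.  Returns false at end of stream or
// when the stream ends in the middle of a record.
bool ReadRecord(std::istream& in, std::string* payload) {
  uint32_t size = 0;
  if (!in.read(reinterpret_cast<char*>(&size), sizeof(size))) return false;
  payload->resize(size);
  if (size > 0 && !in.read(&(*payload)[0], size)) return false;
  return true;
}

// Convenience helper: splits a whole stream into its records.
std::vector<std::string> ReadAllRecords(std::istream& in) {
  std::vector<std::string> records;
  std::string payload;
  while (ReadRecord(in, &payload)) records.push_back(payload);
  return records;
}
```

Unlike PrintDataFile, this sketch does not cap the record size, so a corrupted size field could trigger a very large allocation. A real reader would likely keep a limit such as the 32MB buffer used above.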

Here we show the main function demonstrating how to invoke GetMessageTypeFromProtoFile and PrintDataFile:

int main(int argc, char** argv) {
  string proto_filename, message_name;
  vector<string> data_filenames;
  FileDescriptorProto file_desc_proto;

  ParseCmdLine(argc, argv, &proto_filename, &message_name, &data_filenames);
  GetMessageTypeFromProtoFile(proto_filename, &file_desc_proto);

  for (size_t i = 0; i < data_filenames.size(); ++i) {
    PrintDataFile(data_filenames[i], file_desc_proto, message_name);
  }
  return 0;
}
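
The main function calls ParseCmdLine, which is not shown in this post. A minimal sketch of such a helper might look as follows, assuming a command line of the form `codex <proto_file> <message_name> <data_file>...`. Both the argument order and the bool return value are assumptions made for illustration, not the original implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical helper matching the call in main().  Assumed usage:
//   codex <proto_file> <message_name> <data_file>...
// Returns false (leaving the outputs untouched) if fewer than one data
// file is given.
bool ParseCmdLine(int argc, char** argv,
                  std::string* proto_filename,
                  std::string* message_name,
                  std::vector<std::string>* data_filenames) {
  if (argc < 4) return false;
  *proto_filename = argv[1];
  *message_name = argv[2];
  // Every remaining argument is treated as a data file name.
  data_filenames->assign(argv + 3, argv + argc);
  return true;
}
```

A caller would typically print a usage message and exit when the function returns false.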
20 Responses to Google Protocol Buffers: Online Parsing of .proto File and Related Data Files

  1. [...] Google Protocol Buffers: Online Parsing of .proto File and Related Data Files June 2010 [...]

  2. John Nemo says:

    A very useful post, showing an interesting way in which Protocol Buffers can be used. Thank you.

  3. Quora says:

    Has anyone implemented a parser for .proto files in Java?…

    Google code issue #263 [1] tracks this request for the official project; discussion on the topic arose on issue #247 [2], where Kenton Varda, the maintainer of Protocol Buffers, commented that > After years of arguing against a Java port of the parser,…
