How to Store Multiple Protocol Messages in a Single File

A few colleagues were discussing this question today. The answer is covered in the Google Protocol Buffers project documentation:

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)

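In other words, to store multiple messages in one file, the programmer has to record where each message starts and ends. Storing the size of each message achieves the same thing.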

One colleague brought up this Google Protocol Buffers API:

bool Message::ParseFromIstream(std::istream* input);

Judging by the name, I also thought it was worth a try, so I wrote the following test program:

#include <fstream>
#include <iostream>
#include "learn-pb.pb.h"

using namespace std;

int main() {
  static const int kCount = 5;

  KeyValuePair kv;
  kv.set_key("hello");
  kv.set_value("word");

  // Write the message count, then kCount serialized messages back to back,
  // with nothing marking where one message ends and the next begins.
  ofstream os("/tmp/a", ios::binary);
  os.write(reinterpret_cast<const char*>(&kCount), sizeof(kCount));
  for (int i = 0; i < kCount; ++i) {
    kv.SerializeToOstream(&os);
  }
  os.close();

  // Read the count back, then try to parse one message per iteration.
  ifstream is("/tmp/a", ios::binary);
  int count = 0;
  is.read(reinterpret_cast<char*>(&count), sizeof(count));
  for (int i = 0; i < count; ++i) {
    cout << "Reading the " << i << "th message from file ... \n";
    kv.Clear();
    if (!kv.ParseFromIstream(&is)) {
      cerr << "Failed in parsing protocol message.\n";
    }
    if (kv.key() != "hello") {
      cerr << "kv.key() = " << kv.key() << ", but real key is hello.\n";
    }
    if (kv.value() != "word") {
      cerr << "kv.value() = " << kv.value() << ", but real value is word.\n";
    }
  }
  is.close();

  return 0;
}

The learn-pb.proto file it uses contains:

message KeyValuePair {
  optional string key = 1;
  optional string value = 2;
}

Running the program produces:

$ ./learn-pb.exe
Reading the 0th message from file ...
Reading the 1th message from file ...
kv.key() = , but real key is hello.
kv.value() = , but real value is word.
Reading the 2th message from file ...
kv.key() = , but real key is hello.
kv.value() = , but real value is word.
Reading the 3th message from file ...
kv.key() = , but real key is hello.
kv.value() = , but real value is word.
Reading the 4th message from file ...
kv.key() = , but real key is hello.
kv.value() = , but real value is word.

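This shows that only the first call to ParseFromIstream returned the expected message; every later call produced a message with an empty key and value. Note that "Failed in parsing protocol message." never appears in the output, so the later calls did not report an error. This is consistent with ParseFromIstream consuming all of the remaining input on the first call, which leaves nothing for the later calls to read.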
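To see why a parser cannot find the boundary between two messages, it may help to look at the bytes. The sketch below hand-encodes KeyValuePair the way the wire format lays out string fields (a tag byte, a length byte, then the raw bytes) and then scans five concatenated copies. It is not the library's parser: EncodeKeyValuePair and ScanFields are hypothetical helpers written for this illustration, and they assume strings shorter than 128 bytes so that each length fits in a single byte.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hand-encodes a KeyValuePair. Each string field is written as a tag byte
// ((field_number << 3) | wire type 2), a length byte, then the raw bytes.
// Assumption: both strings are shorter than 128 bytes.
std::string EncodeKeyValuePair(const std::string& key,
                               const std::string& value) {
  std::string out;
  out += static_cast<char>((1 << 3) | 2);  // field 1 (key)
  out += static_cast<char>(key.size());
  out += key;
  out += static_cast<char>((2 << 3) | 2);  // field 2 (value)
  out += static_cast<char>(value.size());
  out += value;
  return out;
}

// Walks the bytes as a flat list of (field number, payload) pairs.
// Nothing in the bytes says "message ends here", so concatenated
// messages look like one message whose fields simply repeat.
std::vector<std::pair<int, std::string>> ScanFields(const std::string& bytes) {
  std::vector<std::pair<int, std::string>> fields;
  std::size_t pos = 0;
  while (pos + 2 <= bytes.size()) {
    const int field_number = static_cast<unsigned char>(bytes[pos]) >> 3;
    const std::size_t len = static_cast<unsigned char>(bytes[pos + 1]);
    pos += 2;
    if (pos + len > bytes.size()) break;  // truncated input
    fields.emplace_back(field_number, bytes.substr(pos, len));
    pos += len;
  }
  return fields;
}
```

Scanning five concatenated encodings yields ten fields in a row, alternating between field 1 and field 2, with no marker between one message and the next. A parser handed those bytes treats them as a single message in which key and value each appear five times.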

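The correct way to store multiple messages back to back in one file is to write the size of each message (as an integer) before the message itself. When reading, first read that integer, then read that many bytes, and finally call Message::ParseFromArray to parse one message from those bytes. Repeating this process walks through the whole sequence of messages in the file.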
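The framing described above can be sketched as follows. To keep the example compilable without the generated learn-pb.pb.h, it works on already-serialized bytes held in a std::string; the comments mark where SerializeToString and ParseFromArray would be called. WriteRecord, ReadRecord, and RoundTrip are hypothetical names chosen for this sketch, and the 4-byte host-endian length prefix is an assumption: any agreed-upon integer encoding would do.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes one record: a 4-byte length prefix followed by the payload bytes.
// In real use, payload would be filled by kv.SerializeToString(&payload).
// Assumption: the prefix is a host-endian uint32_t, so the file is only
// portable between machines with the same byte order.
bool WriteRecord(std::ostream& os, const std::string& payload) {
  const std::uint32_t size = static_cast<std::uint32_t>(payload.size());
  os.write(reinterpret_cast<const char*>(&size), sizeof(size));
  os.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  return os.good();
}

// Reads one record written by WriteRecord. Returns false at end of stream
// or if the record is truncated.
bool ReadRecord(std::istream& is, std::string* payload) {
  assert(payload != nullptr);
  std::uint32_t size = 0;
  if (!is.read(reinterpret_cast<char*>(&size), sizeof(size))) return false;
  payload->resize(size);
  if (size > 0 && !is.read(&(*payload)[0], size)) return false;
  // In real use: kv.ParseFromArray(payload->data(), payload->size());
  return true;
}

// Round-trips a list of records through an in-memory stream.
std::vector<std::string> RoundTrip(const std::vector<std::string>& records) {
  std::stringstream ss;
  for (const std::string& r : records) WriteRecord(ss, r);
  std::vector<std::string> out;
  std::string payload;
  while (ReadRecord(ss, &payload)) out.push_back(payload);
  return out;
}
```

With this framing the message count at the start of the file becomes optional, because the reader can simply loop until ReadRecord returns false. A reader that does not trust the file should also cap the size it is willing to allocate before calling resize.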