Encoding and Decoding of the Varint32 Type Defined by Google Protocol Buffers

In this earlier post, we talked about saving multiple protocol messages in a file (or in a network communication stream). The key point is that we need to record the size of each message just before the message itself. An easy solution is to format the file as { message_size, message_body }*. The C++ code in that earlier post shows how to save message_size as a raw int. However, this is far from a perfect solution, for two reasons:

  1. There is a big-endian versus little-endian difference in how an int is laid out on different platforms.  It is true that most computers nowadays use Intel CPUs with the little-endian byte order, but we developers do not want to make an Intel CPU a requirement.  (A small demonstration of the byte-order problem follows this list.)
  2. Most messages are small and do not need 4 bytes to store their lengths.  In fact, the Google Protocol Buffers documentation suggests that a message should be no longer than 1 MB, and the Google Protocol Buffers source code defaults the maximum message size to 64 MB.
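
To make the first problem concrete, here is a small demonstration (my own sketch, not from the earlier post) that prints the bytes a raw 4-byte size prefix would occupy in the file. A little-endian machine prints 04 03 02 01, while a big-endian machine would print 01 02 03 04, so a writer and a reader on different platforms would disagree about the size:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  uint32_t message_size = 0x01020304;
  unsigned char bytes[sizeof(message_size)];
  // These are exactly the bytes that fwrite(&message_size, 4, 1, file)
  // would put into the file.
  memcpy(bytes, &message_size, sizeof(message_size));
  for (int i = 0; i < 4; ++i) printf("%02x ", bytes[i]);
  printf("\n");
  return 0;
}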

If we are to solve only the first problem, we can adopt the network byte order through the old and familiar network API:

#include <arpa/inet.h>
uint32_t htonl(uint32_t host_long);
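
With htonl and its inverse ntohl, the length prefix can be written in a platform-independent order. The following is a minimal sketch of such framing (my own illustration; the function names WriteRecord and ReadRecord, and the use of a pre-serialized std::string body, are assumptions, not code from the original post):

#include <arpa/inet.h>
#include <cstdint>
#include <cstdio>
#include <string>

// Writes one { message_size, message_body } record with a fixed 4-byte,
// network-byte-order size prefix.
void WriteRecord(FILE* file, const std::string& body) {
  uint32_t size = htonl(static_cast<uint32_t>(body.size()));
  fwrite(&size, sizeof(size), 1, file);
  fwrite(body.data(), 1, body.size(), file);
}

// Reads the 4-byte prefix, converts it back to host byte order, then reads
// exactly that many bytes of message body.
bool ReadRecord(FILE* file, std::string* body) {
  uint32_t size;
  if (fread(&size, sizeof(size), 1, file) != 1) return false;
  size = ntohl(size);
  body->resize(size);
  return fread(&(*body)[0], 1, size, file) == size;
}

Note that this still spends a full 4 bytes on every prefix, which is exactly the second problem above.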

If we are to solve both problems, a solution is to use the varint32 encoding provided by Google Protocol Buffers. In a .proto file, fields declared as int32, uint32, and similar types are encoded as varints on the wire: a varint stores an integer in groups of 7 bits, least-significant group first, and the high bit of each byte signals whether more bytes follow. Small values therefore take a single byte; for example, 300 is encoded as the two bytes 0xAC 0x02. In the rest of this post, we introduce how to encode a C/C++ uint32 value as a varint32, and how to decode it. As usual, let us start from a code snippet:

#include <google/protobuf/io/coded_stream.h>
#include "../base/common.hh"  // provides CHECK_EQ

using namespace std;
using namespace google::protobuf;      // for the uint8/uint32/uint64 typedefs
using namespace google::protobuf::io;  // for CodedOutputStream

inline const uint8* ReadVarint32FromArray(const uint8* buffer, uint32* value) {
  static const int kMaxVarintBytes = 10;
  static const int kMaxVarint32Bytes = 5;

  // Fast path:  We have enough bytes left in the buffer to guarantee that
  // this read won't cross the end, so we can skip the checks.
  const uint8* ptr = buffer;
  uint32 b;
  uint32 result;

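  // Each byte carries 7 payload bits; the high bit (0x80) marks that at
  // least one more byte follows.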
  b = *(ptr++); result  = (b & 0x7F)      ; if (!(b & 0x80)) goto done;
  b = *(ptr++); result |= (b & 0x7F) <<  7; if (!(b & 0x80)) goto done;
  b = *(ptr++); result |= (b & 0x7F) << 14; if (!(b & 0x80)) goto done;
  b = *(ptr++); result |= (b & 0x7F) << 21; if (!(b & 0x80)) goto done;
  b = *(ptr++); result |=  b         << 28; if (!(b & 0x80)) goto done;

  // If the input is larger than 32 bits, we still need to read it all
  // and discard the high-order bits.
  for (int i = 0; i < kMaxVarintBytes - kMaxVarint32Bytes; i++) {
    b = *(ptr++); if (!(b & 0x80)) goto done;
  }

  // We have overrun the maximum size of a varint (10 bytes).  Assume
  // the data is corrupt.
  return NULL;

 done:
  *value = result;
  return ptr;
}

int main() {
  static uint8 buffer[5];  // a varint32 occupies at most 5 bytes, not 4
  // A 64-bit loop counter lets the loop cover 0xffffffff itself without
  // wrapping around.
  for (uint64 i = 0; i <= 0xffffffff; ++i) {
    uint32 original = static_cast<uint32>(i);
    CodedOutputStream::WriteVarint32ToArray(original, buffer);
    uint32 value;
    ReadVarint32FromArray(buffer, &value);
    CHECK_EQ(original, value);
  }
  return 0;
}

If you compile and run this program, it outputs nothing. That is the expected behavior: the program says something only if its encoding and decoding subroutines disagree.

For each uint32 value from 0 to 0xffffffff, this program encodes it as a varint32, decodes the varint32 back, and checks the decoded value against the original one. The snippet makes it clear that CodedOutputStream::WriteVarint32ToArray does the encoding work, while the decoding is done by a hand-written function, ReadVarint32FromArray. In fact, I did not write this function; I took it from the Google Protocol Buffers source file coded_stream.cc.
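
To connect this back to the motivating format { message_size, message_body }*, here is a minimal sketch, written by me rather than taken from the post, of how the CodedOutputStream/CodedInputStream API can write and read varint32-delimited messages. The function names WriteDelimitedTo and ReadDelimitedFrom are my own; MessageLite is the protobuf base class of generated message types:

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/message_lite.h>

using namespace google::protobuf;
using namespace google::protobuf::io;

bool WriteDelimitedTo(const MessageLite& message, CodedOutputStream* out) {
  out->WriteVarint32(message.ByteSize());      // the varint32 size prefix
  return message.SerializeToCodedStream(out);  // then the message body
}

bool ReadDelimitedFrom(CodedInputStream* in, MessageLite* message) {
  uint32 size;
  if (!in->ReadVarint32(&size)) return false;  // decode the size prefix
  // Limit parsing to exactly `size` bytes so the parser does not run into
  // the next message in the stream.
  CodedInputStream::Limit limit = in->PushLimit(size);
  bool ok = message->ParseFromCodedStream(in) && in->ConsumedEntireMessage();
  in->PopLimit(limit);
  return ok;
}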

P.S. The Google Protocol Buffers documentation explains how varints are encoded and decoded.