About C++ Trie Implementation

I did a survey on implementations of trie in C++.  However, the result is a little disappointing.

There are two kinds of trie: the regular trie (aka prefix tree), and the Patricia trie (aka radix trie). The difference is that, in contrast with a regular trie, the edges of a Patricia trie are labeled with sequences of characters rather than with single characters.

Using Wikipedia and Google, I found the following C++ implementations of tries, most of them are Patricia tries.

  1. (DENIED) Patricia Trie Template Class on CodeProject
    Unfortunately, the code is in ugly style. (I do have bias on code style of Microsoft, and coincidently, the author’s profile shows that he is a Microsoft employee.) Worse than that, after hours of refactoring, I found the code does not support using std::string as the key! And, even if it supports using numerical types (e.g., int) as the key, the same set of keys won’t result in tries with the same topology on platforms with different endian.
  2. (DENIED) Practical Algorithm Template Library (PATL) – C++ library on PATRICIA trie on Google Code
    The code are in pretty good style, but it relies on Visual C++ syntax and Windows types (__int64). The test code and demos rely on Windows libraries.
  3. (TESTING) Trie in SRILM
    My colleagues suggest me to take a look at the trie implementation (in C++ template) in SRILM, a well-known language model training package.  I successfully extracted trie code from srilm/dstruct/src and make it compilable. But the API is so incomprehensible that I have no idea how to use it.
  4. (DENIED) Trie in GAWK
    I am interested with how GAWK implements the associative array. But by checking array.c, I think I cannot find a reusable trie, or even just a trie.
  5. (NO ITERATOR) C++ Trie Library by Julien Lemoine
    This is a implementation of regular trie, instead of Patricia trie. I successfully re-factorized this implementation in Google code style, and build it using GCC. However it lacks an iterator.
  6. (DENIED) Trie in XORP
    By Googling trie and iterator, I was brought to a document page of XORP, a networking package in C++. There is a trie implementation there, but tailored specifically for IP routining, and not reusable.
  7. (GOOD) The Fast Lookup Trie
    This is an implementation of regular trie. There is an iterator defined. Good code style, STL compatible API, and claimed extremely fast. But the document claims that the implementation is hungry for memory. This need to be checked. Two other notables: (1) trie_map::begin() cannot return a const_iterator; no idea why; (2) the iterator goes with difference order from that of STL map.
  8. (DENIED) GNU STL extension: pb_ds::trie
    This is well written, but no much faster than std::map<std::string, int>. A good property of pb_ds::trie::iterator is that it traverses in the same order as std::map::iterator does.

In order to test the running time cost of std::map and 5, 7, 8, I wrote the following code:

#include <sys/time.h>

#include <iostream>
#include <map>
#include <sstream>
#include <string>

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>

#include "../base/common.hh"
#include "../libtrie/trie.hh"              // libtrie
#include "../jdktrie/trie_map.hh"          // jdktrie

// Return current time in millisecs with 24 hours reset
inline size_t GetTickCount()
{
  struct timeval t;
  gettimeofday ( &t, NULL );
  t.tv_sec %= 8640; // one day ticks 24*60*60
  return ( t.tv_sec*1000 + t.tv_usec/1000 );
}

template <class Accumulator>
size_t AccumulateStringCount(Accumulator& accum) {
  size_t start_time = 0;
  size_t accum_time = 0;
  for (int j = 0; j < 100; ++j) {
    for (int i = 0; i < 10000; ++i) {
      std::ostringstream o;
      o << i;
      start_time = GetTickCount();
      ++accum[o.str()];
      accum_time += GetTickCount() - start_time;
    }
  }
  return accum_time;
}

void testPerformance() {
  std::map<std::string, int> map_accum;
  std::cout << "stl::map costs : "
            << AccumulateStringCount(map_accum)  << " ticks.\n";
  pb_ds::trie<std::string, int> stl_trie_accum;
  std::cout << "pb_ds::trie costs : "
            << AccumulateStringCount(stl_trie_accum)  << " ticks.\n";
  trie::Trie<int> libtrie_accum(-1);
  std::cout << "libtrie costs : "
            << AccumulateStringCount(libtrie_accum)  << " ticks.\n";
  trie_map<std::string, int> jdk_trie_accum;
  std::cout << "jdktrie costs : "
            << AccumulateStringCount(jdk_trie_accum)  << " ticks.\n";
}

int main(int argc, char **argv) {
  testPerformance();
  return 0;
}

Running this program on my iMac for 100 times, the average number of ticks cost by the 4 methods are listed below:

STL map STL trie libtrie jdktrie
923.2900 885.3200 527.8900 366.5100

I am going to add trie in srilm into this table. And I need to compare the memory consumption by these methods.