Container Study

In this post, I’m going to present results from some data structure experiments I did. I also briefly discuss a new search structure I came up with for integer keys.

Test setup

I’ve constructed a toy benchmark involving collections of “Thingies” which are stored in containers, and indexed by an integer key. We are repeatedly iterating over our set of thingies, looking for a particular key, and, for each item found, we’re keeping a running sum over the value of another, arbitrary field. The purpose of the sum is to keep the sneaky compiler from optimizing away our searches. The layout of the thingie is shown below:

struct Thingie
{
    uint32 key;
    char junk[JUNK_SIZE];
    uint32 sum_field;
};

The JUNK_SIZE can be varied in order to create different memory layouts. I can also generate random queries with a given miss rate, where “miss” means “not found”.
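For reference, the query loop for the vector case looks roughly like the sketch below. This is a simplification, not the exact benchmark code; the names RunQueries, thingies, and queries are just for illustration, and the uint32 typedef is assumed to match the one used above.

#include <vector>
#include <cstdint>

typedef std::uint32_t uint32; // assumed to match the post's typedef

// Linear search in a vector, accumulating sum_field on every hit so the
// compiler cannot optimize the searches away.
uint32 RunQueries( const std::vector<Thingie>& thingies,
                   const std::vector<uint32>& queries )
{
    uint32 sum = 0;
    for( uint32 key : queries )
    {
        for( const Thingie& t : thingies )
        {
            if( t.key == key )
            {
                sum += t.sum_field;
                break;
            }
        }
    }
    return sum;
}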

You can download the code here.

All experiments are done using MSVC++ 2013 Express Edition, compiling to x64. Exact compiler settings can be found by examining the project file. My laptop is a Core i3-4010U with a peak frequency of 1.7GHz. I am using one core. The cache hierarchy for this CPU is: L1: 64K (32K I$, 32K D$, thanks Fabian), L2: 256K, L3: 3M (shared with the other core).

All graphs are showing relative running times. Lower is better.

I would welcome any additional data that other people want to provide, and will update this post if I receive any interesting findings from others.

Brief Disclaimer On Key Choice

You might object to my choice of an integer key, but I stand by it. Testing data structures using string keys is a red herring. After all, we’re concerned about performance here, so why would we index our data using something that is expensive to compare when we can just as well use something cheap? In an application with string keys (a file cache, say), we can simply hash the filenames, and fall back to a single string comparison in the event of a hit in case we’re worried about collisions. We don’t want the cost of the key comparison to obscure the inherent cost of the data structure itself. If your reply to this point contains the words “maintenance,” “readability,” or “premature optimization,” then you and I do not have the same priorities.
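To make the hash-then-verify idea concrete, here is a minimal sketch (not from the benchmark code; CacheEntry and Matches are illustrative names):

#include <string>
#include <cstddef>

// Index by a cheap integer hash of the filename; only compare the string
// itself on a hash hit, to guard against collisions.
struct CacheEntry
{
    std::size_t hash;       // precomputed std::hash<std::string> of the filename
    std::string filename;   // kept only for the collision check
    // ... cached data ...
};

bool Matches( const CacheEntry& e, std::size_t queryHash, const std::string& queryName )
{
    // Cheap integer compare first; at most one string compare, and only on a hit.
    return e.hash == queryHash && e.filename == queryName;
}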

Test 1:  When Is Linear Search Good Enough?

The first thing I wanted to do was locate the crossover point at which linear search in a vector becomes worse than the sublinear search offered by the tree containers. The location of the crossover point depends on how frequently we will find the thing we’re looking for. For searches that always find something, it turns out that the crossover point is somewhere around 200 elements. If there are misses, however, the tree containers are at an advantage, since the linear search must touch every vector element. In this case the crossover drops to around 20.

[Figure: relative running times, JUNK_SIZE=12, 0% miss rate]

[Figure: relative running times, JUNK_SIZE=12, 50% miss rate]

[Figure: relative running times, JUNK_SIZE=264, 0% miss rate]

[Figure: relative running times, JUNK_SIZE=264, 50% miss rate]

From this, we can see that linear search is actually not that bad for a lot of cases. If you have a relatively small number of things in your collection, a vector with linear search is reasonable. Naturally, there is that inevitable performance cliff as N gets large, but the superior asymptotic performance of set/map takes longer to materialize than you might have guessed.

One interesting note about std::set: the reason std::set gets trounced so heavily for large structures is that its find() method does not allow me to substitute a different type for comparison. If I could write find(key), I would be fine. But instead, I need to write find(Thingie(key)), and construction of a temporary Thingie causes measurable overhead.
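For the curious, the problem looks roughly like this (a sketch; the comparator is an assumed stand-in for the one in the benchmark, and Thingie(key) assumes the key-only constructor mentioned above):

#include <set>

// Comparator that orders Thingies by key.
struct ThingieKeyLess
{
    bool operator()( const Thingie& a, const Thingie& b ) const
    {
        return a.key < b.key;
    }
};

const Thingie* FindByKey( const std::set<Thingie, ThingieKeyLess>& things, uint32 key )
{
    // find() only accepts the container's value type, so the lookup key has to
    // be wrapped in a full temporary Thingie, junk bytes and all:
    auto it = things.find( Thingie( key ) );
    return ( it != things.end() ) ? &*it : nullptr;
}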

Test 2:  Building a Better Map

For the next experiment, I wanted to show just how bad the std::map container is. There may exist applications in which the red-black tree is the ideal choice, due to its superior amortized performance in the presence of insertions and deletions. However, for pure search speed, it seems to be quite horrible. It’s great if you don’t want to bother doing anything else, but if we take a little bit more care, it’s possible to do considerably better. Here I compare std::map to a “sortedmap” alternative, inspired by Mike Acton’s dictionary lookup example from this talk.

The idea is that instead of sticking the key and value next to one another, we separate them. We’ll stick the keys into a sorted array that we’ll do binary search on. Once we’ve found a given key, then, and only then, will we fetch the values. Striping the structure in this way ensures that the keys are more likely to be in cache, since they are no longer surrounded by unneeded ‘junk’ bytes.
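A minimal sketch of the idea (simplified; not the exact class in the downloadable code, and it assumes all insertions happen before any queries):

#include <vector>
#include <algorithm>

// "sortedmap" sketch: keys and values live in parallel arrays, so the binary
// search only touches the densely packed key array.
class SortedMap
{
public:
    void Insert( uint32 key, const Thingie& value )
    {
        // Assumes keys arrive in sorted order; a Sort() pass (not shown) would
        // restore the invariant otherwise.
        m_keys.push_back( key );
        m_values.push_back( value );
    }

    const Thingie* Find( uint32 key ) const
    {
        auto it = std::lower_bound( m_keys.begin(), m_keys.end(), key );
        if( it != m_keys.end() && *it == key )
            return &m_values[ it - m_keys.begin() ];
        return nullptr; // miss: the value array was never touched
    }

private:
    std::vector<uint32>  m_keys;   // sorted, cache-friendly
    std::vector<Thingie> m_values; // parallel array, read only on a hit
};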

I also tried a “split map”, in which a (uint, uint) std::map takes the place of the sorted key vector. As expected, this helps a little bit, but the ‘SortedMap’ solution is superior. The red-black tree is just way too mean to the cache. Here are the results:

[Figure: relative running times, std::map vs. split map vs. SortedMap]

The need to maintain a sorted array does mean that this data structure is more specialized than a map, and will suffer quite a bit if insertion and deletion is interleaved with queries. A red-black tree might come out ahead for an application where insert/delete/lookup are evenly mixed, but if search speed is the primary concern, we can do much better by specializing.

Test 3: Unordered Map

When I started out on this post, I forgot that there’s a standardized hash table now. Let’s toss that into the mix… Naturally, the results are quite good.

[Figure: relative running times with std::unordered_map included]

Even here, though, there is a little room for improvement. The unordered_map exposes a key/value pair in its interface, which basically requires that the key and value be stored adjacent to one another. It turns out that this is not always the ideal choice. If our objects are large, and our expected miss rate is high, we can use the same key/value separation trick to eke out a tiny bit more performance.
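A sketch of that split variant (the names are mine; it is the same separation trick, this time with an unordered_map as the index):

#include <unordered_map>
#include <vector>

// The hash table holds only key -> index, so probing for a missing key never
// drags Thingie junk bytes into cache.
struct SplitUnorderedMap
{
    std::unordered_map<uint32, uint32> index;  // key -> position in 'values'
    std::vector<Thingie>               values;

    void Insert( const Thingie& t )
    {
        index[ t.key ] = (uint32) values.size();
        values.push_back( t );
    }

    const Thingie* Find( uint32 key ) const
    {
        auto it = index.find( key );
        return ( it != index.end() ) ? &values[ it->second ] : nullptr;
    }
};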

Still, all things considered, std::unordered_map seems pretty powerful. Can we do better?

Test 4:  Rolling Our Own = Total Domination

Not to be outdone, I decided to take a stab at constructing my own specialized container structure for 32-bit integer keys. There are those who would have you believe that it is futile to roll your own data structures.  Nonsense!  It requires considerably more effort to do so, but there are some serious advantages to be had.

For my data structure, I decided to index the key space using a 4-deep 256-ary tree. At our root node, we’ll examine the most significant byte of a given key and choose the child corresponding to that byte. All keys whose MSB is i will go in the ith child. We’ll repeat this until we get down to the least significant byte, at which point the leaf nodes will store a pointer to the value with the corresponding key. In principle, we can use any width we like without restricting ourselves to bytes, so I decided to call this a “NibbleTree” instead of a “ByteTree”. Besides, NibbleTree is a more comical name. I came up with this independently, but it’s quite likely that somebody, somewhere has invented it already.

Update: This thing is also known as a Radix Tree.  Thanks again Fabian 🙂

It has good theoretical characteristics: O(1) insert and lookup, since the depth is fixed at four levels for a 32-bit key. If I’d bothered to implement delete, that would be O(1) as well. It can also support sorted iteration, if we like, though it will be a tad slower than an array.
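Here is a rough sketch of the lookup path. The node layout is illustrative and simplified from what the code actually does; the point is just that each level consumes one byte of the key, most significant byte first:

// 256-ary radix tree node. Interior levels store Node*, the last level stores
// Thingie* directly; a null slot means "no keys with this prefix".
struct Node
{
    void* children[256];
};

const Thingie* Find( const Node* root, uint32 key )
{
    const Node* n = root;
    for( int shift = 24; shift > 0; shift -= 8 )
    {
        n = (const Node*) n->children[ ( key >> shift ) & 0xff ];
        if( !n )
            return nullptr;  // no subtree for this byte: a miss
    }
    // Fourth level: the slot holds the value pointer itself.
    return (const Thingie*) n->children[ key & 0xff ];
}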

[Figure: relative running times, map vs. NibbleTree]

In my benchmarks, it’s fast. It’s really, really, surprisingly fast. Unlike the unordered_map, there is no hashing and no probing, just memory reads and some bit manipulation. Unlike the trees, we only need to chase a maximum of four pointers, and the ones near the top will cache really, really well.

The only drawback is size. 256 pointers is 2KB of memory, and we can end up needing a fair number of these. However, the nice thing is that, at least for the uniformly distributed keys I’m testing with, memory is amortized pretty heavily as the structure size increases. For a small number of keys, there’s a lot of waste, but the more keys we add, the more likely it is that the new keys will get plopped into an existing node. In the limit, we average out to about 9 bytes/key, which isn’t bad at all. For comparison, a red-black node requires on the order of 24 bytes (2 child pointers plus key plus color).

[Figure: NibbleTree memory overhead vs. number of keys]

So, while the NibbleTree is a memory hog for small N, it’s the ideal choice for big, well-distributed datasets.

Conclusions

A few take-home messages from this little exercise:

1. Linear search is fine for small numbers of elements (20-200). O(lgN) takes a while to kick in.
2. If you must use an STL container, use unordered_map.
3. There is always room for improvement. A specialized data structure can soundly defeat a generic one if you take the time to build it.

3 Comments

  1. Joshua Barczak

    Sorry guys for not approving your comments until now. I get so much spam that I’ve gotten into the habit of not checking.

  2. Michael Bosley

    When inserting into your radix/nibble tree, you should consider how the data changes over time. For example, the low nibbles of pointers usually have the most change, so inserting that at the highest point in the tree will speed searches and lower memory usage.

    For random integers, nibble order won’t matter. For things that are somewhat predictable, like pointers, flipping their byte order may give even more surprising results.

  3. Thanks for the test!

    my results:
    N is: 5000 M is: 5000.
    Miss rate: 50.00,
    Dataset size: 60000
    (42211123) map 0.48ms
    (42211123) split_map 0.52ms
    (42211123) sortedmap 0.32ms
    (42211123) unordered-map 0.12ms
    (42211123) split-unordered-map 0.13ms
    (42211123) nibbletree 0.07ms

    Core i5-2400 @ 3.1GHz (Sandy Bridge)

    BTW: you are using QueryPerfTimer; is this the preferable way of measuring the execution? I’ve read that on Windows it’s not easy to measure time reliably.
    But here we have code running on a single core/thread… so probably nothing bad should happen.
