Cuckoo hashing improves SIMD hash tables (and other hash table tradeoffs)
There are many options when designing a hash table. Cuckoo hashing is a curious design that is popular in academia, but unused in some of industry’s most popular designs, such as Google’s Swiss Tables and Meta’s F14 tables. Cuckoo hashing is often avoided because it has worse memory system performance and is beaten by SIMD-accelerated probing.
This doesn’t have to be the case! With careful engineering, you can combine SIMD-accelerated probing with cuckoo hashing to beat the standard implementations in many scenarios.
Benchmark highlights
Adding cuckoo hashing to a strong baseline is typically close to neutral for performance on low load factors1, and greatly improves performance on high load factors. We show performance across the range of load factors used by Swiss Tables.
For different use cases the best baseline design may also differ; in each case we take the best baseline design and then add cuckoo hashing.
Successful lookups on u64 keys and values, “Direct SIMD” baseline:
Failed lookups on u64 keys and values, “Indirect SIMD” baseline:
Insertions on u64 keys and values, “Indirect SIMD” baseline:
Benchmarks are available. No library release is available yet, but perhaps one will come in the future.
What is cuckoo hashing?
All hash tables we consider use open addressing, where keys are stored in a flat array. The designs differ in the probe sequence: which elements, and in what order, we search during a lookup.
Baseline SIMD hash tables such as Swiss Tables use SIMD quadratic probing. This starts searching from a position determined by the hash function, searches 4–16 consecutive keys in parallel with SIMD, then steps by 1 and searches, then steps by 2 and searches, then 3, and so on:
The search continues until either the matching key is found, or an empty entry in the array is found2.
In cuckoo hashing, we use two hash functions3, do a single SIMD search at each of those positions, and then stop:
The advantage of cuckoo hashing is that we search only a small fixed number of positions during lookups. The surprising fact about cuckoo hashing is that this probe sequence even works! What if I try to insert a new key and all 8 candidate entries are already full?
The answer: to make space for the new key, you can kick out any of the 8 candidate entries and move it to one of its 4 alternative locations. For example, here are the alternatives for the entries in the “hash 0” bucket:
If you still don’t find space in any of the alternative entries, look in their alternatives, and then their alternatives, and so on, in a fanout-4 tree. Once you find an empty location, you can perform a chain of moves—each entry moving to one of its alternative locations—to ultimately make space for the inserted key. With high probability this process completes almost immediately, with just 0 or 1 moves on almost all keys.
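The chain of moves can be illustrated with a deliberately minimal sketch: one-slot buckets, two hash functions, and a random-walk-style eviction loop. This is hypothetical illustration code, not the bucketed, BFS-based scheme benchmarked in this post, and the hash functions are assumptions.

```rust
// Minimal cuckoo insert (bucket size 1, two hashes) to illustrate the chain
// of moves. Each key may live in exactly one of its two candidate slots.
fn slots(key: u64, n: usize) -> (usize, usize) {
    let h0 = key.wrapping_mul(0x9E3779B97F4A7C15); // assumed multiplicative hash
    let h1 = h0.rotate_left(32); // second hash derived from the first
    ((h0 % n as u64) as usize, (h1 % n as u64) as usize)
}

fn insert(table: &mut [Option<u64>], mut key: u64) -> bool {
    let n = table.len();
    let (mut pos, _) = slots(key, n);
    for _ in 0..n {
        let (s0, s1) = slots(key, n);
        if table[s0].is_none() { table[s0] = Some(key); return true; }
        if table[s1].is_none() { table[s1] = Some(key); return true; }
        // Both candidates full: place the key anyway, kicking out the old
        // occupant, which then moves on to its own alternate slot.
        let evicted = table[pos].replace(key).unwrap();
        let (e0, e1) = slots(evicted, n);
        pos = if pos == e0 { e1 } else { e0 };
        key = evicted;
    }
    false // give up; a real table would grow or rehash here
}
```

With high probability the loop terminates after 0 or 1 evictions, matching the claim above.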
When and why does cuckoo hashing win?
In brief: for in-cache tables, cuckoo hashing wins by avoiding branch mispredictions. For out-of-cache tables, large buckets mitigate cuckoo hashing’s disadvantage in memory locality, and then its shorter probe length allows it to win.
Small (in-cache) tables
Quadratic probing performs two equality tests and branches per SIMD probe:
while (true) {
if (<key matches anything in group>) {
return <key found>;
}
if (<anything in group is empty>) {
return <key not found>;
}
<go to next group>
}
The branches are necessary in quadratic probing because we don’t know in advance how many probes to perform, and at least one of the branches is difficult to predict: it’s nearly random which keys will be at probe position 0 versus probe position 1.
With cuckoo probing we know there are at most two probes, so we can unroll the loop, skip the check for empty, and optionally make the entire process branchless by using conditional move or conditional select instructions4. The probing becomes:
<check for match in group 0>
<check for match in group 1>
<return>
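The unrolled probe can be sketched in scalar Rust as follows (the hash functions, group width, and names are assumptions for this sketch; a real implementation would compare all 4 slots with one SIMD instruction):

```rust
// Sketch of the unrolled two-probe cuckoo lookup over 4-slot groups.
const GROUP: usize = 4;
const EMPTY: u64 = u64::MAX; // sentinel for unoccupied slots

fn find_in_group(table: &[u64], start: usize, key: u64) -> Option<usize> {
    let mut found = None;
    for i in 0..GROUP {
        let idx = (start + i) % table.len();
        // Selecting instead of early-returning mirrors the branchless form;
        // compilers typically lower this to a conditional move.
        found = if table[idx] == key { Some(idx) } else { found };
    }
    found
}

fn cuckoo_lookup(table: &[u64], key: u64) -> Option<usize> {
    let h0 = key.wrapping_mul(0x9E3779B97F4A7C15);
    let h1 = h0.rotate_left(32);
    let g0 = (h0 % table.len() as u64) as usize;
    let g1 = (h1 % table.len() as u64) as usize;
    // Exactly two probes, fully unrolled: no loop and no empty-slot check.
    find_in_group(table, g0, key).or_else(|| find_in_group(table, g1, key))
}
```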
For small tables that fit in the CPU cache, the savings in branch mispredictions and in instruction count make cuckoo hashing a win.
Large (out-of-cache) tables, failed lookups
For large tables that don’t fit in the CPU cache, cache line fetches become a more important consideration than branch mispredictions or instruction count. The branchless cuckoo implementation fetches a cache line for each of group 0 and group 1, whereas the branchy cuckoo implementation fetches either 1 or 2 cache lines depending on whether the search terminated after the first or second group. Thus the branchy implementation pays a branch mispredict (sometimes) in order to save a cache line fetch (sometimes). At lower load factors this leads to the branchy implementation beating the branchless implementation.
Whereas in cuckoo hashing the second probe is almost always on a different cache line than its first probe, in quadratic probing they are often on the same cache line, since quadratic probing designs the first two probes to be adjacent. Depending on the layout of the hash table, this can sometimes save cache line traffic. We’ll consider the layouts that we found to be best across different workloads.
One popular design is the Indirect SIMD design from Google’s Swiss Tables, in which there is an array of 1-byte tags and a separate array of (key, value) payloads. SIMD probing is performed on the tag array, and then any matches on the tag array are confirmed by a second (non-SIMD) test against the payload array:
The Indirect SIMD design performs very well for failed lookups (where the key isn’t present in the table), because they can typically be serviced entirely by the tag array without fetching the payload array5. Under quadratic probing for Indirect SIMD layout, probe lengths of 1–2 can typically be handled with just one cache line fetch. Under cuckoo probing for Indirect SIMD layout, probe length 2 requires two cache line fetches, which is worse than for quadratic probing6.
Because of the extra cache line fetches at probe length 2, this (failed lookups for out-of-cache tables) is the worst case for cuckoo hashing: it only starts to pull ahead as load factors grow past 75%. When the load factors are large, quadratic probing starts to require long probe sequences whereas cuckoo hashing keeps them to at most 2:
Large (out-of-cache) tables, successful lookups
For out-of-cache successful lookups, the Indirect SIMD design of the previous section is not optimal: it requires a minimum of two cache misses per lookup: one for the tag array and one for the payload array. There are better layouts which reduce this to just one in the best case.
For integer keys, I found the best layout to be the Direct SIMD layout, with each cache line containing an array of keys and an array of values7:
The advantage of Direct SIMD layout is that probe-length-1 lookups can be serviced with just one cache line fetch. The disadvantage relative to Indirect SIMD layout is that only a small number of keys fit into a cache line: for u64 keys and values, we can fit only 4 keys per cache line, whereas the Indirect SIMD layout fits 64 keys per cache line. For short enough probe lengths, Direct SIMD layout wins, but for very long probe lengths, Indirect SIMD layout wins. Overall, Direct SIMD layout tends to win on successful lookups but lose on failed lookups8, and this is true for both quadratic probing and cuckoo probing.
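The Direct SIMD group described above can be sketched as a single cache-line-sized struct (field names and exact layout are assumptions, not from the original implementation):

```rust
// One Direct SIMD group for u64 keys/values: 4 keys then 4 values, sized
// and aligned to a single 64-byte cache line.
#[repr(C, align(64))]
struct Group {
    keys: [u64; 4], // probed with one 4-wide SIMD compare
    vals: [u64; 4],
}
```

Because the group is exactly one cache line, a probe-length-1 lookup touches one line for both the key compare and the value read.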
Within the context of Direct SIMD layout, cuckoo hashing is a large improvement over quadratic probing, because of the sensitivity of Direct SIMD to probe length. It wins at almost all load factors, by keeping probe lengths short9:
Side benefits of cuckoo probing
Cuckoo probing brings several practical benefits besides just performance.
Cuckoo hashing supports arbitrary table sizes, rather than being restricted to power-of-2 table sizes like quadratic probing10. Depending on the number of elements being stored, this can reduce the memory footprint by up to 2×.
For power-of-2-sized cuckoo tables, the table size can be doubled very efficiently in a completely branchless way without any search process, as follows. When growing the table from `N` buckets to `2N` buckets, take all elements that were stored in bucket `b` and store them in either bucket `b` or bucket `b + N`, according to the newly unmasked hash bit. Since bucket `b` hadn’t overflowed before, neither bucket `b` nor bucket `b + N` will overflow afterwards, and so no search is required. This rehashing operation is extremely efficient: you can grow the table with `realloc` (avoiding the need to have both the old and new tables live simultaneously) and then do a streaming pass over the buckets, with perfect memory system performance for hardware prefetchers (memory is accessed sequentially) and cache line locality (the working set is just two cache lines).
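A sketch of this doubling rehash, assuming one-slot buckets and that the bucket index is the low bits of the hash (the bucketed version moves whole groups the same way):

```rust
// Branchless-in-spirit doubling rehash for a power-of-2 cuckoo table:
// every element in old bucket b moves to b or b + n, chosen by the newly
// unmasked hash bit. No search or eviction is ever needed.
fn grow(table: &mut Vec<Option<u64>>, hash: impl Fn(u64) -> u64) {
    let n = table.len();
    table.resize(2 * n, None); // realloc-style growth keeps the old half in place
    for b in 0..n {
        if let Some(key) = table[b] {
            // The hash bit that was masked off at size n decides the new bucket.
            if hash(key) as usize & n != 0 {
                table[b + n] = Some(key);
                table[b] = None;
            }
        }
    }
}
```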
Deleted elements in quadratic probing must leave “tombstone” entries in their place, rather than emptying out the element. This increases probe lengths, and if too many tombstones accumulate the table may need to be rebuilt from scratch to clear out the tombstones. Cuckoo probing does not require tombstones.
Cuckoo hash tables are a great wire format for sending dictionaries. Both JSON and protocol buffers natively support sending dictionaries. As part of deserializing, the receiver must do the work of building the hash table. Typically the sender has already done that work! Why can’t the sender send the exact hash table they had built? With quadratic probing, this opens you up to a denial-of-service attack by a malicious sender: they could construct the hash table such that probe lengths are extremely long and every query takes an extremely long time. With cuckoo probing, this is impossible: probes are bounded at length 2 by construction. Directly sending cuckoo tables on the wire seems plausible and efficient for binary serialization formats like Cap’n Proto or FlatBuffers.
Conclusion
Depending on the use case, different hash table layouts are optimal: whether you prioritize successful lookups versus failed lookups versus inserts; how much memory you are willing to spend to save time; whether your table is in cache or out of cache. Generally we find that, in cases where you care somewhat about memory footprint, i.e. for load factors larger than 50%, the best choice tends to be one that uses cuckoo hashing.
In many real-world scenarios, the simpler algorithm beats the more complex algorithm. Cuckoo hashing appears to be a case where the more complex algorithm is actually worthwhile, provided it is engineered carefully.
Appendix: engineering a high-performance cuckoo probing scheme
Cuckoo hashing is more complex than quadratic probing, and you have to get several details right in order to make cuckoo probing competitive in performance with the best hash tables.
Unaligned buckets
When performing SIMD probing, it is natural to start the probing at a position that is a multiple of the SIMD width, e.g. such that all SIMD loads are aligned on 16-element boundaries. This divides the table into 16-element “buckets”, and is known as a bucketed hash table. This is simple but typically not optimal.
For both quadratic probing and cuckoo probing it is typically better to allow SIMD probes to be unaligned, determined by the initial hash code. This reduces “clustering” effects and lowers average probe length:
For cuckoo hashing there is partial theoretical validation of this approach.
Caveats: unaligned probing has slightly worse insertion times than aligned probing, mostly because the tricks of the following section do not apply. Additionally, unaligned probing is only compatible with Indirect SIMD layout.
Producing two hash functions
An overhead of cuckoo hashing is that you must evaluate two hash functions rather than one, and evaluating hash functions can be expensive. Here are a few options, which are all fast and work well in practice, even though they somewhat violate the “hash function independence” assumptions required by the cuckoo hashing theory.
If you prioritize lookup performance over insertion performance, the fastest choice is to compute the second hash function from the first by `hash1 = hash0.rotate_left(32)`. This takes just 1 instruction and 1 clock cycle on most current CPUs.
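The rotate trick can be sketched as (function name assumed):

```rust
// Derive the second probe position from the first hash with one rotate.
// For a power-of-2 table, the bucket index is just the low bits of each hash.
fn buckets(hash0: u64, num_buckets: u64) -> (u64, u64) {
    debug_assert!(num_buckets.is_power_of_two());
    let hash1 = hash0.rotate_left(32); // 1 instruction, 1 cycle on most CPUs
    (hash0 & (num_buckets - 1), hash1 & (num_buckets - 1))
}
```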
If you care about insertion performance and you are using the Indirect SIMD table layout, you’d like to be able to switch between the two hash functions using only the single byte in the tag array, without consulting the payload array. A good choice recommended by the MemC3 paper is by XOR: `hash1 = hash0 ^ hash(tag)`, where `tag` is the 1-byte tag. As the paper notes, this construction allows you to go from the current bucket `b` of an entry to the alternate bucket `b2` by `b2 = b ^ hash(tag)`, without needing to know whether `b` is `hash0` or `hash1`. Many options for computing `hash(tag)` work, so long as they produce 64-bit outputs rather than 8-bit outputs. A simple option is to compute `hash(tag)` by lookup in a static `256 x u64` table populated at compile time with random numbers. Another option is `tag * MUL` for some randomly generated odd number `MUL`.
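The XOR construction can be sketched as follows, using the `tag * MUL` option; the constant `MUL` here is an arbitrary odd number chosen for this sketch, not a value from the original implementation:

```rust
// MemC3-style bucket switching from the 1-byte tag alone.
const MUL: u64 = 0x2545F4914F6CDD1D; // arbitrary odd constant (assumption)

fn tag_hash(tag: u8) -> u64 {
    (tag as u64).wrapping_mul(MUL) // the "tag * MUL" option from above
}

fn alternate_bucket(b: u64, tag: u8, num_buckets: u64) -> u64 {
    debug_assert!(num_buckets.is_power_of_two());
    // XOR is an involution: applying this twice returns the original bucket,
    // so we never need to know whether b came from hash0 or hash1.
    (b ^ tag_hash(tag)) & (num_buckets - 1)
}
```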
If you are using Direct SIMD layout rather than Indirect SIMD layout, you have the whole key available. A simple option is then `hash1 = hash0 ^ hash0.rotate_left(32)`. You can move between buckets `b` and `b2` by `b2 = b ^ hash(key).rotate_left(32)`.
The previous two approaches rely on a power-of-2 table size for the “XOR trick” to work. The trick can be generalized to arbitrary table sizes with a few more instructions. Let `bucket_count` be the number of buckets. Mathematically, we’d like to compute

`hash1 = (multiply_high(tag_hash, bucket_count) - hash0) % bucket_count`

To avoid the expensive modulo (`%`) operator, we instead lower it to a predicated subtraction; the whole snippet executes in just 4 fast instructions on x86-64 and AArch64.
Efficient search during insertion
When inserting into a cuckoo table and there’s no free space in either bucket, you need to search for a sequence of element moves to make space. The main strategies from the literature are (a) “random walk”: randomly pick an element to evict and recurse, or (b) “breadth first search”: consider evicting all elements in parallel, recursively. Much of the literature tends to imply that random walk is faster in practice, because the search process state can be kept in CPU registers, whereas breadth first search requires maintaining a BFS queue in memory. However, consistent with the second libcuckoo paper, we find that breadth first search is faster. As the paper notes, breadth first search allows fetching many buckets in parallel (improving memory level parallelism), whereas random walk amounts to following a linked list and offers zero memory level parallelism.
The only necessary search state is a BFS queue of bucket pointers. Parent pointers of the search tree don’t need to be explicitly stored; they can be recovered by index arithmetic similar to binary heaps, by observing that the BFS queue stores (two interleaved) level-ordered complete N-ary trees of buckets. Given that the breadth first search almost always terminates within 1–4 nodes of search, the most important optimization for the search is ensuring the setup time is minimized. Using a stack-allocated uninitialized BFS queue achieves that; this makes BFS much less attractive on managed languages like Java that don’t support this functionality.
In addition to being faster, BFS also can sustain higher load factors than random walk does. This saves memory.
The load factor of a hash table is the fraction of the underlying array that is nonempty, i.e. it is the number of filled slots divided by the number of total slots. Tables with smaller load factors are faster, but use more memory.↩︎
For power-of-two-sized tables, the math works out such that quadratic probing eventually searches the entire array before looping back around.↩︎
Cuckoo hashing necessarily uses at least two hash functions, but it’s also possible to use more than two. Increasing the number of hash functions, k, allows you to build tables with higher load factors (the fraction of the table that is populated with keys). However, you can also achieve higher load factors by probing more consecutive elements, b, per hash function. The latter has better memory locality, and so we prefer it in practice. The max load factors that cuckoo hashing can support at various k and b values are shown here (see e.g. Load Thresholds for Cuckoo Hashing with Overlapping Blocks and many of the papers they cite):
|       | k = 2  | k = 3  | k = 4  |
| ----- | ------ | ------ | ------ |
| b = 1 | 0.5    | 0.9179 | 0.9768 |
| b = 2 | 0.8970 | 0.9882 | 0.9982 |
| b = 3 | 0.9592 | 0.9973 | 0.9997 |
| b = 4 | 0.9804 | 0.9993 | 1.0000 |
↩︎
In Rust, we can ask the compiler to generate such instructions by using `std::hint::select_unpredictable`. Often compilers are quite unreliable at generating branchless code, but with this hint I found it quite reliable.↩︎
Since the tag array only has 1-byte tags, there will occasionally be a false positive on the tag array, where the probe of the tag array indicates there is a key match, but confirmation against the payload array fails. These cases lose performance, both due to the cache line fetch of the payload array and the branch mispredict. For 7-bit hashes per tag and SIMD width 16, this happens with probability <13% per SIMD probe.↩︎
A natural idea is to try increasing the cuckoo hashing bucket size, to increase cache locality. For example, instead of probing 16 tags in parallel with SIMD, we could probe 32 or 64 tags: all the ones on the same cache line. However, this seemed to hurt performance relative to 16 tags, at all load factors. My best explanation is that the increased bucket size means we have a higher probability of a false positive match on the tag array, thus increasing the average number of fetches against the payload array.↩︎
For string keys—where I did some experimentation, but less than for integers—it seems like the best layout is interleaved SIMD, which puts a small tag array and a small payload array in each cache line. Meta’s F14 tables use almost exactly this layout, except they work in cache line pairs instead of cache lines.↩︎
Successful lookups almost always have shorter probe lengths than failed lookups. Many of the keys in the table were inserted when the table was almost empty, and so successful lookups will find those keys almost immediately. However, failed lookups always have to search the table when it’s at its current (fullest) load.↩︎
Curiously, cuckoo hashing is not an improvement in the successful-lookup case for the Indirect SIMD design: the Indirect SIMD design has wide enough SIMD width that for successful lookups cuckoo probing doesn’t provide any advantage. However, Direct SIMD is a stronger baseline than Indirect SIMD, and cuckoo probing is an improvement there, because the SIMD width typically is only 4 keys, not 16.↩︎
On power-of-2 tables, you compute the bucket number from the hash bits by `hash_bits & mask`. On arbitrary-sized tables you use `multiply_high(hash_bits, num_buckets)`. On most current CPUs this costs one latency-3 instruction, whereas `& mask` costs one latency-1 instruction. This is typically almost no impact on hash table throughput, and a small impact on hash table latency.↩︎
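The two bucket-mapping options from this footnote can be sketched as (function names assumed):

```rust
// Masking for power-of-2 table sizes, and a widening high multiply for
// arbitrary sizes: (hash * n) >> 64 maps a 64-bit hash into [0, n).
fn bucket_pow2(hash_bits: u64, num_buckets: u64) -> u64 {
    hash_bits & (num_buckets - 1) // num_buckets must be a power of 2
}

fn bucket_any(hash_bits: u64, num_buckets: u64) -> u64 {
    ((hash_bits as u128 * num_buckets as u128) >> 64) as u64
}
```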