As far as new instruction sets go, AVX has gotten a lot of attention, but there are also a whole bunch of relatively obscure bit manipulation instructions in the latest CPUs, which are really handy if you can get at them. A lot of things that used to be rather expensive are now almost free. Some of these are exclusive to Intel’s Haswell at the moment, but others are also available on earlier CPUs, in particular the next-gen consoles.
For more information, wikipedia is your friend.
In this post, I’m going to touch on a few of the tricks that these instructions enable. Note that there are better ways to do a lot of these. My examples are for illustrative purposes only. Even if we use all of the tricks in the bithacks repertoire, they’ll still turn out to be considerably slower than the BMI instructions.
Also, as Donald Knuth famously said “beware of bugs in the following code, I have only proved it correct, not tried it.”
Leading Zero Count
Suppose you want to know the floor of the base two log of a number. For example, perhaps you have a texture and need to know how many mip levels you’ll have. Assuming a full mip chain, the number of additional mips is the log of the texture dimension. (4×4 has two, 8×8 has three, and so on).
To do something like this using C, you might write:
uint log2(uint x) { uint l=0; while( x ) { x = x>>1; l++; } return l-1; } |
We can do it in two operations using the ‘leading zero count’ instruction:
uint log2( uint x ) { return 31 - _lzcnt_u32(x); } |
Enumerating Bits
It’s very common to use bit-fields as a very compact set data structure. If we have a collection of N objects, and we want to know which of them are members of some set, we can represent set membership using a bit field. We can then do set operations quite easily. A bitwise or gives us the union of two sets, a bitwise and gives us the intersection. Set difference can be done by anding one set with the complement of the other. However, enumerating the set of active bits in a bit field requires something like this ugliness:
uint GetSetBits( uint* indices, uint x ) { uint count=0; uint i=0; while( x) { if( x & 1 ) indices[count++] = i; x = x >> 1; i++; } } |
Newer CPUs can use a much more efficient loop that iterates only once per set bit:
uint GetSetBits( uint* indices, uint x ) { uint count=0; while( x ) { uint k = _tzcnt_u32(x); x = _blsr_u32(x); // apparently x & (x-1) does the same thing indices[count++] = k; } } |
Population Count
If we’re going to use bitfields to represent sets, we might also need to be able to compute the cardinality of our set. This can be done by adding up four bytes from a 256-entry LUT, or an ugly loop like:
uint CountBits( uint x ) { uint count=0; while( x ) { count += x & 1; x = x >> 1; } return count; } |
But now there’s an instruction for this:
uint CountBits( uint x ) { return _mm_popcnt_u32(x); } |
See here for one of my favorite graphics usages. It’s worth noting that in DX11 shader hardware has this too.
Morton Numbers
The Morton Curve (or “Z curve”) is generated by de-interleaving bits to form coordinates. Interleaving the bits of two N-bit coordinates produces a single 2N bit number. It turns out that if we count off consecutive integers and de-interleave the bits, the resulting coordinates for two consecutive numbers are very likely to be close to each other. This is used all over the place in graphics, especially for rasterization, since building hardware to generate coordinates from morton numbers is stupidly easy.
Unfortunately, doing it in software kindof sucks. You end up with something like:
uint morton( uint8 x, uint8 y ) { uint mort=0; for( uint i=0; i<8; i++ ) // hopefully unrolled by our compiler mort |= (x&(1<<i)) <<i | (y&(1<<i)) << (i+1); return mort; } |
Haswell has an instruction called “parallel bit deposit” which seems purpose-built for this. It works by taking a set of consecutive low-order bits from the input and scattering them to a set of bit positions designated by a mask. For example: deposit( 101, 1101 ) produces 1001. There’s also a “parallel bit extract” which does the reverse.
What used to require a loop, or a big LUT and a bunch of shifts, now takes just a few instructions:
uint morton( uint8 x, uint8 y ) { return _pdep_u32(x,0x5555) | _pdep_u32(y,0xAAAA); } |
Somebody somewhere is going to use this for BVH construction, among other things.
I got to know these bit manipulation instructions quite intimately one time when I was optimizing a decoder for a newly developed image compression algorithm. Entropy decoder was full of stuff like this, like trying to find the first zero in an integer (lzcount of negated integer). I think I managed to get out an order of magnitude of performance in that part of the code just by using these instructions.