Continuing the bindless texture discussion with Timothy Lottes (sorry for the delay, it’s been a busy week).
I’m actually wondering if the “block loads” are as important an optimization as we think they are. Here is my thinking: K$ rate is 4 dwords/clk. There’s one scalar unit per CU and waves are issued round-robin among the 4 SIMDs, one SIMD per clock. All of this is in the crash course. This means that even if a wave could get its 4 dwords out in one clock, it still needs to wait for 4 clocks to elapse before doing anything else. The rate for a 4x read is 1 dword/SIMD/clk.
My undocumented guess is that the K$ always ends up hitting 1 dword/SIMD/clk in the case when all SIMDs are issuing reads at once. Maybe individual SIMDs can also use more bandwidth under low contention. What I’m getting at is that two 4x loads and one 8x load might end up having the same cost with enough occupancy, which could explain why the compiler didn’t combine things like we thought it would. If true, it means that dword count is more important than instruction count. Hard to tell without checking the non-public docs, and once I do that I have to stop talking.
Any design where descriptors end up getting randomly accessed favors the GL combined {texture, sampler} design (no extra random access for samplers).
But that depends on how much redundancy there is between the sampler and texture descriptors. We might have, say, 1000 textures and 50 samplers in our scene. It might be less convenient to implement, but there could be merit to issuing extra loads in order to make better use of the cache. It seems like it’s a design point that we’d want the API to leave open. Different apps will have different needs/preferences here.
This brings up two issues which should be thought through better in any adjustments to the GL bindless design:
“(a.) Ideally texture descriptor location in a global table would be up to the developer. This way they could pair two {16-byte texture descriptor, 16-byte sampler descriptor} descriptor pairs which get accessed in the same material into a 64-byte line for the common case of non-array 2D textures. This provides no waste.”
For GCN at least, ideal is probably to allow the texture/sampler descriptors to be freely mixed. That way the shader has the freedom to use GL-style and put sampler/texture adjacent (allowing the compiler to combine the loads), or DX-style using separate indices. Or even DX-style using 24:8 index packing if it likes. You could also do things like amortize across related textures in a set. If, e.g., diffuse, normal, and spec are always resident together and use the same filtering, you can put sampler, diffuse, normal, spec all in one line, and then have one index for all of them.
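That material-set packing can be sketched as a struct. This is illustrative only, with made-up field names, using the descriptor sizes from this discussion (16-byte sampler, 16-byte non-array 2D texture descriptor):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the descriptor sizes discussed here; fields are made up. */
typedef struct { uint32_t dw[4]; } SamplerDesc;   /* 16 bytes */
typedef struct { uint32_t dw[4]; } TextureDesc;   /* 16 bytes */

/* One 64-byte line, reachable through a single index: */
typedef struct {
    SamplerDesc shared_sampler;  /* same filtering for all three maps */
    TextureDesc diffuse;
    TextureDesc normal;
    TextureDesc spec;
} MaterialLine;

_Static_assert(sizeof(MaterialLine) == 64,
               "whole material set fits exactly one cache line");
```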
“(b.) Fixed worst case 64-byte packing {32-byte texture descriptor, 16-byte sampler descriptor, 16-byte padding} does not look like a good option since in the common case, 32-bytes/line would get wasted.”
Separate texture/sampler helps here too: always 4 samplers/line and 2-4 textures/line.
“Descriptor table base pointer pre-loaded, constant buffer addresses preloaded <- free to shader .. Block load <= 16 constants, const 32-bit resource offset amortized to a fraction of 1 indirection …The fractional indirection does not matter”
Good point that small handles could be amortized across load instructions, but we’re still using up registers and K$ space that a bind model could avoid. A given shader might be better off by just shifting the base address around, and this is an option that the API could make available.
We have this:
r0:r3 = load_4x( cbuffer_base );
load_8x( descriptor_base + r0 );
load_8x( descriptor_base + r1 );
load_8x( descriptor_base + r2 );
load_8x( descriptor_base + r3 );
When we could have this:
load_32x( descriptor_base );
A particular app might decide to do the top one anyway, but better to give it the choice.
I see two main motivations for bindless:
- It enables things that were difficult before, by allowing a shader to access any texture at any time.
- It avoids cache misses in the driver. Much was made of this back when Nvidia did the original bindless extension.
Reason #1 is all well and good. Bindless enables new applications and is thus a welcome development. Reason #2, though, might just be trading overhead in one place for overhead in another place. Before we do that, we ought to make sure that the first place is already as efficient as we can make it, and I don’t think this is true of either DX11 or OGL. Today’s bind model basically has the app providing arrays of pointers which the driver chases to build contiguous descriptor blocks. Instead, the app could just provide a contiguous descriptor block.
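The difference between the two bind models can be sketched in a few lines of C. Everything here is invented for illustration; a 32-byte struct stands in for one descriptor:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

typedef struct { uint32_t dw[8]; } Descriptor;  /* 32-byte stand-in */

/* Today's model: the driver chases an array of app-provided pointers to
 * assemble a contiguous descriptor block at draw time. */
static void bind_by_pointer_chase(Descriptor *block,
                                  const Descriptor *const *views, int n) {
    for (int i = 0; i < n; ++i)
        block[i] = *views[i];   /* one potential cache miss per view */
}

/* Alternative: the app already laid the block out contiguously, so
 * "binding" is a single copy, or just a base-pointer update. */
static void bind_contiguous(Descriptor *block,
                            const Descriptor *app_block, int n) {
    memcpy(block, app_block, (size_t)n * sizeof(Descriptor));
}
```

Both produce the same descriptor block; the second simply skips the per-view pointer chase (and the associated driver cache misses) by making layout the app's job.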
The workaround for these problems is to expose something which is a global descriptor table. The API would provide functionality similar to the following: ….
This is what I’d like to see as well, but with samplers added. If you think about it, the descriptor table is nothing more than a constant buffer with very particular contents. “Binding” is just moving the table pointer. This could be presented to the shader as a special uniform buffer that contains only textures and samplers. There could also be intrinsics to fetch from a given offset and reinterpret as a particular descriptor type. App would be responsible for laying out the table and passing correct offsets to shaders, or for binding the table at the right offset for a particular draw (these are the same thing).