Bindless Chain Letter – The Burning Basis Vector

There’s a great dialog going on about bindless resources.

I was going to post about this, but Timothy Lottes and Fabian Giesen have started a discussion. Since AMD recently released CodeXLAnalyzer, we can look at what the actual driver is doing.

I got these results from the Catalyst 14.6 beta.

Test Shader

sampler s0;
sampler s1;
sampler s2;
sampler s3;
Texture2D<float4> t0;
Texture2D<float4> t1;
Texture2D<float4> t2;
Texture2D<float4> t3;
 
// GL-like: every texture is sampled with its own sampler.
float4 PS_GL_Like( float2 uv : TEXCOORD0 ) : SV_Target
{
    return t0.Sample(s0,uv) + t1.Sample(s1,uv) + t2.Sample(s2,uv) + t3.Sample(s3,uv);
}

// DX-like: all four textures share the single sampler s0.
float4 PS_DX_Like( float2 uv : TEXCOORD0 ) : SV_Target
{
    return t0.Sample(s0,uv) + t1.Sample(s0,uv) + t2.Sample(s0,uv) + t3.Sample(s0,uv);
}

Useful bat file:

"C:\Program Files (x86)\AMD\CodeXL\CodeXLAnalyzer.exe" --DXLocation "C:\Program Files (x86)\Windows Kits\8.1\bin\x86\d3dcompiler_47.dll" -p ps_5_0 -s HLSL -f %2 -c Hawaii --isa %2 %1

DX-like result:

s_mov_b64 s[0:1], exec 
s_wqm_b64 exec, exec 
s_mov_b32 m0, s8 
s_load_dwordx8 s[8:15], s[2:3], 0x00 
s_load_dwordx8 s[16:23], s[2:3], 0x08 
s_load_dwordx8 s[24:31], s[2:3], 0x10
s_load_dwordx8 s[32:39], s[2:3], 0x18 
v_interp_p1_f32 v14, v0, attr0.x 
v_interp_p1_f32 v15, v0, attr0.y 
v_interp_p2_f32 v14, v1, attr0.x 
v_interp_p2_f32 v15, v1, attr0.y 
s_waitcnt lgkmcnt(0) 
image_sample v[2:5], v[14:17], s[8:15], s[4:7]
image_sample v[6:9], v[14:17], s[16:23], s[4:7]
image_sample v[10:13], v[14:17], s[24:31], s[4:7]
image_sample v[14:17], v[14:17], s[32:39], s[4:7]
s_waitcnt vmcnt(2) 
v_add_f32 v0, v2, v6 
v_add_f32 v1, v3, v7 
v_add_f32 v2, v4, v8 
v_add_f32 v3, v5, v9 
s_waitcnt vmcnt(1) 
v_add_f32 v0, v0, v10 
v_add_f32 v1, v1, v11 
v_add_f32 v2, v2, v12 
v_add_f32 v3, v3, v13 
s_waitcnt vmcnt(0) 
v_add_f32 v0, v0, v14 
v_add_f32 v1, v1, v15 
v_add_f32 v2, v2, v16 
v_add_f32 v3, v3, v17 
s_mov_b64 exec, s[0:1] 
v_cvt_pkrtz_f16_f32 v0, v0, v1 
v_cvt_pkrtz_f16_f32 v1, v2, v3 
exp mrt0, v0, v0, v1, v1 done compr vm 
s_endpgm 
end

GL-like result:

  s_mov_b64     s[52:53], exec  
  s_wqm_b64     exec, exec            
  s_mov_b32     m0, s6
  s_load_dwordx8  s[8:15], s[2:3], 0x00  
  s_load_dwordx4  s[16:19], s[4:5], 0x00 
  s_load_dwordx8  s[20:27], s[2:3], 0x08 
  s_load_dwordx4  s[28:31], s[4:5], 0x04 
  s_load_dwordx8  s[32:39], s[2:3], 0x10 
  s_load_dwordx4  s[40:43], s[4:5], 0x08  
  s_load_dwordx8  s[44:51], s[2:3], 0x18 
  s_load_dwordx4  s[0:3], s[4:5], 0x0c   
  v_interp_p1_f32  v14, v0, attr0.x
  v_interp_p1_f32  v15, v0, attr0.y  
  v_interp_p2_f32  v14, v1, attr0.x
  v_interp_p2_f32  v15, v1, attr0.y
  s_waitcnt     lgkmcnt(0) 
  image_sample  v[2:5], v[14:17], s[8:15], s[16:19]
  image_sample  v[6:9], v[14:17], s[20:27], s[28:31]
  image_sample  v[10:13], v[14:17], s[32:39], s[40:43]
  image_sample  v[14:17], v[14:17], s[44:51], s[0:3]
  s_waitcnt     vmcnt(2)
  v_add_f32     v0, v2, v6
  v_add_f32     v1, v3, v7 
  v_add_f32     v2, v4, v8 
  v_add_f32     v3, v5, v9
  s_waitcnt     vmcnt(1)   
  v_add_f32     v0, v0, v10
  v_add_f32     v1, v1, v11
  v_add_f32     v2, v2, v12 
  v_add_f32     v3, v3, v13 
  s_waitcnt     vmcnt(0)
  v_add_f32     v0, v0, v14  
  v_add_f32     v1, v1, v15
  v_add_f32     v2, v2, v16 
  v_add_f32     v3, v3, v17 
  s_mov_b64     exec, s[52:53]
  v_cvt_pkrtz_f16_f32  v0, v0, v1
  v_cvt_pkrtz_f16_f32  v1, v2, v3 
  exp           mrt0, v0, v0, v1, v1 done compr vm
  s_endpgm
end

Some things to note:

The texture descriptors are 32 bytes each. Not what we expected. I haven’t dug into the descriptor format, so maybe there’s a good reason. I could understand if the driver padded its descriptors out to full size, but in that case I’d expect the compiler not to load all the padding.
They do not combine the s_load_dwordx8s into x16s, as Fabian suggests, but they certainly could. There might be good reasons for not combining them, but if there are, it’s a point in favor of the GL bind model. One reason not to combine them might be to allow the sample instructions to start independently of each other as soon as their respective s_load finishes, but that doesn’t seem to be what’s happening, because there’s only one s_waitcnt between the s_loads and the sample instructions.
In the DX-like one, our single sampler is pre-loaded as part of the launch payload (it is already sitting in s[4:7], with no s_load for it). Nice little trick. Point in favor of separate texture/sampler. Also, not something that could be easily done with bindless.
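
To put numbers on the first two notes, here is a small Python sketch that tallies the scalar loads from the two listings. It assumes the s_load immediate offsets on Hawaii are counted in dwords (that is what makes the 0x08 step come out to 32 bytes). The `Load` tuple and the helper names are made up for this example and are not part of CodeXL.

```python
from collections import namedtuple

# One scalar load: width in dwords and the immediate offset as printed in the ISA.
Load = namedtuple("Load", ["dwords", "offset"])

DWORD_BYTES = 4  # assumption: s_load offsets on Hawaii are in dwords, not bytes

# Texture descriptor loads (base in s[2:3]); identical in both shaders.
texture_loads = [Load(8, 0x00), Load(8, 0x08), Load(8, 0x10), Load(8, 0x18)]
# Sampler descriptor loads (base in s[4:5]); GL-like shader only.
sampler_loads = [Load(4, 0x00), Load(4, 0x04), Load(4, 0x08), Load(4, 0x0C)]


def stride_bytes(loads):
    """Distance in bytes between consecutive descriptors in the table."""
    steps = {b.offset - a.offset for a, b in zip(loads, loads[1:])}
    assert len(steps) == 1, "descriptors are not evenly spaced"
    return steps.pop() * DWORD_BYTES


def loaded_bytes(loads):
    """Total bytes pulled into SGPRs by a list of loads."""
    return sum(load.dwords * DWORD_BYTES for load in loads)


dx_like_bytes = loaded_bytes(texture_loads)
gl_like_bytes = loaded_bytes(texture_loads) + loaded_bytes(sampler_loads)

print("texture stride:", stride_bytes(texture_loads), "bytes")
print("sampler stride:", stride_bytes(sampler_loads), "bytes")
print("DX-like:", dx_like_bytes, "bytes loaded; GL-like:", gl_like_bytes, "bytes loaded")
```

Under that assumption the stride matches the load width in both tables, so the compiler really is reading every byte of each 32-byte texture slot, and the GL-like shader fetches 64 more bytes per wave for its four samplers.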

Tim writes:
“The AMD ISA docs document the K$ as 16-byte wide (which is why 64-byte block loads only need 4 scalar alignment in the scalar register file). A 16-byte load would have a throughput of 1 clock, a 32-byte load in 2 clocks, and a 64-byte load in 4 clocks. With modern ALUop:TEXop ratios of 16:1 or so, some percent of an extra clock for GL might not matter. The last issue would be K$ utilization with some possible duplication of samplers. K$ is 16KB. If a shader uses 16 2D textures and 4 samplers: DX = 320B or about 2% of the K$, GL = 512B or about 3% of the K$. Does that matter?”

It looks like the K$ is backed by L2, so it matters a little bit if it’s pushing out lines that might be usable for textures and such. The other thing I’m wondering is whether state changes are pipelined through the K$. If a bind change means shifting the table base address forwards, then the K$ utilization is loosely correlated with how well the whole pipeline can tolerate bind changes.
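
Tim’s footprint numbers are easy to reproduce, and it is worth redoing them with the 32-byte texture descriptors seen in the listings. The sketch below is my reading of his arithmetic: DX pays for each sampler once, GL pays for one sampler copy per texture. The function names are hypothetical.

```python
K_CACHE_BYTES = 16 * 1024  # K$ size quoted by Tim


def footprint(textures, samplers, tex_desc_bytes, samp_desc_bytes, combined):
    """Descriptor bytes a shader touches.

    combined=True models GL-style texture+sampler pairs (one sampler copy per
    texture); combined=False models DX-style separately bound samplers.
    """
    sampler_count = textures if combined else samplers
    return textures * tex_desc_bytes + sampler_count * samp_desc_bytes


def percent_of_cache(nbytes):
    """Share of the K$ taken by nbytes, in percent."""
    return 100.0 * nbytes / K_CACHE_BYTES


# Tim's case: 16 textures, 4 samplers, 16-byte descriptors for both.
dx_16 = footprint(16, 4, 16, 16, combined=False)
gl_16 = footprint(16, 4, 16, 16, combined=True)
# Same shader, but with the 32-byte texture descriptors from the ISA listings.
dx_32 = footprint(16, 4, 32, 16, combined=False)
gl_32 = footprint(16, 4, 32, 16, combined=True)

for name, nbytes in [("DX, 16B tex", dx_16), ("GL, 16B tex", gl_16),
                     ("DX, 32B tex", dx_32), ("GL, 32B tex", gl_32)]:
    print(f"{name}: {nbytes} B, {percent_of_cache(nbytes):.1f}% of K$")
```

With 32-byte texture descriptors the totals grow to 576 B for DX and 768 B for GL, so the gap between the two models stays at 192 B while both take a larger slice of the cache.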

Fabian writes:
“For GL bindless mode, you would presumably use a single global descriptor table that all your handles point into, and would preload the base address to that table using the 16 scalars you get to set per draw.”

The handles could also just be pointers directly to the descriptors, in which case they wouldn’t need the base address anymore. The driver could just allocate descriptors wherever it likes. I suspect that this is the reason why the bindless handles are 64-bit.
If that is true, then GL bindless could lead to thrashing in the K$ if the addresses aren’t localized enough. This is the thing that makes me uneasy about GL bindless. Ignoring samplers, if we diagram the two models (bind and bindless), they look like this:

Observations about the bottom one:

DX12 bindless looks a lot like the top one.
Awful lot of indirections. It’s not so bad if you have explicit control over where the descriptors are placed, but it could start thrashing if you don’t.
With the bind model, there’s the opportunity, in theory, to prefetch descriptors ahead of a draw. Not sure if that’s feasible with GL bindless.
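
The thrashing worry can be illustrated with a rough Python sketch that counts how many cache lines a set of descriptors touches when they are packed into a table versus scattered by pointer-style handles. The 64-byte line size, the base address, and the 4 KB spacing are all assumptions picked for illustration, not values from the ISA docs or the driver.

```python
CACHE_LINE_BYTES = 64   # assumption for illustration; not taken from the ISA docs
DESCRIPTOR_BYTES = 32   # texture descriptor size seen in the listings


def lines_touched(addresses):
    """Number of distinct cache lines covered by descriptors at these addresses."""
    lines = set()
    for addr in addresses:
        first = addr // CACHE_LINE_BYTES
        last = (addr + DESCRIPTOR_BYTES - 1) // CACHE_LINE_BYTES
        lines.update(range(first, last + 1))
    return len(lines)


def table_addresses(base, count):
    """Bind model: descriptors packed back to back from a table base."""
    return [base + i * DESCRIPTOR_BYTES for i in range(count)]


def scattered_addresses(base, count, spacing):
    """Pointer-style handles: each descriptor lives `spacing` bytes from the last."""
    return [base + i * spacing for i in range(count)]


# Hypothetical layout: 16 descriptors, packed vs. one per 4 KB page.
packed = lines_touched(table_addresses(0x1000, 16))
scattered = lines_touched(scattered_addresses(0x1000, 16, 4096))

print("packed table:", packed, "lines")
print("scattered handles:", scattered, "lines")
```

In this made-up layout the packed table shares each line between two descriptors, while the scattered case spends a full line per descriptor. How bad the real case gets depends entirely on where the driver places the descriptors, which is exactly the part the application does not control.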