# Why Geometry Shaders Are Slow (Unless you’re Intel)

Let’s look at a simple, constrained test case. Suppose I want to render a very large number of oriented boxes. That makes for a rather boring scene, yes, but there are legitimate use cases in which I might want to do it. I might have a deferred renderer that’s rendering light bounding volumes, or maybe I’m doing an object space AO technique like the one Oat, Shopf, and I published in ShaderX7 under the title “Deferred Occlusion From Analytic Surfaces”. Or maybe I’m rendering lots of bounding volumes for occlusion culling. Let’s just pretend for a moment that this is an important workload. How do I do this as fast as possible?

For our experiments, we’ll render the boxes using a simple pixel shader that uses derivative trickery to extract a normal, and then does diffuse lighting. This is really only there to give us something interesting to see and to make sure that we didn’t mangle the geometry:

    row_major float4x4 g_RasterToWorld;

    float4 main( float4 v : SV_Position ) : SV_Target
    {
        float4 p = mul( float4(v.xyz,1), g_RasterToWorld );
        p.xyz /= p.w;
        float3 n = normalize( cross( ddx(p.xyz), ddy(p.xyz) ) );
        float3 L = normalize( float3(1,1,-1) );
        float d = saturate( dot(n,L) );
        return float4(d.xxx,1);
    }

We’ll also pick a viewpoint so that most of the boxes are off screen. I’ve repeated this with all of the boxes off screen and gotten essentially the same result. You can find all of the code here. I built it on my laptop using VC++ Express 2013. If you’re lucky, it might work.

Let’s consider three ways of rendering the boxes:

## Method 1: The Obvious

An oriented box can be defined by a 3×4 transform matrix which deforms a unit cube into the box of interest. The columns of this matrix contain the center and axes of our box, as shown below:

$\begin{bmatrix} X_x & Y_x & Z_x & C_x \\ X_y & Y_y & Z_y & C_y \\ X_z & Y_z & Z_z & C_z \\\end{bmatrix}$

So, let’s just render a bunch of instanced cubes. We have one unit box mesh, and a buffer packed full of transform matrices, and we use an ordinary instanced drawcall on it. Our vertex shader looks like this:

    uniform row_major float4x4 g_ViewProj;

    float4 main( float4 v  : POSITION,
                 float4 R0 : XFORM0,   // R0,R1,R2 are a 3x4 transform matrix
                 float4 R1 : XFORM1,
                 float4 R2 : XFORM2 ) : SV_Position
    {
        // deform unit box into desired oriented box
        float3 vPosWS = float3( dot(v,R0), dot(v,R1), dot(v,R2) );

        // clip-space transform
        return mul( float4(vPosWS,1), g_ViewProj );
    }

I’m using float4s and dot products instead of a float3x4 matrix type, because TBH I can never keep N and M straight in my head, and the dot products are just easier on my brain.
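
To make the dot-product formulation concrete, here is a minimal CPU-side sketch in Python. It is not part of the original demo: `transform_corner` and the sample box are made up for illustration. It applies the three matrix rows to a homogeneous unit-cube corner the same way the vertex shader does.

```python
def dot4(a, b):
    """4-component dot product, like HLSL dot(float4, float4)."""
    return sum(x * y for x, y in zip(a, b))


def transform_corner(v, r0, r1, r2):
    """Deform a unit-cube corner v = (+-1, +-1, +-1, 1) into the oriented box."""
    return (dot4(v, r0), dot4(v, r1), dot4(v, r2))


# Hypothetical axis-aligned box: half-extents (2, 3, 4), center (5, 6, 7).
# Each row holds one component of the X, Y, Z axes followed by the center.
R0 = (2, 0, 0, 5)
R1 = (0, 3, 0, 6)
R2 = (0, 0, 4, 7)

corner = transform_corner((1, -1, 1, 1), R0, R1, R2)
center = transform_corner((0, 0, 0, 1), R0, R1, R2)
```

Because the last component of the input is 1, the fourth entry of each row acts as the translation, which is why the center sits in the last column of the matrix.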

## Method 2: The “Instancing Sucks” Way

Let’s do the same thing, except let’s not use instancing. The reasons for this will become clear shortly. Now, we don’t want to go doing one drawcall per box, because that would just introduce more overhead. We also don’t want to duplicate the same unit cube 250K times. That would consume unnecessary bandwidth. Instead, we’ll do this by generating a gigantic index buffer and doing SV_VertexID math. We know that cube i will reference vertices 8*i through 8*i+7, so we can figure out our own local instance and vertex ID from a flat index buffer. The only drawback is that now we need to fetch our vertex and instance data explicitly:

    uniform row_major float4x4 g_ViewProj;

    Buffer<float4> Verts;
    Buffer<float4> XForms;

    float4 main( uint vid : SV_VertexID ) : SV_Position
    {
        uint xform = vid/8;
        float4 v  = Verts[vid%8];
        float4 R0 = XForms[3*xform];
        float4 R1 = XForms[3*xform+1];
        float4 R2 = XForms[3*xform+2];

        // deform unit box into desired oriented box
        float3 vPosWS = float3( dot(v,R0), dot(v,R1), dot(v,R2) );

        // clip-space transform
        return mul( float4(vPosWS,1), g_ViewProj );
    }
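
The host side of this scheme is not shown in the post, so the following Python sketch is only a guess at how the flat index buffer could be built. `CUBE_INDICES` is a hypothetical triangle list over local vertex IDs 0..7 (winding order is ignored here), and `decode_vertex_id` mirrors the `vid/8` and `vid%8` math from the shader.

```python
VERTS_PER_CUBE = 8

# Hypothetical triangle list for one cube: 12 triangles, local vertex IDs 0..7.
# The real demo may order its unit-cube vertices and triangles differently.
CUBE_INDICES = [
    3, 2, 6,  2, 6, 7,
    7, 4, 2,  4, 2, 0,
    2, 0, 3,  0, 3, 1,
    3, 1, 6,  1, 6, 5,
    6, 7, 4,  6, 5, 4,
    5, 4, 1,  4, 1, 0,
]


def build_index_buffer(num_boxes):
    """Flat index buffer in which box i references vertices 8*i .. 8*i+7."""
    indices = []
    for box in range(num_boxes):
        base = VERTS_PER_CUBE * box
        indices.extend(base + local for local in CUBE_INDICES)
    return indices


def decode_vertex_id(vid):
    """Mirror of the shader math: (transform index, local vertex index)."""
    return vid // VERTS_PER_CUBE, vid % VERTS_PER_CUBE


index_buffer = build_index_buffer(3)
largest_index_for_250k_boxes = VERTS_PER_CUBE * 250_000 - 1
```

With 250K boxes the largest index is 1,999,999, which does not fit in 16 bits, so a buffer like this would need 32-bit indices.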

## Method 3: The Clever Way

Let’s think outside the box about our boxes for a second. What is it that we have to do in order to draw a box? We need to compute the clip space positions of each of its 8 vertices. That’s all. So far, we’ve done this by doing a pair of matrix multiplications. We deform a unit cube into our oriented box, and then apply our view-projection matrix to the result. Like so:

$v = M_{vp} *\begin{bmatrix} X_x & Y_x & Z_x & C_x \\ X_y & Y_y & Z_y & C_y \\ X_z & Y_z & Z_z & C_z \\\end{bmatrix}* \begin{bmatrix} \pm 1 \\ \pm 1 \\ \pm 1 \\ 1 \end{bmatrix}$

If we account for the fact that our vertex coordinates are all $\pm 1$, then we can boil the world matrix down to a series of vector adds and subtracts, like so:

$v = M_{vp} * [ \begin{bmatrix} C_x \\ C_y \\ C_z \\ 1 \end{bmatrix} \pm \begin{bmatrix} X_x \\ X_y \\ X_z \\ 0 \end{bmatrix} \pm \begin{bmatrix} Y_x \\ Y_y \\ Y_z \\ 0 \end{bmatrix} \pm \begin{bmatrix} Z_x \\ Z_y \\ Z_z \\ 0 \end{bmatrix} ]$

Now we can exploit the distributive property of matrix multiplication and factor out the view-projection transform, like so:

$v = M_{vp}C \pm M_{vp}X \pm M_{vp}Y \pm M_{vp}Z$

The geometric interpretation of this is that instead of expanding our box in world-space, and then transforming the results to clip space, we instead pre-transform its basis vectors and center-point, and then do the expansion in clip space. By factoring things this way we can eliminate quite a few subexpressions.
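
As a quick numerical check of the factoring, here is a small Python sketch. The matrix and box vectors are arbitrary integers chosen for illustration (integers keep the comparison exact), and the helper names are not from the original code. It computes each corner both ways: expand in world space and then transform, versus transform the center and axes once and expand in clip space.

```python
from itertools import product


def mat_vec(m, v):
    """4x4 matrix times 4-component column vector."""
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]


def corner_world_then_clip(m_vp, c, x, y, z, signs):
    """Expand the corner in world space, then apply the view-projection."""
    sx, sy, sz = signs
    world = [c[i] + sx * x[i] + sy * y[i] + sz * z[i] for i in range(4)]
    return mat_vec(m_vp, world)


def corner_clip_expand(m_vp, c, x, y, z, signs):
    """Transform center and axes first, then expand in clip space.

    For clarity the four products are recomputed per call; a real
    implementation would compute them once per box.
    """
    sx, sy, sz = signs
    mc, mx, my, mz = (mat_vec(m_vp, v) for v in (c, x, y, z))
    return [mc[i] + sx * mx[i] + sy * my[i] + sz * mz[i] for i in range(4)]


# Arbitrary sample data, not taken from the demo.
M_VP = [[2, 0, 1, 3],
        [0, 1, 0, -2],
        [1, 0, 4, 5],
        [0, 0, 1, 0]]
C = [5, 6, 7, 1]
X = [2, 1, 0, 0]
Y = [-1, 2, 0, 0]
Z = [0, 0, 3, 0]

ALL_SIGNS = list(product((-1, 1), repeat=3))
results_match = all(
    corner_world_then_clip(M_VP, C, X, Y, Z, s) == corner_clip_expand(M_VP, C, X, Y, Z, s)
    for s in ALL_SIGNS
)
```

The first path needs eight matrix-vector products per box, the second only four plus a handful of vector adds.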

The final piece of the puzzle is to minimize the number of verts that our GS emits. Some clever individual figured out how to represent a cube using a 14 vertex strip. That’s a considerable improvement over the 24 verts we might emit if we did it 2 triangles at a time. Here is the full geometry shader:

    struct GSIn
    {
        float4 R0 : XFORM0;
        float4 R1 : XFORM1;
        float4 R2 : XFORM2;
    };

    struct GSOut
    {
        float4 v : SV_Position;
    };

    void emit( inout TriangleStream<GSOut> triStream, float4 v )
    {
        GSOut s;
        s.v = v;
        triStream.Append(s);
    }

    uniform row_major float4x4 g_ViewProj;

    void GenerateTransformedBox( out float4 v[8], float4 R0, float4 R1, float4 R2 )
    {
        float4 center = float4( R0.w,R1.w,R2.w,1);
        float4 X = float4( R0.x,R1.x,R2.x,0);
        float4 Y = float4( R0.y,R1.y,R2.y,0);
        float4 Z = float4( R0.z,R1.z,R2.z,0);
        center = mul( center, g_ViewProj );
        X = mul( X, g_ViewProj );
        Y = mul( Y, g_ViewProj );
        Z = mul( Z, g_ViewProj );

        float4 t1 = center - X - Z;
        float4 t2 = center + X - Z;
        float4 t3 = center - X + Z;
        float4 t4 = center + X + Z;
        v[0] = t1 + Y;
        v[1] = t2 + Y;
        v[2] = t3 + Y;
        v[3] = t4 + Y;
        v[4] = t1 - Y;
        v[5] = t2 - Y;
        v[6] = t4 - Y;
        v[7] = t3 - Y;
    }

    // http://www.asmcommunity.net/forums/topic/?id=6284
    static const int INDICES[14] =
    {
        4, 3, 7, 8, 5, 3, 1, 4, 2, 7, 6, 5, 2, 1,
    };

    [maxvertexcount(14)]
    void main( point GSIn box[1], inout TriangleStream<GSOut> triStream )
    {
        float4 v[8];
        GenerateTransformedBox( v, box[0].R0, box[0].R1, box[0].R2 );

        // Indices are off by one, so we just let the optimizer fix it
        [unroll]
        for( int i=0; i<14; i++ )
            emit(triStream, v[INDICES[i]-1] );
    }
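
It is easy to get a strip like this wrong, so here is a small Python sanity check, offered as a sketch rather than anything from the original demo. `CORNER_SIGNS` is read off `GenerateTransformedBox` (the sign applied to X, Y, Z for each of `v[0]` through `v[7]`), and the script confirms that the 14 indices produce 12 non-degenerate triangles, two on each of the six faces. It does not check winding order.

```python
from collections import defaultdict

# (X, Y, Z) signs for v[0]..v[7], read off GenerateTransformedBox.
CORNER_SIGNS = [
    (-1, +1, -1), (+1, +1, -1), (-1, +1, +1), (+1, +1, +1),
    (-1, -1, -1), (+1, -1, -1), (+1, -1, +1), (-1, -1, +1),
]

# One-based indices, exactly as they appear in the shader.
INDICES = [4, 3, 7, 8, 5, 3, 1, 4, 2, 7, 6, 5, 2, 1]


def strip_triangles(indices):
    """Expand a one-based triangle strip into zero-based triangles."""
    zero_based = [i - 1 for i in indices]
    return [tuple(zero_based[i:i + 3]) for i in range(len(zero_based) - 2)]


def face_of(tri):
    """Return (axis, sign) of the cube face holding all three corners, or None."""
    for axis in range(3):
        signs = {CORNER_SIGNS[v][axis] for v in tri}
        if len(signs) == 1:
            return axis, signs.pop()
    return None


triangles = strip_triangles(INDICES)
face_corners = defaultdict(set)   # face -> corners touched by its triangles
face_tri_count = defaultdict(int) # face -> number of triangles on it
for tri in triangles:
    face = face_of(tri)
    face_corners[face].update(tri)
    face_tri_count[face] += 1
```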

## A Rather Deceptive Static Analysis

Here is the DX bytecode for our instanced vertex shader (the one that takes the transform rows as vertex inputs):

    vs_5_0
    dcl_globalFlags refactoringAllowed
    dcl_constantbuffer cb0[4], immediateIndexed
    dcl_input v0.xyzw
    dcl_input v1.xyzw
    dcl_input v2.xyzw
    dcl_input v3.xyzw
    dcl_output_siv o0.xyzw, position
    dcl_temps 2
    dp4 r0.x, v0.xyzw, v2.xyzw
    mul r0.xyzw, r0.xxxx, cb0[1].xyzw
    dp4 r1.x, v0.xyzw, v1.xyzw
    mad r0.xyzw, r1.xxxx, cb0[0].xyzw, r0.xyzw
    dp4 r1.x, v0.xyzw, v3.xyzw
    mad r0.xyzw, r1.xxxx, cb0[2].xyzw, r0.xyzw
    add o0.xyzw, r0.xyzw, cb0[3].xyzw
    ret

If we count flops, we find that our vertex shader contains 28 flops. A dp4 is a mul followed by 3 mads, and a mad is one flop regardless of what marketing people would like you to believe. Multiply that by 8, and we get a total of 224 flops to transform a box. We’ve only got 8 verts, so the post-TnL cache will remove the duplication. Here is the bytecode for the geometry shader:

    gs_5_0
    dcl_globalFlags refactoringAllowed
    dcl_constantbuffer cb0[4], immediateIndexed
    dcl_input v[1][0].xyzw
    dcl_input v[1][1].xyzw
    dcl_input v[1][2].xyzw
    dcl_temps 7
    dcl_inputprimitive point
    dcl_stream m0
    dcl_outputtopology trianglestrip
    dcl_output_siv o0.xyzw, position
    dcl_maxout 14
    mul r0.xyzw, cb0[1].xyzw, v[0][1].wwww
    mad r0.xyzw, v[0][0].wwww, cb0[0].xyzw, r0.xyzw
    mad r0.xyzw, v[0][2].wwww, cb0[2].xyzw, r0.xyzw
    add r0.xyzw, r0.xyzw, cb0[3].xyzw
    mul r1.xyzw, cb0[1].xyzw, v[0][1].xxxx
    mad r1.xyzw, v[0][0].xxxx, cb0[0].xyzw, r1.xyzw
    mad r1.xyzw, v[0][2].xxxx, cb0[2].xyzw, r1.xyzw
    add r2.xyzw, r0.xyzw, r1.xyzw
    add r0.xyzw, r0.xyzw, -r1.xyzw
    mul r1.xyzw, cb0[1].xyzw, v[0][1].zzzz
    mad r1.xyzw, v[0][0].zzzz, cb0[0].xyzw, r1.xyzw
    mad r1.xyzw, v[0][2].zzzz, cb0[2].xyzw, r1.xyzw
    add r3.xyzw, r1.xyzw, r2.xyzw
    add r2.xyzw, -r1.xyzw, r2.xyzw
    mul r4.xyzw, cb0[1].xyzw, v[0][1].yyyy
    mad r4.xyzw, v[0][0].yyyy, cb0[0].xyzw, r4.xyzw
    mad r4.xyzw, v[0][2].yyyy, cb0[2].xyzw, r4.xyzw
    add r5.xyzw, r3.xyzw, r4.xyzw
    add r3.xyzw, r3.xyzw, -r4.xyzw
    mov o0.xyzw, r5.xyzw
    emit_stream m0
    add r6.xyzw, r0.xyzw, r1.xyzw
    add r0.xyzw, r0.xyzw, -r1.xyzw
    add r1.xyzw, r4.xyzw, r6.xyzw
    add r6.xyzw, -r4.xyzw, r6.xyzw
    mov o0.xyzw, r1.xyzw
    emit_stream m0
    mov o0.xyzw, r3.xyzw
    emit_stream m0
    mov o0.xyzw, r6.xyzw
    emit_stream m0
    add r6.xyzw, -r4.xyzw, r0.xyzw
    add r0.xyzw, r4.xyzw, r0.xyzw
    mov o0.xyzw, r6.xyzw
    emit_stream m0
    mov o0.xyzw, r1.xyzw
    emit_stream m0
    mov o0.xyzw, r0.xyzw
    emit_stream m0
    mov o0.xyzw, r5.xyzw
    emit_stream m0
    add r1.xyzw, r2.xyzw, r4.xyzw
    add r2.xyzw, r2.xyzw, -r4.xyzw
    mov o0.xyzw, r1.xyzw
    emit_stream m0
    mov o0.xyzw, r3.xyzw
    emit_stream m0
    mov o0.xyzw, r2.xyzw
    emit_stream m0
    mov o0.xyzw, r6.xyzw
    emit_stream m0
    mov o0.xyzw, r1.xyzw
    emit_stream m0
    mov o0.xyzw, r0.xyzw
    emit_stream m0
    ret

If we look at just the math, our total here is 108 flops per box. I’m going to assume that the driver will get rid of those idiotic movs, but it’s hard to tell how much those ‘emit’ things actually cost. Still, we’re only emitting 56 dwords, so even if it’s around one ‘flop’ per dword, we’re still going to get a really nice speedup, or so we think.
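
The flop counts can be reproduced with a few lines of Python. The opcode tallies below are counted by hand from the two bytecode listings, and the costing follows the convention stated above: a dp4 is a mul plus three mads (4 flops), and each 4-wide mul, mad, or add is 4 flops. The dictionary names are just for this sketch.

```python
# Flops per instruction under the post's convention (a mad counts once per lane).
FLOPS_PER_OP = {"dp4": 4, "mul": 4, "mad": 4, "add": 4}

# Arithmetic opcode counts, tallied by hand from the bytecode listings.
VS_OPS = {"dp4": 3, "mul": 1, "mad": 2, "add": 1}
GS_OPS = {"mul": 4, "mad": 8, "add": 15}


def flops(op_counts):
    """Total flops for a shader given its arithmetic opcode counts."""
    return sum(FLOPS_PER_OP[op] * n for op, n in op_counts.items())


vs_flops_per_vertex = flops(VS_OPS)
vs_flops_per_box = 8 * vs_flops_per_vertex   # 8 unique verts per box
gs_flops_per_box = flops(GS_OPS)             # one GS invocation per box
gs_output_dwords = 14 * 4                    # 14 float4 positions
```

By this count the geometry shader does a little under half the arithmetic of the vertex shader path, which is where the expectation of a speedup comes from.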

## Results

I’ve run this on three different machines: an Nvidia GTX 670, an old AMD A10 APU (Radeon HD7760G, a VLIW-4 chip), and the Haswell graphics chip in my personal laptop (Core i3-4010U). I’m not using AMD’s or Nvidia’s latest and greatest architectures, so the comparison isn’t exactly fair to them, and I’d be happy to post updated results as I receive them. Still, my point here is to illustrate the pitfalls of geometry shaders, not to turn this into some sort of IHV shootout. These GPUs serve that purpose well enough. Results are normalized below so that nobody looks bad:

Update: Added R9 290 results from commenter smz.

The first thing we notice is that instancing sucks. If you are rendering a very small number of verts per instance, you want to do it without instancing, because all three geometry pipelines take a hit. Especially those poor blue fellas. This result is just plain irritating. In order to avoid whatever silly bottleneck this is, it’s necessary for us to create a redundant index buffer and add a few spurious instructions to the vertex shader. It’s counter-intuitive, and a tad wasteful. They probably need to flush the post-transform cache between instances, but they could at least try and mitigate this by adding a few instance ID bits to the cache tag. I’ll put up with a lower maximum index value if that helps.

The second thing we notice is that our GS idea was not as clever as we thought. On both the AMD and Nvidia parts, our clever idea, which was supposed to cut our workload in half, has instead hurt us. To understand why, let’s dig a little deeper.

## Why Half As Much Math Goes Slower

At this point we really need to see some actual shader assembly, but unfortunately, only one architecture lets us do this. So, I’ll do what I do during my day job. I will assume that everybody’s hardware is exactly the same as GCN, and make all of my shader design decisions accordingly. If the blue and green teams are made nervous by that, then perhaps they should start listening to my incessant whining and give me an offline bytecode compiler. 🙂

Here’s the GCN shader. The syntax looks different from what you’re used to, because I’m using my renegade disassembler.

    S_MOV_B32 M0, s9
    S_LOAD_DWORDX4 s12[4], s0[2], 48
    S_MOVK_I32 s2, 1792
    V_LSHLREV_B32 v0, 2, v0
    S_MOVK_I32 s3, 1024
    S_MOVK_I32 s9, 768
    S_MOVK_I32 s10, 1536
    S_MOVK_I32 s11, 2816
    S_MOVK_I32 s16, 1280
    S_MOVK_I32 s17, 512
    S_MOVK_I32 s18, 2048
    S_MOVK_I32 s19, 256
    S_MOVK_I32 s20, 2560
    S_WAITCNT vmcnt(15) expcnt(7) lkgmcnt(0)
    BUFFER_LOAD_DWORD v1, s12[4] [s2+v0] GLC+SLC
    S_MOVK_I32 s2, 2304
    S_NOP(1)
    BUFFER_LOAD_DWORD v2, s12[4] [s3+v0] GLC+SLC
    BUFFER_LOAD_DWORD v3, s12[4] [s9+v0] GLC+SLC
    BUFFER_LOAD_DWORD v4, s12[4] [s10+v0] GLC+SLC
    BUFFER_LOAD_DWORD v5, s12[4] [v0] GLC+SLC
    BUFFER_LOAD_DWORD v6, s12[4] [s11+v0] GLC+SLC
    BUFFER_LOAD_DWORD v7, s12[4] [s16+v0] GLC+SLC
    BUFFER_LOAD_DWORD v8, s12[4] [s17+v0] GLC+SLC
    BUFFER_LOAD_DWORD v9, s12[4] [s18+v0] GLC+SLC
    BUFFER_LOAD_DWORD v10, s12[4] [s19+v0] GLC+SLC
    BUFFER_LOAD_DWORD v11, s12[4] [s20+v0] GLC+SLC
    BUFFER_LOAD_DWORD v0, s12[4] [s2+v0] GLC+SLC
    S_BUFFER_LOAD_DWORDX4 s12[4], s4[2], 16
    S_BUFFER_LOAD_DWORDX4 s16[4], s4[2], 0
    S_BUFFER_LOAD_DWORDX4 s20[4], s4[2], 32
    S_BUFFER_LOAD_DWORDX4 s4[4], s4[2], 48
    S_WAITCNT vmcnt(11) expcnt(7) lkgmcnt(0)
    V_MUL_F32 v12, s12, v1
    V_MUL_F32 v13, s13, v1
    S_WAITCNT vmcnt(9) expcnt(7) lkgmcnt(15)
    V_MAC_F32 v12, s16, v3
    V_MUL_F32 v14, s12, v2
    V_MUL_F32 v15, s14, v1
    V_MAC_F32 v13, s17, v3
    S_WAITCNT vmcnt(6) expcnt(7) lkgmcnt(15)
    V_MAC_F32 v12, s20, v6
    V_MUL_F32 v16, s13, v2
    V_MAC_F32 v14, s16, v5
    V_MUL_F32 v17, s12, v4
    V_MUL_F32 v1, s15, v1
    V_MAC_F32 v15, s18, v3
    V_MAC_F32 v13, s21, v6
    V_ADD_F32 v12, s4, v12
    V_MUL_F32 v18, s14, v2
    V_MAC_F32 v16, s17, v5
    S_WAITCNT vmcnt(3) expcnt(7) lkgmcnt(15)
    V_MAC_F32 v14, s20, v9
    V_MUL_F32 v19, s13, v4
    V_MAC_F32 v17, s16, v8
    V_MUL_F32 v20, s12, v7
    V_MAC_F32 v1, s19, v3
    V_MAC_F32 v15, s22, v6
    V_ADD_F32 v3, s5, v13
    V_MUL_F32 v2, s15, v2
    V_MAC_F32 v18, s18, v5
    V_MAC_F32 v16, s21, v9
    V_ADD_F32 v13, v12, v14
    V_MUL_F32 v21, s14, v4
    V_MAC_F32 v19, s17, v8
    S_WAITCNT vmcnt(1) expcnt(7) lkgmcnt(15)
    V_MAC_F32 v17, s20, v11
    V_MUL_F32 v22, s13, v7
    V_MAC_F32 v20, s16, v10
    S_LOAD_DWORDX4 s0[4], s0[2], 64
    V_MAC_F32 v1, s23, v6
    V_ADD_F32 v6, s6, v15
    V_MAC_F32 v2, s19, v5
    V_MAC_F32 v18, s22, v9
    V_ADD_F32 v5, v3, v16
    V_MUL_F32 v4, s15, v4
    V_MAC_F32 v21, s18, v8
    V_MAC_F32 v19, s21, v11
    V_ADD_F32 v15, v13, v17
    V_MUL_F32 v23, s14, v7
    V_MAC_F32 v22, s17, v10
    S_WAITCNT vmcnt(0) expcnt(7) lkgmcnt(15)
    V_MAC_F32 v20, s20, v0
    V_ADD_F32 v1, s7, v1
    V_MAC_F32 v2, s23, v9
    V_ADD_F32 v9, v6, v18
    V_MAC_F32 v4, s19, v8
    V_MAC_F32 v21, s22, v11
    V_ADD_F32 v8, v5, v19
    V_MUL_F32 v7, s15, v7
    V_MAC_F32 v23, s18, v10
    V_MAC_F32 v22, s21, v0
    V_ADD_F32 v24, v15, v20
    V_ADD_F32 v25, v1, v2
    V_MAC_F32 v4, s23, v11
    V_ADD_F32 v11, v9, v21
    V_MAC_F32 v7, s19, v10
    V_MAC_F32 v23, s22, v0
    V_ADD_F32 v10, v8, v22
    V_ADD_F32 v26, v25, v4
    V_MAC_F32 v7, s23, v0
    V_ADD_F32 v0, v11, v23
    V_SUB_F32 v12, v12, v14
    V_ADD_F32 v14, v26, v7
    V_SUB_F32 v3, v3, v16
    S_WAITCNT vmcnt(15) expcnt(7) lkgmcnt(0)
    BUFFER_STORE_DWORD v24, s0[4] [s8] GLC+SLC
    V_ADD_F32 v16, v12, v17
    V_SUB_F32 v6, v6, v18
    BUFFER_STORE_DWORD v10, s0[4] [s8+56] GLC+SLC
    V_ADD_F32 v18, v3, v19
    V_ADD_F32 v27, v20, v16
    V_SUB_F32 v1, v1, v2
    BUFFER_STORE_DWORD v0, s0[4] [s8+112] GLC+SLC
    V_ADD_F32 v2, v6, v21
    V_ADD_F32 v28, v22, v18
    BUFFER_STORE_DWORD v14, s0[4] [s8+168] GLC+SLC
    V_ADD_F32 v29, v1, v4
    V_ADD_F32 v30, v23, v2
    S_SENDMSG GS: EMIT
    V_ADD_F32 v31, v7, v29
    BUFFER_STORE_DWORD v27, s0[4] [s8+4] GLC+SLC
    V_SUB_F32 v15, v15, v20
    BUFFER_STORE_DWORD v28, s0[4] [s8+60] GLC+SLC
    V_SUB_F32 v8, v8, v22
    BUFFER_STORE_DWORD v30, s0[4] [s8+116] GLC+SLC
    V_SUB_F32 v11, v11, v23
    BUFFER_STORE_DWORD v31, s0[4] [s8+172] GLC+SLC
    V_SUB_F32 v26, v26, v7
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v15, s0[4] [s8+8] GLC+SLC
    V_SUBREV_F32 v16, v20, v16
    BUFFER_STORE_DWORD v8, s0[4] [s8+64] GLC+SLC
    V_SUBREV_F32 v18, v22, v18
    BUFFER_STORE_DWORD v11, s0[4] [s8+120] GLC+SLC
    V_SUBREV_F32 v2, v23, v2
    BUFFER_STORE_DWORD v26, s0[4] [s8+176] GLC+SLC
    V_SUBREV_F32 v29, v7, v29
    S_SENDMSG GS: EMIT
    V_SUB_F32 v12, v12, v17
    BUFFER_STORE_DWORD v16, s0[4] [s8+12] GLC+SLC
    V_SUB_F32 v3, v3, v19
    BUFFER_STORE_DWORD v18, s0[4] [s8+68] GLC+SLC
    S_WAITCNT vmcnt(15) expcnt(1) lkgmcnt(15)
    V_SUBREV_F32 v16, v20, v12
    V_SUB_F32 v6, v6, v21
    BUFFER_STORE_DWORD v2, s0[4] [s8+124] GLC+SLC
    S_WAITCNT vmcnt(15) expcnt(0) lkgmcnt(15)
    V_SUBREV_F32 v2, v22, v3
    V_SUB_F32 v1, v1, v4
    BUFFER_STORE_DWORD v29, s0[4] [s8+180] GLC+SLC
    V_SUBREV_F32 v18, v23, v6
    S_SENDMSG GS: EMIT
    S_WAITCNT vmcnt(15) expcnt(0) lkgmcnt(15)
    V_SUBREV_F32 v29, v7, v1
    BUFFER_STORE_DWORD v16, s0[4] [s8+16] GLC+SLC
    BUFFER_STORE_DWORD v2, s0[4] [s8+72] GLC+SLC
    BUFFER_STORE_DWORD v18, s0[4] [s8+128] GLC+SLC
    BUFFER_STORE_DWORD v29, s0[4] [s8+184] GLC+SLC
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v27, s0[4] [s8+20] GLC+SLC
    V_ADD_F32 v12, v20, v12
    BUFFER_STORE_DWORD v28, s0[4] [s8+76] GLC+SLC
    V_ADD_F32 v3, v22, v3
    BUFFER_STORE_DWORD v30, s0[4] [s8+132] GLC+SLC
    V_ADD_F32 v6, v23, v6
    BUFFER_STORE_DWORD v31, s0[4] [s8+188] GLC+SLC
    V_ADD_F32 v1, v7, v1
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v12, s0[4] [s8+24] GLC+SLC
    BUFFER_STORE_DWORD v3, s0[4] [s8+80] GLC+SLC
    BUFFER_STORE_DWORD v6, s0[4] [s8+136] GLC+SLC
    BUFFER_STORE_DWORD v1, s0[4] [s8+192] GLC+SLC
    S_SENDMSG GS: EMIT
    V_SUB_F32 v13, v13, v17
    BUFFER_STORE_DWORD v24, s0[4] [s8+28] GLC+SLC
    V_SUB_F32 v5, v5, v19
    BUFFER_STORE_DWORD v10, s0[4] [s8+84] GLC+SLC
    S_WAITCNT vmcnt(15) expcnt(0) lkgmcnt(15)
    V_ADD_F32 v10, v13, v20
    V_SUB_F32 v9, v9, v21
    BUFFER_STORE_DWORD v0, s0[4] [s8+140] GLC+SLC
    S_WAITCNT vmcnt(15) expcnt(0) lkgmcnt(15)
    V_ADD_F32 v0, v5, v22
    V_SUB_F32 v4, v25, v4
    BUFFER_STORE_DWORD v14, s0[4] [s8+196] GLC+SLC
    S_WAITCNT vmcnt(15) expcnt(0) lkgmcnt(15)
    V_ADD_F32 v14, v9, v23
    S_SENDMSG GS: EMIT
    V_ADD_F32 v17, v4, v7
    BUFFER_STORE_DWORD v10, s0[4] [s8+32] GLC+SLC
    BUFFER_STORE_DWORD v0, s0[4] [s8+88] GLC+SLC
    BUFFER_STORE_DWORD v14, s0[4] [s8+144] GLC+SLC
    BUFFER_STORE_DWORD v17, s0[4] [s8+200] GLC+SLC
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v15, s0[4] [s8+36] GLC+SLC
    V_SUB_F32 v13, v13, v20
    BUFFER_STORE_DWORD v8, s0[4] [s8+92] GLC+SLC
    V_SUB_F32 v5, v5, v22
    BUFFER_STORE_DWORD v11, s0[4] [s8+148] GLC+SLC
    S_WAITCNT vmcnt(15) expcnt(1) lkgmcnt(15)
    V_SUB_F32 v8, v9, v23
    BUFFER_STORE_DWORD v26, s0[4] [s8+204] GLC+SLC
    V_SUB_F32 v4, v4, v7
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v13, s0[4] [s8+40] GLC+SLC
    BUFFER_STORE_DWORD v5, s0[4] [s8+96] GLC+SLC
    BUFFER_STORE_DWORD v8, s0[4] [s8+152] GLC+SLC
    BUFFER_STORE_DWORD v4, s0[4] [s8+208] GLC+SLC
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v16, s0[4] [s8+44] GLC+SLC
    BUFFER_STORE_DWORD v2, s0[4] [s8+100] GLC+SLC
    BUFFER_STORE_DWORD v18, s0[4] [s8+156] GLC+SLC
    BUFFER_STORE_DWORD v29, s0[4] [s8+212] GLC+SLC
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v10, s0[4] [s8+48] GLC+SLC
    BUFFER_STORE_DWORD v0, s0[4] [s8+104] GLC+SLC
    BUFFER_STORE_DWORD v14, s0[4] [s8+160] GLC+SLC
    BUFFER_STORE_DWORD v17, s0[4] [s8+216] GLC+SLC
    S_SENDMSG GS: EMIT
    BUFFER_STORE_DWORD v12, s0[4] [s8+52] GLC+SLC
    BUFFER_STORE_DWORD v3, s0[4] [s8+108] GLC+SLC
    BUFFER_STORE_DWORD v6, s0[4] [s8+164] GLC+SLC
    BUFFER_STORE_DWORD v1, s0[4] [s8+220] GLC+SLC
    S_SENDMSG GS: EMIT
    S_WAITCNT vmcnt(0) expcnt(7) lkgmcnt(15)
    S_SENDMSG GS: DONE
    S_ENDPGM

By looking at the disassembly we can see something interesting. The shader is writing all of its output to memory. That’s right. Every vertex we emit from an AMD geometry shader has to make a round-trip through memory. Now, before you go bashing AMD for this, think about why it is they might be doing this.

The API requires that the output of a geometry shader be rendered in input order. The fixed-function hardware on the other side is required to consume geometry shader outputs serially. This creates a sync point. If we want to process multiple primitives in parallel, it is necessary for GS instances to buffer up their outputs so that they can be fed in the correct order to whoever is consuming them. The more parallelism, and the more verts our GS emits, the more buffering we need.
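
To see why in-order consumption forces buffering, consider this toy Python model. It is a deliberately simplified illustration, not a description of how any real GPU schedules work: GS invocations finish in some arbitrary order, their outputs wait in a buffer, and a consumer drains them strictly in input order while we track the peak amount of buffered data.

```python
def drain_in_order(completion_order, verts_per_prim=14):
    """Simulate a consumer that must read GS outputs in input order.

    completion_order lists primitive IDs (0..n-1) in the order their
    GS invocations happen to finish. Returns the consumption order and
    the peak number of vertices that had to sit in the buffer.
    """
    pending = {}            # finished primitives still waiting their turn
    next_to_consume = 0
    consumed = []
    peak_buffered_verts = 0

    for prim in completion_order:
        pending[prim] = verts_per_prim
        peak_buffered_verts = max(peak_buffered_verts, sum(pending.values()))
        # The consumer can only advance while the next primitive in
        # input order has finished.
        while next_to_consume in pending:
            del pending[next_to_consume]
            consumed.append(next_to_consume)
            next_to_consume += 1

    return consumed, peak_buffered_verts


in_order = drain_in_order([0, 1, 2, 3])
worst_case = drain_in_order([3, 1, 2, 0])
```

When invocations complete in input order, one primitive's worth of storage is enough. When the first primitive finishes last, everything else has to be held until it arrives.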

Recall that GPU shader pipelines operate in SIMD fashion. The amount of buffering we need is determined by the SIMD width. AMD’s SIMD is 64 threads wide, which means that in our case they must buffer 14336 bytes for every GS wave. For Nvidia, it’s 7168 bytes per warp. On a state of the art R9 with 40 CUs, we need at least 160 waves just to keep all of the schedulers occupied, which translates to over 2MB of buffering, and you’ll still need more than that because one wave per SIMD is not enough to run well.
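
The arithmetic behind those figures is sketched below in Python. The constants come from this shader (14 vertices, one float4 position each) and from the numbers quoted above; the four SIMDs per CU is the GCN layout implied by 40 CUs needing 160 waves. The function name is only for this sketch.

```python
BYTES_PER_VERTEX = 16      # one float4 position
VERTS_PER_BOX = 14         # the triangle strip emitted per GS invocation


def gs_bytes_per_wave(simd_width):
    """Output that must be buffered for one full wave/warp of GS invocations."""
    return simd_width * VERTS_PER_BOX * BYTES_PER_VERTEX


amd_bytes_per_wave = gs_bytes_per_wave(64)
nv_bytes_per_warp = gs_bytes_per_wave(32)

# One wave per SIMD on a 40 CU part with 4 SIMDs per CU.
min_waves = 40 * 4
min_total_bytes = min_waves * amd_bytes_per_wave
min_total_mib = min_total_bytes / (1024 * 1024)
```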

There are only two places this buffering can exist. It’s either on chip, in a cache somewhere, or it’s off chip, in DRAM. If you put it on chip, you need to throttle the number of concurrent warps based on the amount of space you have, and if you put it off chip, it’s going to take that much longer for the consumer to get it, which means that unless the shader is really, really expensive there is no way you’ll be able to avoid being stalled on it. Back in the DX10 era, Nvidia went the on-chip route, and AMD went the off-chip route. I don’t think that either is particularly happy with the results.

Conversely, the vertex shader way, even though it does 2x as much work, only produces 16 bytes per thread, which translates to 1024 bytes/wave (AMD), or 512 (nv). By using a GS, we replace a large number of low-bandwidth threads with a small number of high bandwidth threads, and even though we perform less work, we still lose, because we’re not able to parallelize it as well.

## Why Intel’s Geometry Shaders Don’t Suck

This is my own speculation. I’d be curious to hear how close to the mark I am.

After a thorough perusal of their Linux graphics docs, it seems that their GS works by blocking threads. Each thread generates its output, puts it in registers, and waits its turn to feed it downstream. If an EU has a GS thread that is blocked at the sync point, it can start executing another GS thread while the first one is waiting. As long as the GS threads are doing a goodly amount of computation, the machine stays busy. I speculate that Intel gets away with this for two reasons:

1. Unlike the competition, Intel’s shader hardware has a full set of registers dedicated to each hardware thread. The red and green teams each lose thread occupancy if a shader has a lot of register pressure, but not the blue team. They just exploit their ridiculous process advantage, pack the little suckers in, and then stop worrying about it. Our shader has quite a bit of register pressure in it, but that doesn’t hurt Intel’s concurrency one bit. Their enormous register file functions as a big on-chip buffer.

2. Intel’s shader EUs are interesting in that they can operate in a variety of modes. They’re 8 wide, and can run in SIMD8 mode (where each 8 threads issue one operation), or SIMD16 mode (where 16 threads issue from back to back registers), or SIMD4x2 mode, where 2 threads each issue 4 operations. SIMD4x2 is used in Haswell for VS and GS, and I suspect that it’s the main reason for the awesomeness. Intel is only running two GS invocations at a time, and is replacing data-parallelism with instruction level parallelism, which means that their per-thread bandwidth is a measly 448 bytes, an order of magnitude lower than everybody else’s.

These two factors together mean that Intel doesn’t suffer nearly as badly from the sync point. It takes much, much less time to consume 448 bytes than 14336, which means that the wait time is bearable, and there are plenty of threads available to cook new batches while the old ones are blocked.
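
Putting the three architectures side by side makes the gap obvious. This Python sketch uses only numbers already quoted in this post (14 float4 vertices per invocation; 2, 32, and 64 invocations per hardware batch); the labels and function name are just for illustration.

```python
BYTES_PER_GS_INVOCATION = 14 * 16   # 14 float4 positions


def bytes_in_flight(invocations_per_batch):
    """GS output produced by one hardware thread/warp/wave before it can retire."""
    return invocations_per_batch * BYTES_PER_GS_INVOCATION


batch_bytes = {
    "Intel SIMD4x2 thread": bytes_in_flight(2),
    "Nvidia warp": bytes_in_flight(32),
    "AMD wave": bytes_in_flight(64),
}

intel = batch_bytes["Intel SIMD4x2 thread"]
ratio_vs_intel = {name: size // intel for name, size in batch_bytes.items()}
```

Each blocked Intel thread holds 16 to 32 times less output than a blocked warp or wave, so the consumer clears it correspondingly faster.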

Even though it is possible to implement geometry shaders efficiently, the fact that two of the three vendors don’t do it that way means that the GS is not a practical choice for production use. It should be avoided wherever possible.

It is flawed, in that it injects a serialized, high bandwidth operation into an already serialized part of the pipeline. It requires a lot of per-thread storage. It is clearly a very unnatural fit for wide SIMD machines. However, this little exercise has made me wonder if it can’t be redeemed by spreading a single instance across multiple warps/wavefronts, squeezing ILP out of a DLP architecture. Perhaps I’ll try and write a compute shader that does this.

Unfortunately, even if I did find a way to express my shader this way, it’s still not possible for me to USE such a shader as a GS. The APIs don’t permit it. In the future, perhaps we want a lower level model there, something like, “Here are N compute threads responsible for M input primitives.”, where N and M are both application-defined knobs. Food for thought, at least.
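
For what it’s worth, here is one way the “N threads for M primitives” idea might look, sketched on the CPU in Python. Everything here is hypothetical: the choice of eight lanes per box, the function names, and the assumption that the clip-space center and axes have already been computed. Each lane produces one corner, using the same sign pattern as `v[0]` through `v[7]` in the geometry shader.

```python
# (X, Y, Z) signs for corners 0..7, matching v[0]..v[7] in GenerateTransformedBox.
CORNER_SIGNS = [
    (-1, +1, -1), (+1, +1, -1), (-1, +1, +1), (+1, +1, +1),
    (-1, -1, -1), (+1, -1, -1), (+1, -1, +1), (-1, -1, +1),
]


def lane_work(lane, lanes_per_box=8):
    """Map a flat lane index to (box index, corner index)."""
    return lane // lanes_per_box, lane % lanes_per_box


def corner_position(c, x, y, z, corner):
    """One lane's job: a single corner from pre-transformed center and axes."""
    sx, sy, sz = CORNER_SIGNS[corner]
    return tuple(c[i] + sx * x[i] + sy * y[i] + sz * z[i] for i in range(4))


# Hypothetical clip-space inputs for one box (identity view-projection).
C = (0, 0, 0, 1)
X = (1, 0, 0, 0)
Y = (0, 1, 0, 0)
Z = (0, 0, 1, 0)

box, corner = lane_work(19)
corners = [corner_position(C, X, Y, Z, k) for k in range(8)]
```

With this layout each lane writes 16 bytes instead of 224, which is the same per-thread footprint as the vertex shader path.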

1. On the ISA compilation thing; I found out by accident an exploit to get NVIDIA to spit out the ISA with GLSL languages.

The following GLSL code will cause NVIDIA to generate invalid code, and hence dump its ISA (I tried this when using the debug context; I don’t know if it will spit it out using a non-debug context):

    uniform sampler2D myTex;
    layout( binding=0 ) uniform sampler2DArray myFaultyTex;

    layout(location = 0, index = 0) out vec4 outColour;

    void main()
    {
        outColour = texture( myTex, vec2(0,0 ) ) + texture( myFaultyTex, vec3(0,0, 0 ) );
    }

The bug lies in that myTex will automatically be assigned to binding point 0; then myFaultyTex gets assigned to the same binding point 0 explicitly but with a different type (sampler array vs sampler 2D).
This is not allowed by the specs and the driver will complain.

It’s a nasty workaround (exploit); but useful nonetheless to see the actual ISA of NVIDIA cards.

It’s a shame that NVIDIA drivers don’t allow dumping it as part of the normal process (and preferably offline).

2. smz

I wonder how the GCN chips actually do? You describe GCN’s structure and show GCN assembly but the graph is using an old VLIW4 design. Is it similar, worse or better? I saw a slight improvement (relative 5%) in another geometry shader workload going from 6950 to R9 290.

• smz

Non-instancing: ~1000FPS
GS: ~970FPS
Instancing: 230FPS

All three push the GPU to 100% usage, so that’s good I guess?

• Joshua Barczak

Thanks. What card? I’ll add this to the post.

• smz

R9 290.

3. Anonymous Coward

This is why there are GS invocations/instancing (which you can get with ARB_gpu_shader5, not sure if DX10 has them, DX11 definitely should). Basically you’d do the point transform in VS, and output your v[8] as a varying, and have 14 GS invocations, each of which emits a single box. That should perform better, I think.

4. That was a good read. The geometry shader is writing into memory on some platforms … one of the work arounds is to try to re-write every geometry shader based algorithm as vertex shader instancing.

The general advice for many platforms is that one doesn’t want to use the geometry shader at all since a few years. This changed over time with different hardware architectures. As far as I can tell the three hardware vendors you mention didn’t have any consistent support of the geometry shader over the years. It changed. The trend is going away from the geometry shader usage ….

On a related note: we should just drop all the different shader stages and use instead an extended compute shader that exposes all the functionality … and just does a couple of runs …

• Joshua Barczak

I completely agree. It’s just going to take a lot of prodding to get the IHVs on board.

5. Amazing article! Thanks for sharing!

I wonder if the geometry shader version will perform still as horrific if you only output the three visible sides of the cube. This would reduce the bandwidth pressure a bit and cut down the number of triangles for the rasterizer to half. Then again this would introduce a few branches and does not help the geometry shader stage performance itself much …