Why Geometry Shaders Are Slow (Unless you’re Intel)

Let’s look at a simple, constrained test case. Suppose I want to render a very large number of oriented boxes. That makes for a rather boring scene, yes, but there are legitimate use cases in which I might want to do it. I might have a deferred renderer that’s rendering light bounding volumes, or maybe I’m doing an object-space AO technique like the one Oat, Shopf, and I published in ShaderX7 under the title “Deferred Occlusion From Analytic Surfaces”. Or maybe I’m rendering lots of bounding volumes for occlusion culling. Let’s just pretend for a moment that this is an important workload. How do I do this as fast as possible?

For our experiments, we’ll render the boxes using a simple pixel shader that uses derivative trickery to extract a normal, and then does diffuse lighting. This is only really to give us something interesting to see and to make sure that we didn’t mangle the geometry:

row_major float4x4 g_RasterToWorld;
float4 main( float4 v : SV_Position ) : SV_Target
{
    float4 p = mul( float4(v.xyz,1), g_RasterToWorld);
    p.xyz/= p.w;
    float3 n = normalize( cross( ddx(p.xyz), ddy(p.xyz) ) );
    float3 L = normalize(float3(1,1,-1));
    float  d = saturate(dot(n,L));
    return float4(d.xxx,1);
}

We’ll also pick a viewpoint so that most of the boxes are off screen. I’ve repeated this for all boxes offscreen and gotten essentially the same result. You can find all of the code here. I built it on my laptop using VC++ express 2013. If you’re lucky, it might work.

Let’s consider three ways of rendering the boxes:

Method 1: The Obvious

An oriented box can be defined by a 3×4 transform matrix which deforms a unit cube into the box of interest. The columns of this matrix contain the center and axes of our box, as shown below:

\begin{bmatrix}  X_x & Y_x & Z_x & C_x \\  X_y & Y_y & Z_y & C_y \\  X_z & Y_z & Z_z & C_z \\\end{bmatrix}

So, let’s just render a bunch of instanced cubes. We have one unit box mesh, and a buffer packed full of transform matrices, and we use an ordinary instanced drawcall on it. Our vertex shader looks like this:

uniform row_major float4x4 g_ViewProj;
 
float4 main( 
    float4 v : POSITION, 
    float4 R0 : XFORM0,  // R0,R1,R2 are a 3x4 transform matrix
    float4 R1 : XFORM1, 
    float4 R2 : XFORM2 ) : SV_Position
{
    // deform unit box into desired oriented box
    float3 vPosWS = float3( dot(v,R0), dot(v,R1), dot(v,R2) );
 
    // clip-space transform
    return mul( float4(vPosWS,1), g_ViewProj );
}

I’m using float4s and dot products instead of a float3x4 matrix type, because TBH I can never keep N and M straight in my head, and the dot products are just easier on my brain.
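
For reference, here is one way the CPU side might pack a box into the three float4 rows that the shader reads. This is a sketch rather than the code from the sample: `Float4`, `BoxRows`, `PackBox`, and `TransformCorner` are hypothetical names, and it assumes the axes are stored as half-extents so that a unit cube with ±1 coordinates maps onto the box.

```cpp
#include <cassert>

// Hypothetical CPU-side mirror of an HLSL float4.
struct Float4 { float x, y, z, w; };

// Three rows of the 3x4 box matrix, matching XFORM0..XFORM2 in the shader.
struct BoxRows { Float4 R0, R1, R2; };

// Pack a box given its center C and half-extent axes X, Y, Z (3 floats each).
// Each row holds one world-space coordinate: (X_i, Y_i, Z_i, C_i).
inline BoxRows PackBox(const float C[3], const float X[3],
                       const float Y[3], const float Z[3])
{
    BoxRows b;
    b.R0 = { X[0], Y[0], Z[0], C[0] };
    b.R1 = { X[1], Y[1], Z[1], C[1] };
    b.R2 = { X[2], Y[2], Z[2], C[2] };
    return b;
}

inline float Dot(const Float4& a, const Float4& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Same math as the vertex shader: three dot products against a unit-cube
// corner v = (+-1, +-1, +-1, 1). Writes the world-space position to out[3].
inline void TransformCorner(const BoxRows& b, const Float4& v, float out[3])
{
    assert(v.w == 1.0f); // the center is only added in when w is 1
    out[0] = Dot(v, b.R0);
    out[1] = Dot(v, b.R1);
    out[2] = Dot(v, b.R2);
}
```

With this layout each dot product yields one coordinate of C ± X ± Y ± Z, which is the same quantity the matrix above produces.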

Method 2: The “Instancing Sucks” Way

Let’s do the same thing, except let’s not use instancing. The reasons for this will become clear shortly. Now, we don’t want to go doing one drawcall per box, because that would just introduce more overhead. We also don’t want to duplicate the same unit cube 250K times. That would consume unnecessary bandwidth. Instead, we’ll do this by generating a gigantic index buffer and doing SV_VertexID math. We know that cube i will reference vertices 8*i through 8*i+7, so we can figure out our own local instance and vertex ID from a flat index buffer. The only drawback is that now we need to fetch our vertex and instance data explicitly:

 
uniform row_major float4x4 g_ViewProj;
 
Buffer<float4> Verts;
Buffer<float4> XForms;
 
float4 main( 
    uint vid : SV_VertexID ) : SV_Position
{
    uint xform = vid/8;
    float4 v  = Verts[vid%8];
    float4 R0 = XForms[3*xform];
    float4 R1 = XForms[3*xform+1];
    float4 R2 = XForms[3*xform+2];
 
    // deform unit box into desired oriented box
    float3 vPosWS = float3( dot(v,R0), dot(v,R1), dot(v,R2) );
 
    // clip-space transform
    return mul( float4(vPosWS,1), g_ViewProj );
}
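
The index buffer itself is built on the CPU. A minimal sketch of how that might look is below. The `kCubeIndices` table and the `BuildBoxIndexBuffer` helper are hypothetical, not taken from the sample, and the corner numbering and winding are assumptions that would have to match the contents of the `Verts` buffer and the cull mode in use.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical unit-cube index list: 12 triangles over 8 corners, where
// corner k sits at (k&1 ? +1 : -1, k&2 ? +1 : -1, k&4 ? +1 : -1).
// The winding is an assumption; flip it if your rasterizer state disagrees.
static const uint32_t kCubeIndices[36] = {
    0, 2, 1,  1, 2, 3,   // -Z face
    4, 5, 6,  5, 7, 6,   // +Z face
    0, 4, 2,  2, 4, 6,   // -X face
    1, 3, 5,  3, 7, 5,   // +X face
    0, 1, 4,  1, 5, 4,   // -Y face
    2, 6, 3,  3, 6, 7,   // +Y face
};

// Build one flat index buffer covering boxCount boxes. Box i references
// vertex IDs 8*i .. 8*i+7, so the shader can recover the box with vid/8
// and the cube corner with vid%8.
std::vector<uint32_t> BuildBoxIndexBuffer(uint32_t boxCount)
{
    std::vector<uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(boxCount) * 36);
    for (uint32_t box = 0; box < boxCount; ++box)
        for (uint32_t k = 0; k < 36; ++k)
            indices.push_back(8 * box + kCubeIndices[k]);
    return indices;
}
```

With a 36-index cube like this one, 250K boxes need 9 million indices, and the largest vertex ID is far beyond what 16-bit indices can hold, so the buffer has to use 32-bit indices (roughly 36 MB).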

Method 3: The Clever Way

Let’s think outside the box about our boxes for a second. What is it that we have to do in order to draw a box? We need to compute the clip space positions of each of its 8 vertices. That’s all. So far, we’ve done this by doing a pair of matrix multiplications. We deform a unit cube into our oriented box, and then apply our view-projection matrix to the result. Like so:

 v = M_{vp} *\begin{bmatrix}  X_x & Y_x & Z_x & C_x \\  X_y & Y_y & Z_y & C_y \\  X_z & Y_z & Z_z & C_z \\\end{bmatrix}* \begin{bmatrix} \pm 1 \\ \pm 1 \\ \pm 1 \\ 1 \end{bmatrix}

If we account for the fact that our vertex coordinates are all ±1, then we can boil the world matrix down to a series of vector adds and subtracts, like so:

 v = M_{vp} * [ \begin{bmatrix} C_x \\ C_y \\ C_z \\ 1 \end{bmatrix}  \pm   \begin{bmatrix} X_x \\ X_y \\ X_z \\ 0 \end{bmatrix}   \pm    \begin{bmatrix} Y_x \\ Y_y \\ Y_z \\ 0 \end{bmatrix}   \pm    \begin{bmatrix} Z_x \\ Z_y \\ Z_z \\ 0 \end{bmatrix} ]

Now we can exploit the distributive property of matrix multiplication and factor out the view-projection transform, like so:

 v = M_{vp}C \pm M_{vp}X \pm M_{vp}Y \pm M_{vp}Z

The geometric interpretation of this is that instead of expanding our box in world-space, and then transforming the results to clip space, we instead pre-transform its basis vectors and center-point, and then do the expansion in clip space. By factoring things this way we can eliminate quite a few subexpressions.
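
To convince ourselves that the factoring is legitimate, here is a small CPU-side sketch that computes all eight corners both ways and reports the largest difference. Everything in it (`Vec4`, `Mat4`, `Mul`, `MaxCornerError`) is a hypothetical helper written for illustration. `Mul` follows the row-vector convention of HLSL’s `mul(v, M)`, and the comparison would normally be made with a small tolerance, since floating-point addition is not exactly distributive.

```cpp
#include <cmath>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; }; // row-major, used as row-vector * matrix

inline Vec4 Add(Vec4 a, Vec4 b) { return { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w }; }
inline Vec4 Scale(Vec4 a, float s) { return { a.x * s, a.y * s, a.z * s, a.w * s }; }

// Equivalent of HLSL mul(v, M) with the vector on the left.
inline Vec4 Mul(Vec4 v, const Mat4& M)
{
    Vec4 r;
    r.x = v.x * M.m[0][0] + v.y * M.m[1][0] + v.z * M.m[2][0] + v.w * M.m[3][0];
    r.y = v.x * M.m[0][1] + v.y * M.m[1][1] + v.z * M.m[2][1] + v.w * M.m[3][1];
    r.z = v.x * M.m[0][2] + v.y * M.m[1][2] + v.z * M.m[2][2] + v.w * M.m[3][2];
    r.w = v.x * M.m[0][3] + v.y * M.m[1][3] + v.z * M.m[2][3] + v.w * M.m[3][3];
    return r;
}

// Largest absolute per-component difference between the two formulations,
// taken over all 8 corners. C has w = 1; X, Y, Z have w = 0.
float MaxCornerError(const Mat4& vp, Vec4 C, Vec4 X, Vec4 Y, Vec4 Z)
{
    // Factored path: four transforms per box, done once up front.
    const Vec4 c = Mul(C, vp), x = Mul(X, vp), y = Mul(Y, vp), z = Mul(Z, vp);

    float err = 0.0f;
    for (int i = 0; i < 8; ++i) {
        const float sx = (i & 1) ? 1.0f : -1.0f;
        const float sy = (i & 2) ? 1.0f : -1.0f;
        const float sz = (i & 4) ? 1.0f : -1.0f;

        // Naive path: expand in world space, then one transform per corner.
        const Vec4 world = Add(Add(C, Scale(X, sx)), Add(Scale(Y, sy), Scale(Z, sz)));
        const Vec4 a = Mul(world, vp);

        // Factored path: expand the pre-transformed vectors in clip space.
        const Vec4 b = Add(Add(c, Scale(x, sx)), Add(Scale(y, sy), Scale(z, sz)));

        err = std::fmax(err, std::fabs(a.x - b.x));
        err = std::fmax(err, std::fabs(a.y - b.y));
        err = std::fmax(err, std::fabs(a.z - b.z));
        err = std::fmax(err, std::fabs(a.w - b.w));
    }
    return err;
}
```

The naive path does eight matrix multiplies per box, while the factored path does four and then only adds and subtracts.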

The final piece of the puzzle is to minimize the number of verts that our GS emits. Some clever individual figured out how to represent a cube using a 14 vertex strip. That’s a considerable improvement over the 24 verts we might emit if we did it 2 triangles at a time. Here is the full geometry shader:

struct GSIn
{
    float4 R0 : XFORM0;
    float4 R1 : XFORM1;
    float4 R2 : XFORM2;
};
struct GSOut
{
    float4 v : SV_Position;
};
 
void emit( inout TriangleStream<GSOut> triStream, float4 v )
{
    GSOut s;
    s.v = v;
    triStream.Append(s);
}
 
uniform row_major float4x4 g_ViewProj;
void GenerateTransformedBox( out float4 v[8], float4 R0, float4 R1, float4 R2 )
{
    float4 center =float4( R0.w,R1.w,R2.w,1);
    float4 X = float4( R0.x,R1.x,R2.x,0);
    float4 Y = float4( R0.y,R1.y,R2.y,0);
    float4 Z = float4( R0.z,R1.z,R2.z,0);
    center = mul( center, g_ViewProj );
    X = mul( X, g_ViewProj );
    Y = mul( Y, g_ViewProj );
    Z = mul( Z, g_ViewProj );
 
    float4 t1 = center - X - Z ;
    float4 t2 = center + X - Z ;
    float4 t3 = center - X + Z ;
    float4 t4 = center + X + Z ;
    v[0] = t1 + Y;
    v[1] = t2 + Y;
    v[2] = t3 + Y;
    v[3] = t4 + Y;
    v[4] = t1 - Y;
    v[5] = t2 - Y;
    v[6] = t4 - Y;
    v[7] = t3 - Y;
}
// http://www.asmcommunity.net/forums/topic/?id=6284
static const int INDICES[14] =
{
   4, 3, 7, 8, 5, 3, 1, 4, 2, 7, 6, 5, 2, 1,
};
 
[maxvertexcount(14)]
void main( point GSIn box[1], inout TriangleStream<GSOut> triStream )
{
    float4 v[8];
    GenerateTransformedBox( v, box[0].R0, box[0].R1, box[0].R2 );
 
    //  Indices are off by one, so we just let the optimizer fix it
    [unroll]
    for( int i=0; i<14; i++ )
        emit(triStream, v[INDICES[i]-1] );
}
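
As a sanity check on the 14-vertex strip, the following CPU-side sketch walks the 12 triangles it produces and confirms that each face of the cube gets exactly two triangles that tile it. The sign table is transcribed from the order in which `GenerateTransformedBox` writes `v[0..7]`. The helper names are hypothetical, and the check deliberately ignores winding order.

```cpp
// Corner signs (x, y, z) in the order GenerateTransformedBox writes v[0..7].
static const int kCornerSigns[8][3] = {
    {-1, +1, -1}, {+1, +1, -1}, {-1, +1, +1}, {+1, +1, +1},
    {-1, -1, -1}, {+1, -1, -1}, {+1, -1, +1}, {-1, -1, +1},
};

// The 1-based strip indices from the shader.
static const int kStrip[14] = { 4, 3, 7, 8, 5, 3, 1, 4, 2, 7, 6, 5, 2, 1 };

// Returns the face a triangle lies on (0:-X 1:+X 2:-Y 3:+Y 4:-Z 5:+Z),
// or -1 if its three corners do not share a face of the cube.
int FaceOfTriangle(int a, int b, int c)
{
    for (int axis = 0; axis < 3; ++axis) {
        const int s = kCornerSigns[a][axis];
        if (kCornerSigns[b][axis] == s && kCornerSigns[c][axis] == s)
            return axis * 2 + (s > 0 ? 1 : 0);
    }
    return -1;
}

// True if the strip yields two triangles per face and each pair tiles its
// face. A triangle on a face omits one of the face's four corners; two
// triangles tile the face exactly when the omitted corners are diagonal
// opposites, which makes the in-plane sign sums cancel.
bool StripCoversCube()
{
    int count[6] = {};
    int signSum[6][3] = {};
    for (int i = 0; i + 2 < 14; ++i) {
        const int tri[3] = { kStrip[i] - 1, kStrip[i + 1] - 1, kStrip[i + 2] - 1 };
        if (tri[0] == tri[1] || tri[1] == tri[2] || tri[0] == tri[2])
            return false; // degenerate triangle
        const int face = FaceOfTriangle(tri[0], tri[1], tri[2]);
        if (face < 0)
            return false; // triangle cuts through the interior
        ++count[face];
        for (int k = 0; k < 3; ++k)
            for (int axis = 0; axis < 3; ++axis)
                signSum[face][axis] += kCornerSigns[tri[k]][axis];
    }
    for (int face = 0; face < 6; ++face) {
        if (count[face] != 2)
            return false;
        for (int axis = 0; axis < 3; ++axis)
            if (axis != face / 2 && signSum[face][axis] != 0)
                return false;
    }
    return true;
}
```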

A Rather Deceptive Static Analysis

Here is the DX bytecode for the instanced vertex shader from Method 1:

vs_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[4], immediateIndexed
dcl_input v0.xyzw
dcl_input v1.xyzw
dcl_input v2.xyzw
dcl_input v3.xyzw
dcl_output_siv o0.xyzw, position
dcl_temps 2
dp4 r0.x, v0.xyzw, v2.xyzw
mul r0.xyzw, r0.xxxx, cb0[1].xyzw
dp4 r1.x, v0.xyzw, v1.xyzw
mad r0.xyzw, r1.xxxx, cb0[0].xyzw, r0.xyzw
dp4 r1.x, v0.xyzw, v3.xyzw
mad r0.xyzw, r1.xxxx, cb0[2].xyzw, r0.xyzw
add o0.xyzw, r0.xyzw, cb0[3].xyzw
ret

If we count flops, we find that our vertex shader contains 28 of them. A dp4 is a mul followed by 3 mads, and a mad is one flop regardless of what marketing people would like you to believe. Multiply that by 8, and we get a total of 224 flops to transform a box. Each box has only 8 unique verts, so the post-TnL cache will remove the duplication.

Here is the geometry shader:

gs_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[4], immediateIndexed
dcl_input v[1][0].xyzw
dcl_input v[1][1].xyzw
dcl_input v[1][2].xyzw
dcl_temps 7
dcl_inputprimitive point 
dcl_stream m0
dcl_outputtopology trianglestrip 
dcl_output_siv o0.xyzw, position
dcl_maxout 14
mul r0.xyzw, cb0[1].xyzw, v[0][1].wwww
mad r0.xyzw, v[0][0].wwww, cb0[0].xyzw, r0.xyzw
mad r0.xyzw, v[0][2].wwww, cb0[2].xyzw, r0.xyzw
add r0.xyzw, r0.xyzw, cb0[3].xyzw
mul r1.xyzw, cb0[1].xyzw, v[0][1].xxxx
mad r1.xyzw, v[0][0].xxxx, cb0[0].xyzw, r1.xyzw
mad r1.xyzw, v[0][2].xxxx, cb0[2].xyzw, r1.xyzw
add r2.xyzw, r0.xyzw, r1.xyzw
add r0.xyzw, r0.xyzw, -r1.xyzw
mul r1.xyzw, cb0[1].xyzw, v[0][1].zzzz
mad r1.xyzw, v[0][0].zzzz, cb0[0].xyzw, r1.xyzw
mad r1.xyzw, v[0][2].zzzz, cb0[2].xyzw, r1.xyzw
add r3.xyzw, r1.xyzw, r2.xyzw
add r2.xyzw, -r1.xyzw, r2.xyzw
mul r4.xyzw, cb0[1].xyzw, v[0][1].yyyy
mad r4.xyzw, v[0][0].yyyy, cb0[0].xyzw, r4.xyzw
mad r4.xyzw, v[0][2].yyyy, cb0[2].xyzw, r4.xyzw
add r5.xyzw, r3.xyzw, r4.xyzw
add r3.xyzw, r3.xyzw, -r4.xyzw
mov o0.xyzw, r5.xyzw
emit_stream m0
add r6.xyzw, r0.xyzw, r1.xyzw
add r0.xyzw, r0.xyzw, -r1.xyzw
add r1.xyzw, r4.xyzw, r6.xyzw
add r6.xyzw, -r4.xyzw, r6.xyzw
mov o0.xyzw, r1.xyzw
emit_stream m0
mov o0.xyzw, r3.xyzw
emit_stream m0
mov o0.xyzw, r6.xyzw
emit_stream m0
add r6.xyzw, -r4.xyzw, r0.xyzw
add r0.xyzw, r4.xyzw, r0.xyzw
mov o0.xyzw, r6.xyzw
emit_stream m0
mov o0.xyzw, r1.xyzw
emit_stream m0
mov o0.xyzw, r0.xyzw
emit_stream m0
mov o0.xyzw, r5.xyzw
emit_stream m0
add r1.xyzw, r2.xyzw, r4.xyzw
add r2.xyzw, r2.xyzw, -r4.xyzw
mov o0.xyzw, r1.xyzw
emit_stream m0
mov o0.xyzw, r3.xyzw
emit_stream m0
mov o0.xyzw, r2.xyzw
emit_stream m0
mov o0.xyzw, r6.xyzw
emit_stream m0 
mov o0.xyzw, r1.xyzw
emit_stream m0
mov o0.xyzw, r0.xyzw
emit_stream m0
ret

If we look at just the math, our total here is 108 flops per box. I’m going to assume that the driver will get rid of those idiotic movs, but it’s hard to tell how much those ‘emit’ things actually cost. Still, we’re only emitting 56 dwords, so even if it’s around one ‘flop’ per dword, we’re still going to get a really nice speedup, or so we think.
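
For anyone who wants to check the tallies, the sketch below reproduces them from instruction counts read off the two bytecode listings. The cost model is the one used in this post (one flop per scalar mul, add, or mad, with movs and emits not counted), and the function names are hypothetical.

```cpp
// Cost model from the text: every scalar mul, add, or mad is one flop,
// and a dp4 is one mul followed by three mads.
constexpr int kFlopsPerVec4Op = 4;
constexpr int kFlopsPerDp4 = 4;
constexpr int kVertsPerBox = 8;
constexpr int kStripVerts = 14;

// Vertex shader listing: 3 dp4, then 1 mul, 2 mad, and 1 add on full float4s.
constexpr int VertexShaderFlops()
{
    return 3 * kFlopsPerDp4 + (1 + 2 + 1) * kFlopsPerVec4Op;
}

// One vertex shader invocation per unique corner of the box.
constexpr int VertexShaderFlopsPerBox()
{
    return kVertsPerBox * VertexShaderFlops();
}

// Geometry shader listing: 4 mul, 8 mad, and 15 add, all on full float4s.
constexpr int GeometryShaderFlopsPerBox()
{
    return (4 + 8 + 15) * kFlopsPerVec4Op;
}

// Each emitted vertex is a single float4 position, i.e. 4 dwords.
constexpr int EmittedDwordsPerBox()
{
    return kStripVerts * 4;
}

static_assert(VertexShaderFlops() == 28, "per-vertex VS cost");
static_assert(VertexShaderFlopsPerBox() == 224, "per-box VS cost");
static_assert(GeometryShaderFlopsPerBox() == 108, "per-box GS cost");
static_assert(EmittedDwordsPerBox() == 56, "GS output size");
```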

Results

I’ve run this on three different machines: an Nvidia GTX 670, an old AMD A10 APU (Radeon HD7760G, a VLIW-4 chip), and the Haswell graphics chip in my personal laptop (Core i3-4010U). I’m not using AMD’s or Nvidia’s latest and greatest architectures, so the comparison isn’t exactly fair to them, and I’d be happy to post updated results as I receive them. Still, my point here is to illustrate the pitfalls of geometry shaders, not to turn this into some sort of IHV shootout. These GPUs serve that purpose well enough. Results are normalized below so that nobody looks bad:

Update: Added R9 290 results from commenter smz.

[Graph: normalized results for the three methods on each GPU]

The first thing we notice is that instancing sucks. If you are rendering a very small number of verts per instance, you want to do it without instancing, because all three geometry pipelines take a hit. Especially those poor blue fellas. This result is just plain irritating. In order to avoid whatever silly bottleneck this is, it’s necessary for us to create a redundant index buffer and add a few spurious instructions to the vertex shader. It’s counter-intuitive, and a tad wasteful. They probably need to flush the post-transform cache between instances, but they could at least try and mitigate this by adding a few instance ID bits to the cache tag. I’ll put up with a lower maximum index value if that helps.

The second thing we notice is that our GS idea was not as clever as we thought. On both the AMD and Nvidia parts, our clever idea, which was supposed to cut our workload in half, has instead hurt us. To understand why, let’s dig a little deeper.

Why Half As Much Math Goes Slower

At this point we really need to see some actual shader assembly, but unfortunately, only one architecture lets us do this. So, I’ll do what I do during my day job. I will assume that everybody’s hardware is exactly the same as GCN, and make all of my shader design decisions accordingly. If the blue and green teams are made nervous by that, then perhaps they should start listening to my incessant whining and give me an offline bytecode compiler. :)

Here’s the GCN shader. The syntax looks different from what you’re used to, because I’m using my renegade disassembler.

            S_MOV_B32     M0,    s9
        S_LOAD_DWORDX4 s12[4],  s0[2],     48
            S_MOVK_I32     s2,  1792
         V_LSHLREV_B32     v0,     2,    v0
            S_MOVK_I32     s3,  1024
            S_MOVK_I32     s9,   768
            S_MOVK_I32    s10,  1536
            S_MOVK_I32    s11,  2816
            S_MOVK_I32    s16,  1280
            S_MOVK_I32    s17,   512
            S_MOVK_I32    s18,  2048
            S_MOVK_I32    s19,   256
            S_MOVK_I32    s20,  2560
             S_WAITCNT     vmcnt(15) expcnt(7) lkgmcnt(0)
     BUFFER_LOAD_DWORD     v1,   s12[4] [s2+v0] GLC+SLC
            S_MOVK_I32     s2,  2304
                 S_NOP(1)
     BUFFER_LOAD_DWORD     v2,   s12[4] [s3+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v3,   s12[4] [s9+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v4,   s12[4] [s10+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v5,   s12[4] [v0] GLC+SLC
     BUFFER_LOAD_DWORD     v6,   s12[4] [s11+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v7,   s12[4] [s16+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v8,   s12[4] [s17+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v9,   s12[4] [s18+v0] GLC+SLC
     BUFFER_LOAD_DWORD    v10,   s12[4] [s19+v0] GLC+SLC
     BUFFER_LOAD_DWORD    v11,   s12[4] [s20+v0] GLC+SLC
     BUFFER_LOAD_DWORD     v0,   s12[4] [s2+v0] GLC+SLC
 S_BUFFER_LOAD_DWORDX4 s12[4],  s4[2],     16
 S_BUFFER_LOAD_DWORDX4 s16[4],  s4[2],      0
 S_BUFFER_LOAD_DWORDX4 s20[4],  s4[2],     32
 S_BUFFER_LOAD_DWORDX4  s4[4],  s4[2],     48
             S_WAITCNT     vmcnt(11) expcnt(7) lkgmcnt(0)
             V_MUL_F32    v12,   s12,    v1
             V_MUL_F32    v13,   s13,    v1
             S_WAITCNT     vmcnt(9) expcnt(7) lkgmcnt(15)
             V_MAC_F32    v12,   s16,    v3
             V_MUL_F32    v14,   s12,    v2
             V_MUL_F32    v15,   s14,    v1
             V_MAC_F32    v13,   s17,    v3
             S_WAITCNT     vmcnt(6) expcnt(7) lkgmcnt(15)
             V_MAC_F32    v12,   s20,    v6
             V_MUL_F32    v16,   s13,    v2
             V_MAC_F32    v14,   s16,    v5
             V_MUL_F32    v17,   s12,    v4
             V_MUL_F32     v1,   s15,    v1
             V_MAC_F32    v15,   s18,    v3
             V_MAC_F32    v13,   s21,    v6
             V_ADD_F32    v12,    s4,   v12
             V_MUL_F32    v18,   s14,    v2
             V_MAC_F32    v16,   s17,    v5
             S_WAITCNT     vmcnt(3) expcnt(7) lkgmcnt(15)
             V_MAC_F32    v14,   s20,    v9
             V_MUL_F32    v19,   s13,    v4
             V_MAC_F32    v17,   s16,    v8
             V_MUL_F32    v20,   s12,    v7
             V_MAC_F32     v1,   s19,    v3
             V_MAC_F32    v15,   s22,    v6
             V_ADD_F32     v3,    s5,   v13
             V_MUL_F32     v2,   s15,    v2
             V_MAC_F32    v18,   s18,    v5
             V_MAC_F32    v16,   s21,    v9
             V_ADD_F32    v13,   v12,   v14
             V_MUL_F32    v21,   s14,    v4
             V_MAC_F32    v19,   s17,    v8
             S_WAITCNT     vmcnt(1) expcnt(7) lkgmcnt(15)
             V_MAC_F32    v17,   s20,   v11
             V_MUL_F32    v22,   s13,    v7
             V_MAC_F32    v20,   s16,   v10
        S_LOAD_DWORDX4  s0[4],  s0[2],     64
             V_MAC_F32     v1,   s23,    v6
             V_ADD_F32     v6,    s6,   v15
             V_MAC_F32     v2,   s19,    v5
             V_MAC_F32    v18,   s22,    v9
             V_ADD_F32     v5,    v3,   v16
             V_MUL_F32     v4,   s15,    v4
             V_MAC_F32    v21,   s18,    v8
             V_MAC_F32    v19,   s21,   v11
             V_ADD_F32    v15,   v13,   v17
             V_MUL_F32    v23,   s14,    v7
             V_MAC_F32    v22,   s17,   v10
             S_WAITCNT     vmcnt(0) expcnt(7) lkgmcnt(15)
             V_MAC_F32    v20,   s20,    v0
             V_ADD_F32     v1,    s7,    v1
             V_MAC_F32     v2,   s23,    v9
             V_ADD_F32     v9,    v6,   v18
             V_MAC_F32     v4,   s19,    v8
             V_MAC_F32    v21,   s22,   v11
             V_ADD_F32     v8,    v5,   v19
             V_MUL_F32     v7,   s15,    v7
             V_MAC_F32    v23,   s18,   v10
             V_MAC_F32    v22,   s21,    v0
             V_ADD_F32    v24,   v15,   v20
             V_ADD_F32    v25,    v1,    v2
             V_MAC_F32     v4,   s23,   v11
             V_ADD_F32    v11,    v9,   v21
             V_MAC_F32     v7,   s19,   v10
             V_MAC_F32    v23,   s22,    v0
             V_ADD_F32    v10,    v8,   v22
             V_ADD_F32    v26,   v25,    v4
             V_MAC_F32     v7,   s23,    v0
             V_ADD_F32     v0,   v11,   v23
             V_SUB_F32    v12,   v12,   v14
             V_ADD_F32    v14,   v26,    v7
             V_SUB_F32     v3,    v3,   v16
             S_WAITCNT     vmcnt(15) expcnt(7) lkgmcnt(0)
    BUFFER_STORE_DWORD    v24,    s0[4] [s8] GLC+SLC
             V_ADD_F32    v16,   v12,   v17
             V_SUB_F32     v6,    v6,   v18
    BUFFER_STORE_DWORD    v10,    s0[4] [s8+56] GLC+SLC
             V_ADD_F32    v18,    v3,   v19
             V_ADD_F32    v27,   v20,   v16
             V_SUB_F32     v1,    v1,    v2
    BUFFER_STORE_DWORD     v0,    s0[4] [s8+112] GLC+SLC
             V_ADD_F32     v2,    v6,   v21
             V_ADD_F32    v28,   v22,   v18
    BUFFER_STORE_DWORD    v14,    s0[4] [s8+168] GLC+SLC
             V_ADD_F32    v29,    v1,    v4
             V_ADD_F32    v30,   v23,    v2
             S_SENDMSG  GS:  EMIT 
             V_ADD_F32    v31,    v7,   v29
    BUFFER_STORE_DWORD    v27,    s0[4] [s8+4] GLC+SLC
             V_SUB_F32    v15,   v15,   v20
    BUFFER_STORE_DWORD    v28,    s0[4] [s8+60] GLC+SLC
             V_SUB_F32     v8,    v8,   v22
    BUFFER_STORE_DWORD    v30,    s0[4] [s8+116] GLC+SLC
             V_SUB_F32    v11,   v11,   v23
    BUFFER_STORE_DWORD    v31,    s0[4] [s8+172] GLC+SLC
             V_SUB_F32    v26,   v26,    v7
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v15,    s0[4] [s8+8] GLC+SLC
          V_SUBREV_F32    v16,   v20,   v16
    BUFFER_STORE_DWORD     v8,    s0[4] [s8+64] GLC+SLC
          V_SUBREV_F32    v18,   v22,   v18
    BUFFER_STORE_DWORD    v11,    s0[4] [s8+120] GLC+SLC
          V_SUBREV_F32     v2,   v23,    v2
    BUFFER_STORE_DWORD    v26,    s0[4] [s8+176] GLC+SLC
          V_SUBREV_F32    v29,    v7,   v29
             S_SENDMSG  GS:  EMIT 
             V_SUB_F32    v12,   v12,   v17
    BUFFER_STORE_DWORD    v16,    s0[4] [s8+12] GLC+SLC
             V_SUB_F32     v3,    v3,   v19
    BUFFER_STORE_DWORD    v18,    s0[4] [s8+68] GLC+SLC
             S_WAITCNT     vmcnt(15) expcnt(1) lkgmcnt(15)
          V_SUBREV_F32    v16,   v20,   v12
             V_SUB_F32     v6,    v6,   v21
    BUFFER_STORE_DWORD     v2,    s0[4] [s8+124] GLC+SLC
             S_WAITCNT     vmcnt(15) expcnt(0) lkgmcnt(15)
          V_SUBREV_F32     v2,   v22,    v3
             V_SUB_F32     v1,    v1,    v4
    BUFFER_STORE_DWORD    v29,    s0[4] [s8+180] GLC+SLC
          V_SUBREV_F32    v18,   v23,    v6
             S_SENDMSG  GS:  EMIT 
             S_WAITCNT     vmcnt(15) expcnt(0) lkgmcnt(15)
          V_SUBREV_F32    v29,    v7,    v1
    BUFFER_STORE_DWORD    v16,    s0[4] [s8+16] GLC+SLC
    BUFFER_STORE_DWORD     v2,    s0[4] [s8+72] GLC+SLC
    BUFFER_STORE_DWORD    v18,    s0[4] [s8+128] GLC+SLC
    BUFFER_STORE_DWORD    v29,    s0[4] [s8+184] GLC+SLC
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v27,    s0[4] [s8+20] GLC+SLC
             V_ADD_F32    v12,   v20,   v12
    BUFFER_STORE_DWORD    v28,    s0[4] [s8+76] GLC+SLC
             V_ADD_F32     v3,   v22,    v3
    BUFFER_STORE_DWORD    v30,    s0[4] [s8+132] GLC+SLC
             V_ADD_F32     v6,   v23,    v6
    BUFFER_STORE_DWORD    v31,    s0[4] [s8+188] GLC+SLC
             V_ADD_F32     v1,    v7,    v1
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v12,    s0[4] [s8+24] GLC+SLC
    BUFFER_STORE_DWORD     v3,    s0[4] [s8+80] GLC+SLC
    BUFFER_STORE_DWORD     v6,    s0[4] [s8+136] GLC+SLC
    BUFFER_STORE_DWORD     v1,    s0[4] [s8+192] GLC+SLC
             S_SENDMSG  GS:  EMIT 
             V_SUB_F32    v13,   v13,   v17
    BUFFER_STORE_DWORD    v24,    s0[4] [s8+28] GLC+SLC
             V_SUB_F32     v5,    v5,   v19
    BUFFER_STORE_DWORD    v10,    s0[4] [s8+84] GLC+SLC
             S_WAITCNT     vmcnt(15) expcnt(0) lkgmcnt(15)
             V_ADD_F32    v10,   v13,   v20
             V_SUB_F32     v9,    v9,   v21
    BUFFER_STORE_DWORD     v0,    s0[4] [s8+140] GLC+SLC
             S_WAITCNT     vmcnt(15) expcnt(0) lkgmcnt(15)
             V_ADD_F32     v0,    v5,   v22
             V_SUB_F32     v4,   v25,    v4
    BUFFER_STORE_DWORD    v14,    s0[4] [s8+196] GLC+SLC
             S_WAITCNT     vmcnt(15) expcnt(0) lkgmcnt(15)
             V_ADD_F32    v14,    v9,   v23
             S_SENDMSG  GS:  EMIT 
             V_ADD_F32    v17,    v4,    v7
    BUFFER_STORE_DWORD    v10,    s0[4] [s8+32] GLC+SLC
    BUFFER_STORE_DWORD     v0,    s0[4] [s8+88] GLC+SLC
    BUFFER_STORE_DWORD    v14,    s0[4] [s8+144] GLC+SLC
    BUFFER_STORE_DWORD    v17,    s0[4] [s8+200] GLC+SLC
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v15,    s0[4] [s8+36] GLC+SLC
             V_SUB_F32    v13,   v13,   v20
    BUFFER_STORE_DWORD     v8,    s0[4] [s8+92] GLC+SLC
             V_SUB_F32     v5,    v5,   v22
    BUFFER_STORE_DWORD    v11,    s0[4] [s8+148] GLC+SLC
             S_WAITCNT     vmcnt(15) expcnt(1) lkgmcnt(15)
             V_SUB_F32     v8,    v9,   v23
    BUFFER_STORE_DWORD    v26,    s0[4] [s8+204] GLC+SLC
             V_SUB_F32     v4,    v4,    v7
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v13,    s0[4] [s8+40] GLC+SLC
    BUFFER_STORE_DWORD     v5,    s0[4] [s8+96] GLC+SLC
    BUFFER_STORE_DWORD     v8,    s0[4] [s8+152] GLC+SLC
    BUFFER_STORE_DWORD     v4,    s0[4] [s8+208] GLC+SLC
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v16,    s0[4] [s8+44] GLC+SLC
    BUFFER_STORE_DWORD     v2,    s0[4] [s8+100] GLC+SLC
    BUFFER_STORE_DWORD    v18,    s0[4] [s8+156] GLC+SLC
    BUFFER_STORE_DWORD    v29,    s0[4] [s8+212] GLC+SLC
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v10,    s0[4] [s8+48] GLC+SLC
    BUFFER_STORE_DWORD     v0,    s0[4] [s8+104] GLC+SLC
    BUFFER_STORE_DWORD    v14,    s0[4] [s8+160] GLC+SLC
    BUFFER_STORE_DWORD    v17,    s0[4] [s8+216] GLC+SLC
             S_SENDMSG  GS:  EMIT 
    BUFFER_STORE_DWORD    v12,    s0[4] [s8+52] GLC+SLC
    BUFFER_STORE_DWORD     v3,    s0[4] [s8+108] GLC+SLC
    BUFFER_STORE_DWORD     v6,    s0[4] [s8+164] GLC+SLC
    BUFFER_STORE_DWORD     v1,    s0[4] [s8+220] GLC+SLC
             S_SENDMSG  GS:  EMIT 
             S_WAITCNT     vmcnt(0) expcnt(7) lkgmcnt(15)
             S_SENDMSG  GS: DONE
              S_ENDPGM

By looking at the disassembly we can see something interesting. The shader is writing all of its output to memory. That’s right. Every vertex we emit from an AMD geometry shader has to make a round-trip through memory. Now, before you go bashing AMD for this, think about why it is they might be doing this.

The API requires that the output of a geometry shader be rendered in input order. The fixed-function hardware on the other side is required to consume geometry shader outputs serially. This creates a sync point. If we want to process multiple primitives in parallel, it is necessary for GS instances to buffer up their outputs so that they can be fed in the correct order to whoever is consuming them. The more parallelism, and the more verts our GS emits, the more buffering we need.

Recall that GPU shader pipelines operate in SIMD fashion. The amount of buffering we need is determined by the SIMD width. AMD’s SIMD is 64 threads wide, which means that in our case they must buffer 14336 bytes for every GS wave (64 threads × 14 verts × 16 bytes). For Nvidia, it’s 7168 bytes per warp. On a state-of-the-art R9 with 40 CUs, we need at least 160 waves just to keep all of the schedulers occupied, which translates to over 2MB of buffering, and you’ll still need more than that because one wave per SIMD is not enough to run well.

There are only two places this buffering can exist. It’s either on chip, in a cache somewhere, or it’s off chip, in DRAM. If you put it on chip, you need to throttle the number of concurrent warps based on the amount of space you have, and if you put it off chip, it’s going to take that much longer for the consumer to get it, which means that unless the shader is really, really expensive there is no way you’ll be able to avoid being stalled on it. Back in the DX10 era, Nvidia went the on-chip route, and AMD went the off-chip route. I don’t think that either is particularly happy with the results.

Conversely, the vertex shader approach, even though it does 2x as much work, only produces 16 bytes per thread, which translates to 1024 bytes per wave (AMD) or 512 bytes per warp (Nvidia). By using a GS, we replace a large number of low-bandwidth threads with a small number of high-bandwidth threads, and even though we perform less work, we still lose, because we’re not able to parallelize it as well.
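
The buffering numbers are simple enough to work out in a few lines. The sketch below does so under the assumptions used in this post: every vertex is a single 16-byte float4 position, the GS emits 14 of them per thread, and a GCN CU has 4 SIMDs (which is what the 40 CU / 160 wave figure implies). The function names are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t kBytesPerVertex = 16;   // one float4 SV_Position
constexpr uint32_t kGsVertsPerThread = 14; // maxvertexcount in the GS

// Output that one SIMD batch (wave or warp) of GS threads must buffer.
constexpr uint32_t GsBytesPerWave(uint32_t simdWidth)
{
    return simdWidth * kGsVertsPerThread * kBytesPerVertex;
}

// Output of one SIMD batch of VS threads: a single vertex per thread.
constexpr uint32_t VsBytesPerWave(uint32_t simdWidth)
{
    return simdWidth * kBytesPerVertex;
}

// Minimum GS buffering with exactly one wave resident on every SIMD.
constexpr uint32_t MinGsBufferBytes(uint32_t cuCount, uint32_t simdsPerCu,
                                    uint32_t simdWidth)
{
    return cuCount * simdsPerCu * GsBytesPerWave(simdWidth);
}

static_assert(GsBytesPerWave(64) == 14336, "AMD GS wave");
static_assert(GsBytesPerWave(32) == 7168, "Nvidia GS warp");
static_assert(VsBytesPerWave(64) == 1024, "AMD VS wave");
static_assert(VsBytesPerWave(32) == 512, "Nvidia VS warp");
static_assert(MinGsBufferBytes(40, 4, 64) == 2293760, "40 CUs, one wave per SIMD");
```

Per wave, the GS produces 14 times as much output as the VS, which is the gap the fixed-function consumer has to drain in order.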

Why Intel’s Geometry Shaders Don’t Suck

This is my own speculation. I’d be curious to hear how close to the mark I am.

After a thorough perusal of their Linux graphics docs, it seems that their GS works by blocking threads. Each thread generates its output, puts it in registers, and waits its turn to feed it downstream. If an EU has a GS thread that is blocked at the sync point, it can start executing another GS thread while the first one is waiting. As long as the GS threads are doing a goodly amount of computation, the machine stays busy. I speculate that Intel gets away with this for two reasons:

1. Unlike the competition, Intel’s shader hardware has a full set of registers dedicated to each hardware thread. The red and green teams each lose thread occupancy if a shader has a lot of register pressure, but not the blue team. They just exploit their ridiculous process advantage, pack the little suckers in, and stop worrying about it. Our shader has quite a bit of register pressure in it, but that doesn’t hurt Intel’s concurrency one bit. Their enormous register file functions as a big on-chip buffer.

2. Intel’s shader EUs are interesting in that they can operate in a variety of modes. They’re 8 wide, and can run in SIMD8 mode (where 8 threads issue one operation), or SIMD16 mode (where 16 threads issue from back-to-back registers), or SIMD4x2 mode, where 2 threads each issue 4 operations. SIMD4x2 is used in Haswell for VS and GS, and I suspect that it’s the main reason for the awesomeness. Intel is only running two GS invocations at a time, and is replacing data-parallelism with instruction-level parallelism, which means that their per-thread bandwidth is a measly 448 bytes, an order of magnitude lower than everybody else’s.

These two factors together mean that Intel doesn’t suffer nearly as badly from the sync point. It takes much, much less time to consume 448 bytes than 14336, which means that the wait time is bearable, and there are plenty of threads available to cook new batches while the old ones are blocked.

Do Geometry Shaders Suck?

Even though it is possible to implement geometry shaders efficiently, the fact that two of the three vendors don’t do it that way means that the GS is not a practical choice for production use. It should be avoided wherever possible.

It is flawed, in that it injects a serialized, high bandwidth operation into an already serialized part of the pipeline. It requires a lot of per-thread storage. It is clearly a very unnatural fit for wide SIMD machines. However, this little exercise has made me wonder if it can’t be redeemed by spreading a single instance across multiple warps/wavefronts, squeezing ILP out of a DLP architecture. Perhaps I’ll try and write a compute shader that does this.

Unfortunately, even if I did find a way to express my shader this way, it’s still not possible for me to USE such a shader as a GS. The APIs don’t permit it. In the future, perhaps we want a lower level model there, something like, “Here are N compute threads responsible for M input primitives.”, where N and M are both application-defined knobs. Food for thought, at least.