Thoughts on Texel Shaders

UPDATE: Second post with discussion.

The past several years have been relatively quiet on the hardware feature set front. Most of the big news has been the low-level API revolution, which, if you ask me, has been a good thing. Drawcalls are now essentially free. Now that Vulkan is on its way and Metal and DX12 are gaining traction, it's time to turn our attention back to the hardware. What will be the key new graphics feature to get excited about in the future? What will we nag the HW people about now? In my opinion, that next big thing should be a Texel Shader.

The purpose of this post is to try and drum up support and provoke thought. I’m going to brainstorm what I think are some interesting and relevant use cases, and try to sketch out how the APIs and hardware could be extended to accommodate them.

Definitions

The term Texel Shader refers to a shader program which is executed on demand and computes a block of texels given integer texel locations.

The terms Procedural Texture or Virtual Texture refer to a texture of a particular size whose contents are generated on demand by a texel shader, and are passed through fixed-function filtering before being used. The contents of a Procedural Texture are addressable and can be expected to reside in cache after generation, but are not backed by memory. This is not to be confused with the Tiled Resources found in current APIs.

The term Calling Shader refers to a conventional graphics shader (VS,PS,CS, and so on) which is sampling a Procedural Texture. I will often refer to the calling shader as a PS, but it can actually be of any type.

The idea is that we define a texture whose contents are ephemeral and calculated on demand. We do a sampling operation, hardware figures out which texels we need, then goes and gets them, then filters them. The difference between a Procedural Texture and an ordinary one is that instead of hardware going and getting texels from memory, it goes and gets them by running a shader and caching the results.

Use Cases

Let us now contemplate some use cases.

Custom Formats

Suppose I happen across a use for an L7A1 texture. At best, I have to wait 2 years for it to be in somebody’s HW and another 4 or 5 years for it to go into the rest of the HW and the APIs before I can finally use it in production. During all that time, I must repeatedly champion my use case and convince people that yes, it really is a good idea to use this instead of L8A8, because a 50% size reduction is a worthwhile benefit. A texel shader would allow texels to be stored in memory in whatever non-standard form we want, and expanded into a filterable format on demand.
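
To make this concrete, here is a sketch of the per-texel expansion for a hypothetical L7A1 layout: 7 bits of luminance and 1 bit of alpha packed into a byte. The bit layout is invented purely for illustration; the point is that the unpack is just a couple of ALU ops that the texel shader would run before filtering.

// Hypothetical L7A1 layout: low 7 bits = luminance, high bit = alpha.
float2 UnpackL7A1( uint packedTexel )
{
    float lum   = float(packedTexel & 0x7F) / 127.0;
    float alpha = float((packedTexel >> 7) & 0x1);
    return float2( lum, alpha );
}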

Non-linear data encodings

People have experimented with putting YCoCg in DXT textures. See, for example, here and here.

If we had a texel shader, we could do the YCC->RGB transform on a per-texel basis. Besides amortizing a little bit of math, this has the added benefit of making the filtering correct. We could also implement tricks like BC4 luminance and sub-sampled BC5 chroma, and the filtering would still be correct.
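
The per-texel transform itself is tiny. For the usual biased YCoCg encoding it looks something like this (a sketch, assuming Co and Cg are stored with a 0.5 bias and ignoring any scale factor):

// YCoCg -> RGB, assuming Co/Cg are stored biased by 0.5.
float3 YCoCgToRGB( float3 ycocg )
{
    float Y   = ycocg.x;
    float Co  = ycocg.y - 0.5;
    float Cg  = ycocg.z - 0.5;
    float tmp = Y - Cg;
    return float3( tmp + Co,    // R
                   Y   + Cg,    // G
                   tmp - Co );  // B
}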

Another possibility is storing normal maps in a different way. There are some really high precision two-channel normal map encodings out there, but you can’t directly filter them because they’ll wrap around the sphere in funny ways. With a texel shader, we could transform the texels into Cartesian on the fly and then filter the results.
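
Octahedral mapping is one example of such an encoding. A texel shader would run something along these lines per texel (just a sketch, with e being the stored two-channel value remapped to [-1,1]), so that the fixed-function filter only ever sees well-behaved Cartesian vectors:

// Decode an octahedral-encoded normal back to a Cartesian unit vector.
float3 OctahedralDecode( float2 e )
{
    float3 n = float3( e.x, e.y, 1.0 - abs(e.x) - abs(e.y) );
    if( n.z < 0 )
    {
        // fold the lower hemisphere back over
        float2 s = float2( n.x >= 0 ? 1.0 : -1.0,
                           n.y >= 0 ? 1.0 : -1.0 );
        n.xy = ( 1.0 - abs(n.yx) ) * s;
    }
    return normalize( n );
}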

Channel Packing

Speaking of sub-sampling, a texel shader would allow us to take any texture and vary the per-channel resolution in any way we liked. It’s extremely common to “channel-pack” textures, by taking various greyscale masks and jamming them into the channels of an RGBA texture. This type of setup is widely used because it allows independent values to be sampled and filtered together, which is more cache-efficient and requires fewer texture instructions than filtering independent maps would. One of the drawbacks of this approach is that all of the values have to be packed at the same resolution, which means we either have to put a tedious constraint on the artist or re-sample the inputs. A texel shader might enable us to use different resolutions for the different packed channels, upsampling the lower resolution ones on the fly, then passing the results through filtering. This would enable some memory savings while still amortizing the filtering cost.
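
For illustration, here is roughly the per-texel work such a shader might do, with made-up inputs: an AO mask stored at full resolution, and roughness and dirt masks stored at half resolution and simply point-upsampled before everything is packed and handed to the filter (mip handling is hand-waved here):

Texture2D<float> g_AO;          // full resolution
Texture2D<float> g_Roughness;   // half resolution
Texture2D<float> g_Dirt;        // half resolution

float4 PackChannels( uint2 texel, uint mip )
{
    float ao    = g_AO.Load( int3( texel, mip ) );
    float rough = g_Roughness.Load( int3( texel >> 1, mip ) );  // point upsample
    float dirt  = g_Dirt.Load( int3( texel >> 1, mip ) );       // point upsample
    return float4( ao, rough, dirt, 0 );
}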

Block Tricks

ASTC is a very interesting format. It supports a variety of different block sizes, and the intent is that an artist can pick the smallest one whose quality level they can tolerate. What if, instead, we built an adaptive scheme? Use 8×8 blocks in most places, and split into four 4×4 blocks in regions of high error, with an auxiliary data structure that tells us which 8×8 blocks were split and where the corresponding 4×4 blocks are. The texel shader could access the auxiliary texture and then gather and decode the corresponding blocks.

We could also use a texel shader to implement missing BC formats. If I really like PVRTC (which I do), I could use texel shaders to let me render PVRTC content on everybody else’s GPUs. We could also invent our own BC formats. If I decided I was happy with a lower compression rate I could use 4-bit modulation and 32-bit block colors and have a really robust, high quality RGBA format with a 2:1 bit rate, without having to go cup-in-hand to hardware architects. We’d have to decompress into cache to do that, of course, but the quality boost might make it worthwhile.
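
As a sketch of the “invent your own format” idea, here is a decoder for a made-up modulation format: two RGBA8 endpoints per 4×4 block plus a 4-bit blend weight per texel. The layout is invented for illustration only (a true PVRTC-style scheme would also interpolate endpoints across neighboring blocks, which I’m not doing here):

struct ModBlock
{
    uint  colorA;       // RGBA8 endpoint A
    uint  colorB;       // RGBA8 endpoint B
    uint2 modulation;   // 16 x 4-bit per-texel weights
};

float4 UnpackRGBA8( uint c )
{
    return float4( c & 0xFF, (c >> 8) & 0xFF, (c >> 16) & 0xFF, c >> 24 ) / 255.0;
}

float4 DecodeModTexel( ModBlock block, uint texelIndex )  // texelIndex in 0..15
{
    uint word   = texelIndex >> 3;                // which uint holds this texel's weight
    uint shift  = ( texelIndex & 7 ) * 4;
    uint weight = ( block.modulation[word] >> shift ) & 0xF;
    return lerp( UnpackRGBA8( block.colorA ),
                 UnpackRGBA8( block.colorB ),
                 weight / 15.0 );
}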

Gradient Maps

The “gradient mapping” technique was very popular among artists a while back, given its use in Left 4 Dead.

The idea is that you author a single greyscale texture specifying the structure of whatever it is you’re texturing, and use a variety of color ramps to create variations on that structure cheaply and easily. For example, an artist paints a black and white wood grain texture, and then creates various kinds of wood by authoring different ramps. See here for examples.

This is simple to implement in a pixel shader, but unfortunately it doesn’t mip correctly. If you do this the obvious way, you end up indexing the color ramp with the filtered greyscale texture, which is nowhere near the same thing as filtering the final colors.
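
For reference, the obvious pixel-shader version looks something like this (names are made up). The problem is that grey has already been filtered and mip-mapped before it ever touches the ramp:

Texture2D    g_Grey;   // greyscale structure texture
Texture2D    g_Ramp;   // 1 x N color ramp
SamplerState g_Samp;

float4 GradientMap( float2 uv )
{
    float grey = g_Grey.Sample( g_Samp, uv ).r;          // filtered, mipped greyscale
    return g_Ramp.Sample( g_Samp, float2( grey, 0.5 ) ); // ramp indexed by a filtered value
}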

Authoring mips for a gradient map is probably not a hard problem. I suspect that it’s sufficient to do custom mip generation by snapping to the color from the ramp texture that most closely matches the filtered color. If that doesn’t work, some sort of least-squares fitting solution probably will. However, it’s not even worth my time solving that problem at the moment, because whatever solution I come up with will be useless. There is no way to sample the mipped representation on today’s hardware without doing manual filtering.
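
For what it’s worth, the snapping idea is just a nearest-neighbor search over the ramp, run offline or in a one-off pass while building the mips. A minimal sketch, assuming a 1 x N ramp (names are made up):

Texture2D<float3> g_Ramp;       // 1 x N color ramp
uint              g_RampWidth;

// Given a box-filtered color from the parent mip level, return the greyscale
// value whose ramp entry is closest to it.
float SnapToRamp( float3 filteredColor )
{
    uint  bestIndex = 0;
    float bestDist  = 1e30;
    for( uint i = 0; i < g_RampWidth; i++ )
    {
        float3 c = g_Ramp.Load( int3( i, 0, 0 ) );
        float  d = dot( c - filteredColor, c - filteredColor );
        if( d < bestDist )
        {
            bestDist  = d;
            bestIndex = i;
        }
    }
    return bestIndex / float( g_RampWidth - 1 );
}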

But… You Can Already Do This… Right?

Whenever I pitch these ideas to IHV people, I am inevitably asked two questions.

The first question is typically: why don’t you just render out to a texture? The answer is that I now need a texture. I need physical memory to hold my texture, and bandwidth to generate it, and then more bandwidth to sample the results. This largely defeats the purpose of using some of these compression-based tricks to begin with. There is also a very good chance that I will waste work and generate large portions of the texture that I will never need to touch during the frame. The counter-point to that is something like “use a UAV or stencil and mark the parts you might need”, to which I’ll say that now I need two additional passes instead of one. In the end, it boils down to the fact that this approach to the problem is simply not very scalable. Besides being more complex to implement, it also requires lots of mid-frame render target switches and corresponding synchronization, something which a lot of today’s hardware really sucks at (especially all those poor tilers).

IHVs: You do not want me thrashing back and forth between buffers and synchronizing all the time. You want me to be able to give you all my work in gigantic blobs so you can schedule it dynamically. It’s better for everyone that way.

The second question I’m asked is: Ok, why don’t you just do manual filtering in the shader? For one thing, it’s more expensive. Phenomenally so. Smart people have spent decades making very efficient hardware implementations of bilinear, trilinear, and anisotropic filtering, and if I want to use a non-standard texel encoding, I’ve got to throw all of that away and start from scratch. A lot of people don’t appreciate just how much more expensive that is. Below is HLSL code which purports to do a manual trilinear filter:

 
float2 g_fSize;
Texture2D<float> tx;
 
float4 bilerp( float2 uv, float lod )
{
	float2 coords = (uv)/exp(lod);
	coords = coords-0.5;
	float2 weights = frac(coords);
 
	int3 icoords = int3(coords,lod);
	float4 t0 = tx.Load( icoords, uint2(0,0) );
	float4 t1 = tx.Load( icoords, uint2(1,0) );
	float4 t2 = tx.Load( icoords, uint2(0,1) );
	float4 t3 = tx.Load( icoords, uint2(1,1) );
 
	return lerp( lerp(t0,t1,weights.x),
	             lerp(t2,t3,weights.x), weights.y );
}
 
float4 main( float2 uv : uv ) : SV_Target
{
	uv = uv*g_fSize;
	float2 sizes = abs(ddx(uv)+ddy(uv));
	float lod = log2( max(sizes.x,sizes.y) );
 
	float4 l0 = bilerp(uv,floor(lod));
	float4 l1 = bilerp(uv,floor(lod)+1);
	return lerp(l0,l1,frac(lod));
}

I haven’t tested that code to verify its correctness, nor have I bothered to sit down and really optimize the crap out of it, but it’s close enough for me to make my point, which is that it’s very expensive, and a pain in the neck.

Here’s the GCN disassembly from Pyramid.


             S_MOV_B64  s2[2],  EXEC
             S_WQM_B64   EXEC,  EXEC
             S_MOV_B32     M0,   s16
 S_BUFFER_LOAD_DWORDX2  s0[2], s12[4],      0 
 V_INTERP_P1_F32(a0.X)     v2 =  P10*v0 + P0
 V_INTERP_P1_F32(a0.Y)     v0 =  P10*v0 + P0
 V_INTERP_P2_F32(a0.X)     v2 += P20*v1
 V_INTERP_P2_F32(a0.Y)     v0 += P20*v1
             S_WAITCNT     lkgmcnt(0) 
             V_MUL_F32     v1,    s0,    v2   
             V_MUL_F32     v0,    s1,    v0   
        DS_SWIZZLE_B32     v0,    v2,   0x8055
        DS_SWIZZLE_B32     v0,    v3,   0x8000
        DS_SWIZZLE_B32     v0,    v4,   0x8055
        DS_SWIZZLE_B32     v0,    v5,   0x8000
        DS_SWIZZLE_B32     v0,    v6,   0x80aa
        DS_SWIZZLE_B32     v0,    v7,   0x80aa
             S_WAITCNT     lkgmcnt(4) 
             V_SUB_F32     v2,    v2,    v3   
             S_WAITCNT     lkgmcnt(2) 
             V_SUB_F32     v4,    v4,    v5   
             S_WAITCNT     lkgmcnt(1) 
             V_SUB_F32     v3,    v6,    v3   
             S_WAITCNT     lkgmcnt(0) 
             V_SUB_F32     v5,    v7,    v5   
             V_ADD_F32     v2,    v2,    v3   
             V_ADD_F32     v3,    v4,    v5   
             V_MAX_F32     v2,  |v2|,  |v3|   
             V_LOG_F32     v2,    v2   
           V_FLOOR_F32     v3,    v2   
             V_ADD_F32     v4,   1.0,    v3   
             V_MUL_F32     v5,0x3fb8aa3b,    v4   
             V_MUL_F32     v6,0x3fb8aa3b,    v3   
             V_EXP_F32     v5,    v5   
             V_EXP_F32     v6,    v6   
             V_RCP_F32     v5,    v5   
             V_RCP_F32     v6,    v6   
             V_MAD_F32     v7,    v1,    v5,  -0.5   
             V_MAD_F32     v5,    v0,    v5,  -0.5   
             V_MAD_F32     v1,    v1,    v6,  -0.5   
             V_MAD_F32     v0,    v0,    v6,  -0.5   
         V_CVT_I32_F32    v22,    v7   
         V_CVT_I32_F32    v17,    v5   
         V_CVT_I32_F32    v21,    v1   
         V_CVT_I32_F32    v29,    v0   
         V_CVT_I32_F32    v20,    v4   
             V_ADD_I32    v18,     1,   v22   
             V_ADD_I32    v19,     1,   v17   
         V_CVT_I32_F32    v27,    v3   
             V_ADD_I32    v25,     1,   v21   
             V_ADD_I32    v26,     1,   v29   
     IMAGE_LOAD_MIP(R)    v15,   v18,    T: s4[8]    UNNORM
             V_MOV_B32    v23,   v19   
             V_MOV_B32    v24,   v20   
     IMAGE_LOAD_MIP(R)    v12,   v22,    T: s4[8]    UNNORM
             V_MOV_B32    v16,   v18   
             V_MOV_B32    v18,   v20   
     IMAGE_LOAD_MIP(R)    v11,   v16,    T: s4[8]    UNNORM
             V_MOV_B32    v18,   v22   
             V_MOV_B32    v19,   v17   
     IMAGE_LOAD_MIP(R)     v4,   v18,    T: s4[8]    UNNORM
     IMAGE_LOAD_MIP(R)     v6,   v25,    T: s4[8]    UNNORM
             V_MOV_B32    v22,   v26   
             V_MOV_B32    v23,   v27   
     IMAGE_LOAD_MIP(R)     v8,   v21,    T: s4[8]    UNNORM
             V_MOV_B32    v28,   v25   
             V_MOV_B32    v30,   v27   
     IMAGE_LOAD_MIP(R)    v13,   v28,    T: s4[8]    UNNORM
             V_MOV_B32    v25,   v21   
             V_MOV_B32    v26,   v29   
     IMAGE_LOAD_MIP(R)     v3,   v25,    T: s4[8]    UNNORM
           V_FRACT_F32     v7,    v7   
           V_FRACT_F32     v1,    v1   
      V_MIN_LEGACY_F32     v9,0x3f7fffff,    v7   
       V_CMP_CLASS_F32    VCC,    v7,     3   
      V_MIN_LEGACY_F32    v10,0x3f7fffff,    v1   
       V_CMP_CLASS_F32  s0[2],    v1,     3   
         V_CNDMASK_B32     v7,    v9,    v7   
           V_FRACT_F32     v5,    v5   
             S_WAITCNT     vmcnt(6) 
             V_SUB_F32     v9,   v15,   v12   
             S_WAITCNT     vmcnt(4) 
             V_SUB_F32    v11,   v11,    v4   
         V_CNDMASK_B32     v1,   v10,    v1   
           V_FRACT_F32     v0,    v0   
             S_WAITCNT     vmcnt(2) 
             V_SUB_F32     v6,    v6,    v8   
             S_WAITCNT     vmcnt(0) 
             V_SUB_F32    v10,   v13,    v3   
      V_MIN_LEGACY_F32    v13,0x3f7fffff,    v5   
       V_CMP_CLASS_F32  s0[2],    v5,     3   
             V_MAC_F32    v12,    v7,    v9   
             V_MAC_F32     v4,    v7,   v11   
      V_MIN_LEGACY_F32     v7,0x3f7fffff,    v0   
       V_CMP_CLASS_F32    VCC,    v0,     3   
             V_MAC_F32     v8,    v1,    v6   
             V_MAC_F32     v3,    v1,   v10   
           V_FRACT_F32     v1,    v2   
         V_CNDMASK_B32     v2,   v13,    v5   
             V_SUB_F32     v5,   v12,    v4   
         V_CNDMASK_B32     v0,    v7,    v0   
             V_SUB_F32     v6,    v8,    v3   
      V_MIN_LEGACY_F32     v7,0x3f7fffff,    v1   
       V_CMP_CLASS_F32    VCC,    v1,     3   
             V_MAC_F32     v4,    v2,    v5   
             V_MAC_F32     v3,    v0,    v6   
         V_CNDMASK_B32     v0,    v7,    v1   
             V_SUB_F32     v1,    v4,    v3   
             V_MAC_F32     v3,    v0,    v1   
             S_MOV_B64   EXEC, s2[2]
   V_CVT_PKRTZ_F16_F32     v0,    v3,    v3   
                 S_NOP(1)
              EXP_FP16     v0,  v0,  v0,  v0 (MRT0,VM,DONE)
              S_ENDPGM

And here’s what the PowerVR compiler did to a GLSL version of the same thing (don’t get me started about how painful the port was):

------------------- Disassembled HW Code -------------------- 

0    : byp ft0, ft1, c0, c1
       byp ft2, c0
       cbs ft3, c0
       byp ft4, ft1
       lsl ft5, ft4, c0
       tnz p0, ft5

1    : fmul ft0, c155, i2
       mov i0, ft0;

2    : mbyp ft0, c0
       fexp ft1, i0
       mov i1, ft1;

3    : pck.s1616.rndzero ft2, i2
       mov i0, ft2;

4    : byp ft0, ft1, c0, c16
       lsl ft2, i0, c16
       cbs ft3, i0
       or ft4, _, ft2, _, c0
       asr.twb ft5, ft4, c16
       mov r12, ft5;

5    : mbyp ft0, r12
       frcp ft1, i1
       mov r16, ft0;
       mov i0, ft1;

6    : fmad ft0, i0, r24, c75.neg
       pck.s1616.rndzero ft2, ft0
       mov i2, ft0;
       mov i1, ft2;

7    : iadd16 ft0, c0.e0, i1.e0
       fmad ft1, i0, r17, c75.neg
       mov r26, ft1;
       mov i0, ft0;

8    : byp ft0, ft1, c0, c16
       lsl ft2, i0, c16
       cbs ft3, i0
       or ft4, _, ft2, _, c0
       asr.twb ft5, ft4, c16
       mov r10, ft5;

9    : iadd16 ft0, c1.e0, i1.e0
       pck.s1616.rndzero ft2, r26
       mov i3, ft2;
       mov i0, ft0;

10   : byp ft0, ft1, c0, c16
       lsl ft2, i0, c16
       cbs ft3, i0
       or ft4, _, ft2, _, c0
       asr.twb ft5, ft4, c16
       mov r14, ft5;

11   : iadd16 ft0, c0.e0, i3.e0
       mov i0, ft0;

12   : byp ft0, ft1, c0, c16
       lsl ft2, i0, c16
       cbs ft3, i0
       or ft4, _, ft2, _, c0
       asr.twb ft5, ft4, c16
       mov r11, ft5;

13   : if(!p0)
{
       br 8
}
14   : byp ft0, ft1, c0, c1
       byp ft2, c0
       cbs ft3, c0
       byp ft4, ft1
       lsl ft5, ft4, c0
       tnz p0, ft5

15   : (ignorepe)
{
       smp2d.fcnorm.replace.pplod.integerUNKNOWN:.direct drc0, sh4, r10, sh0, _, r6, 4;
}
16   : mbyp ft0, r11
       mov r15, ft0;

17   : (ignorepe)
{
       smp2d.fcnorm.replace.pplod.integerUNKNOWN:.direct drc0, sh4, r14, sh0, _, r18, 4;
}
18   : wdf drc0

19   : fadd ft0, i2.neg.flr, i2
       fadd ft1, r6.neg, r18
       mov r18, ft0;
       mov i1, ft1;

20   : iadd16 ft0, c1.e0, i3.e0
       mov i0, ft0;

21   : byp ft0, ft1, c0, c16
       lsl ft2, i0, c16
       cbs ft3, i0
       or ft4, _, ft2, _, c0
       asr.twb ft5, ft4, c16
       mov r15, ft5;

22   : mbyp ft0, r12
       mov r16.e0.e1.e2.e3, ft0
       mov r11, r15;

23   : (ignorepe)
{
       smp2d.fcnorm.replace.pplod.integerUNKNOWN:.direct drc0, sh4, r10, sh0, _, r0, 4;
}
24   : fmad ft0, i1, r18, r6
       fadd ft1, r7.neg, r19
       mov i2, ft0;
       mov i0, ft1;

25   : wdf drc0

26   : if(!p0)
{
       br 8
}
27   : (ignorepe)
{
       smp2d.fcnorm.replace.pplod.integerUNKNOWN:.direct drc0, sh4, r14, sh0, _, r10, 4;
}
28   : fmad ft0, i0, r18, r7
       fadd ft1, r8.neg, r20
       mov i0, ft0;
       mov i1, ft1;

29   : fmad ft0, i1, r18, r8
       fadd ft1, r9.neg, r21
       mov i3, ft0;
       mov i1, ft1;

30   : wdf drc0

31   : fmad ft0, i1, r18, r9
       fadd ft1, r0.neg, r10
       mov i1, ft0;
       mov r21, ft1;

32   : fmad ft0, r21, r18, r0
       fadd ft1, r1.neg, r11
       mov r6, ft0;
       mov r7, ft1;

33   : fmad ft0, r7, r18, r1
       fadd ft1, r2.neg, r12
       mov r12, ft0;
       mov r0, ft1;

34   : fmad ft0, r0, r18, r2
       fadd ft1, r3.neg, r13
       mov r7, ft0;
       mov r1, ft1;

35   : fmad ft0, r1, r18, r3
       fadd ft1, r26.neg.flr, r26
       mov r13, ft0;
       mov r15, ft1;

36   : fadd ft0, i2.neg, r6
       fadd ft1, i0.neg, r12
       mov r2, ft0;
       mov r0, ft1;

37   : fmad ft0, r2, r15, i2
       fmad ft1, r0, r15, i0
       mov i2, ft0;
       mov i0, ft1;

38   : fmad ft0, r3, r15, i3
       fmad ft1, r1, r15, i1
       mov i3, ft0;
       mov i1, ft1;

39   : lapc

40   : UNKNOWN_OP(WAS:itr).pixel.schedwdf r0, drc0, cf4, 2, 1, cf0, 

41   : fmul ft0, sh9, r0
       mov r24, ft0;

42   : mbyp ft0, c0
       fdsx ft1, r24
       mov i3, ft1;

43   : fdsy ft0, r24
       fmul ft1, sh10, r1
       mov i2, ft0;
       mov r17, ft1;

44   : fdsx ft0, r17
       fdsy ft1, r17
       mov i1, ft0;
       mov i0, ft1;

45   : fmad ft0, sh8, i2, i3
       fmad ft1, sh8, i0, i1
       mov i1, ft0;
       mov i0, ft1;

46   : mbyp ft0, i0.abs
       mbyp ft1, i1.abs
       tstmax.f32 ftt, _, ft0, ft1
       mov i0.e0.e1.e2.e3, ft1, ftt, ft0, ft1

47   : mbyp ft0, c0
       flog ft1, i0
       mov r25, ft1;

48   : fadd ft0, r25.flr, c0
       mov i2, ft0;

49   : br.anyinst -586

50   : mbyp ft0, i0
       mov r22.e0.e1.e2.e3, i2
       mov r4, ft0;

51   : fadd ft0, r25.flr, c64
       mov i2, ft0;

52   : br.anyinst -618

53   : fadd ft0, r25.neg.flr, r25
       mov r2, ft0;

54   : fadd ft0, r22.neg, i2
       fadd ft1, r4.neg, i0
       mov i2, ft0;
       mov i0, ft1;

55   : fmad ft0, r2, i2, r22
       fmad ft1, r2, i0, r4
       mov r0, ft0;
       mov r1, ft1;

56   : fmad ft0, r2, i3, r23
       fmad ft1, r2, i1, r5
       mov r2, ft0;
       mov r3, ft1;

I’d be glad to show you disassembly for other architectures but *cough* nobody lets me *cough* 🙂

After going to all that trouble I’m still stuck with crappy trilinear filtering. I have never attempted to do an aniso filter by hand, and I never will, because by the time we reach that point, we’ve already lost. This is the wrong direction. Time to turn around.

Decoupled Computation

Besides being more expensive, shader-based filtering is doing things at the wrong computational frequency. Texel shaders enable us to amortize work across PS invocations. If we’re doing some computation to transform texels before filtering, we end up doing that work 8 times per pixel if we use a manual trilerp. With a texel shader, we’ll do that work once per texel, which, assuming ordinary mip-mapped sampling and sane access patterns, works out to about once per pixel on average. So, a texel shader is not only more convenient, it’s also (potentially) more efficient, at least in terms of operation counts. This amortization of work is even more extreme if the texture is magnified. This effect may not be very significant for the examples I’ve described so far, but this is the core idea underlying the next section.

Consider the problem of rendering participating media with single-scattering and shadows. I’m not going to go through the theory here; you can read all about it in Wojciech Jarosz’s thesis.

Bart Wronski presented a nifty implementation which uses compute shaders to precompute the scattering results into a view-dependent 3D texture, which is then sampled and applied to the pixels during the final rendering pass. Wronski’s is a three-pass approach: one pass generates a buffer full of in-scattered lighting, a second integrates it from back to front, and the result is then sampled during rendering.

What if, instead, we used a 3D texel shader which did ray-marching on demand? Because our texture doesn’t actually take up any memory, we can afford a much higher resolution, and because the texels are computed on demand, we can do the computation sparsely, only evaluating those texels which we know are needed during the frame, and without doing any prepass scene voxelization or other elaborate trickery to tell us which texels those are. You could apply this technique to basically anything else which casts rays (screenspace reflection/refraction come to mind).

Johan Andersson, in his “open problems” talk, briefly suggests another interesting idea, which is to compute texture-space sub-surface scattering on demand.

One thing I’m curious about is storing precomputed light probes using very high order SH (6 or 7) which are sparsely evaluated and then interpolated. What if we store DC, linear, and quadratic at lower resolutions than the higher-order terms, which offsets the extra memory cost a little bit? Maybe we do this hierarchically, like Chris Oat suggested long ago. A hierarchical representation is a touchy thing to evaluate in a pixel shader, but with a texel shader we can simply point-sample the thing at whatever frequency we like and interpolate the results.

VPLs anyone? We could scatter them all over the place, bin them, and SH project them into a large virtual 3D texture on the fly, at whatever resolution we liked, and with no more memory overhead than the cost of the VPLs themselves, and whatever Forward-plussy data structure we’re storing them in. Might be cheaper than doing them all per pixel.

How about sparse distance field evaluation?

How about random access rendering of vector graphics?

There are many, many interesting possibilities. In short, a texel shader is not just a complicated feature with limited scope, it is the first step on the road to a much better graphics pipeline. This, IMO, is the real reason to do it, not so much for the applications (though these are significant), as for the places it will enable us to go afterwards.

How This Would Work

Now that I’ve (hopefully) convinced you of the merits of a texel shader, I’ll try to convince you that it can be built without totally breaking the hardware.

DISCLAIMER: This section consists entirely of my own speculation, and is drawn from my own thinking and from conversations on Twitter. Any resemblance to any actual hardware design is entirely coincidental and serves only to demonstrate that I possess awesome predictive powers. It is much more likely that I do not.

Overview

If you squint and tilt your head, today’s texture pipeline looks like this:

[Figure: today_texture]

We have silicon that knows how to turn UV coordinates into memory addresses, other silicon that knows how to turn addresses into cache lines full of texels, and more silicon that knows how to read the cache and filter things.

We could implement procedural textures by leveraging the same machinery. Each virtual texture could be mapped into a range of address space. There would not need to be physical memory backing this address space; we just need an address. We already have the hardware that can generate texel addresses and filter texels, so that part wouldn’t need to change. We also have hardware that knows how to read the cache and detect a miss; we would need to take that hardware and give it the ability to do its job by launching shader threads instead of doing memory transactions.

When it’s done, the shader thread writes the resulting texels into the texture cache, and the sampling logic then pulls them out of the cache and filters them just like it did before. The texels can then remain in the cache until it’s time to evict them, so that subsequent fetches can re-use the results. Once the texels are evicted, they can simply disappear until they’re called for again. There’s no need for them to ever go to memory.

The new pipeline looks like this:

[Figure: tomorrow_texture]

I’ve diagrammed this with the shader writing back to the L1 cache, because it seems to make intuitive sense. L1 is where the filtering goes through, and it’s the place that the shader has the most bandwidth in and out of, and we want the latency on these texels to be as small as possible. I’m sort of assuming that the data stays in L1 and doesn’t fall out to L2, which means that different CUs/SMs/Slices/USSEs can’t share texel shader results between them, which might hurt a bit, but it also means that more of the L2 is free to hold data that is pulled in by the texel shader, instead of being cluttered up with computed texels that are also resident in L1. It also means that there’s no need to do any global synchronization or cache-coherency stuff on the texel shader results. They can just get written locally to the L1 and everything will be fine.

Whether or not that’s the right decision is a question for some hardware architect somewhere. Perhaps one of them will chime in and tell me why I’m wrong. I’d welcome the opportunity to learn from the wizards.

Scheduling

It’s easy for us to say “the hardware can just spawn a thread”, but it’s much harder to realize that in practice. If we have a pixel shader that wants to go off and run a texel shader, it is necessary to keep the pixel shader’s state around somehow in order for it to be able to resume execution. What do we do with all of this state? We could just keep it all on chip, and let it sit in the same registers it’s presently occupying. This is how texture fetches are handled today, but registers used to hold the caller’s state are registers which cannot be used to run the texel shader. This is a problem, because the PS thread has a dependency on the TS and cannot clear out until the TS has finished running, and in order to run, it needs its registers. We need to somehow guarantee that a TS thread can always run when required, or we’re hung.

Here are a few approaches I can think of. There may be others I haven’t thought about. Which option we choose has implications for the API, programming model, and texture unit design, so we’ll need to decide now.

Option 1: Throttle shader execution to make sure a TS can always fit

The idea here is to hold back some of the threads, registers, whateverses and reserve them for use by the TS when its time comes. This guarantees forward progress, but it also means we might lose some efficiency. We’d be reserving TS resources which might otherwise be used to execute other shaders. If the TS is lightweight, or the cache hit rate is unusually high, then we might lose some of that performance. Recursive TSes would work with this approach, but every level of recursion means more additional resources that we have to hold back.

Option 2: Split shaders into phases

The next option is to cut our calling shaders up into sub-shaders, each of which ends with one or more procedural texture lookups (except the last one). The hardware executes one stage to completion, figures out which texels are touched by the sampling operation, shades them, filters them, and then launches new threads to execute the next stage. The filtered result of the previous stage’s fetches are passed as an input parameter to the next stage.

This approach has the advantage that different stages could be mixed and matched in the API without too much headache. It gives the hardware considerable freedom in scheduling the texel processing, and also allows it to schedule the cache space that’s used to hold the texel shader results. TS results can be evicted from the cache straight away once the downstream thread is finished.

This is probably my least favorite option.

First off, it’s problematic to do things like accumulating multiple taps from the same procedural texture. This requires the ability to pass intermediate results between stages alongside the texel results. Probably not all that difficult, but annoying.

Also annoying is the fact that any sub-expressions in the shader which are needed on both sides of the procedural fetch would need to either be passed manually between stages or recomputed.

It’s also very difficult to see how control flow would work with this model. How would you even begin to write a loop containing a procedural fetch? You’d probably need to put the entire loop body into its own sequence of stages, and add the ability to dispatch the different stages dynamically. That actually sounds like a feature I’d like to have, but not for this purpose and definitely not as a kludge solution to a design problem.

This model seems like the ideal answer to the problem from a hardware POV, but it comes with a lot of uncomfortable wrinkles which leak into software.

Option 3: Recycle the calling thread

This idea can be viewed as a variant of the last one, except that this time it’s compiler folks who get thrown under the bus.

We’d require that the entire TS call graph be specified at compile time. We’d add special instructions for sampling procedural textures, and force the compiler to assume that some of its registers are volatile whenever a procedural texture is sampled from. We have our main shader, with its register footprint, the set of TSes, with their register footprints, and an evil sampling instruction which might decide to transfer control into a particular TS and clobber all the registers which that TS uses. Actually, it should probably be two instructions, one to start sending UVs to the texture unit, and another to flush the sampling and jump into the TS as necessary.

This allows the hardware to execute the texel shader by grabbing whichever thread triggered the miss and diverting it into a TS for texel processing. This thread is sitting there doing nothing, so we might as well stick it in our TS and put it to work.

For a given TS call, the compiler knows:

  1. What TS it might end up in
  2. How many registers this TS requires
  3. Which values are live across the TS

The compiler has considerable flexibility in deciding what to do with the live values. It can spill before entering the TS. It can make the TS spill for it. It can recompute clobbered values after exiting the TS, which is probably a great choice for things like interpolants. It can try to keep values in non-volatile registers to protect them from being clobbered by the TS. This is really not all that different from what CPU compilers have to do to handle procedure calls in C. In fact, it’s easier than that, because the whole program is available and the ABI and calling conventions are whatever we want them to be.

Programmers can help by attempting to structure their shaders so that the TS invocations happen in areas of low register pressure. They should probably be encouraged to pull their texels out of their TS as soon as possible.

Nested TS invocation is fine with this model. Recursive entry into the same TS is dicey, but doable as long as there’s a way to pass a stack pointer around in between the different instances. There would need to be some stack-like mechanism anyway to hold the return address, so extra information could probably be shoved into that. This approach does introduce some constraints on the API and programming model. You have to specify all your TSes at once, and you can’t mix and match different TSes with the same shader. Given the clear trend towards monolithic pipeline states in the APIs, I don’t think it’s a very serious problem.

This mechanism seems to me like the most robust way to implement the texel shader, so I’m going to assume it for the rest of the post. Instead of Texel Shaders, we should probably call these things something else, since the idea is probably usable for a lot more things besides texels. I propose the name: Asynchronous Shader Subroutines. The astute reader might enjoy re-reading this section and replacing ‘TS’ with the corresponding acronym.

Texture Unit Changes

There is one key difference between this model and the current texture pipeline, and that is that there is a loop from the texture unit through the shader pipeline and back again. To support the fully general case (texel shaders sampling other textures) it will be necessary to make significant changes to the texture unit. This is an important detail and it would be a disservice to the hardware guys to trivialize it. Unfortunately, I’m not much of a hardware person, but based on my limited knowledge I do believe that this is a solvable problem.

Consider today’s texture units. I don’t mean to be such an AMD fanboy, but they’re the only ones who put block diagrams like this on the internet:

[Figure: gcn_tex_unit]

Today, this entire process is a simple feed forward pipeline. UV coordinates go in one end and data comes out the other end into shader registers. If there’s a cache miss, then we’re going to be stuck waiting for a while, and hoping that the SIMDs have enough work in flight that we don’t notice. With our new design, we suddenly need to bring our stalled thread back to life and get it to do things for us. There are two problems with this:

1. The texture unit reads UV coordinates from the thread’s registers, and it takes multiple cycles to read a full wave’s worth of UVs. Unless we want to make the compiler’s job extremely difficult, we need a way to allow the TS to re-use those input registers. This means that we need to add buffering to hang on to the values that they contain, in case we end up missing multiple times.

2. If we want the TS to be able to sample other textures, then we need a way for the texture sampling process to be made re-entrant. Our calling thread is already stalled on the texture pipeline, so we will need a second path through the pipeline for our TS to be able to make progress.

Below is a sketch of how I think that these problems could be solved. The idea is to insert a queue in front of the texture unit that buffers up the UVs. If we want to support more than one level of TS recursion, then we insert an additional queue for each level we want. These queues sit in front of a big selector that pops the appropriate queue and passes the data down into the existing pipeline.

[Figure: reentrant_texture_unit]

Samples which hit the texture cache can flow through just like they normally would. In the event of a miss, we make sure that all of the remaining UVs for this fetch have made it into the buffers, then trigger the waiting thread to go get the texels we need. We would also need additional copies of the state of the address and weight calculations, and a mechanism to push and pop this state when the recursion level changes.

The worst-case amount of buffering (a gradient fetch on a 3D texture) is 36 bytes per tap (a three-component UVW coordinate plus two three-component gradients, i.e. nine floats), which works out to a little over 2KB of buffering for 64 threads. For comparison, the GCN VGPRs take up 64KB per SIMD, so while it’s not cheap, it shouldn’t be insanely expensive either.

There’s one additional caveat in that diagram, which is that we assume that the filtered texels can start being written into the destination registers as soon as they’re available. Depending on when exactly we missed the cache, we might wind up doing these writes while the TS is running. This means that the TS would not be allowed to use any register that is used as the destination register for a sampling instruction which invokes it.

We could avoid this, but it would mean inserting extra buffering on the return path, and, more importantly, it would add extra latency to our fetch. Instead of pumping texels across the return line every clock, as is presently done, we would need to buffer up the sampling results until the entire wave has cleared, because we don’t know which of them will have missed and we can’t clobber the register until we know the TS is clear. Buffering everything and then draining the buffer would cut our peak fetch rate in half.

If, instead, we make destination registers volatile in the TS, then this thing turns back into a nice, flowing pipeline. Assuming all that queuing and selection logic doesn’t impose more latency, we should (I think) be able to drive this at the same peak rates that the current pipeline runs at. This design still isn’t ideal, in that it’s limited to one TS wave per texture unit, but it’s a starting point, and it ought to be adequate for lightweight TSes. To do more than that we’d need a texture unit that knows how to track and reorder requests.

You should take this entire section with a grain of salt, because while I’ve written a lot of code in my life, I’ve never designed a circuit. I’m way out of my league with this stuff, but I hope that these scribblings will at least show that the idea is workable, and convince somebody who knows what they’re doing to give it a try. Again, I would welcome learned feedback telling me why I’m wrong.

API

Now that we’ve figured out what the hardware implementation will be like, we can sketch out the API for these things. I’m not going to try and cover every contingency here, just the basics. It turns out that even though we’ve settled on providing everything in one big fat shader, the API can still admit considerable flexibility.

Object Model

Here is a giant diagram illustrating how I think the API could work.

[Figure: api_objects]

We’ll start by pre-compiling the shader code for each TS. This produces an API object that contains code and resource binding information. We’ll also introduce the concept of a TS instance. A TS instance is a particular texture whose contents are generated by invoking a particular texel shader. The TS instance is created by specifying its texel shader, the dimensions of the virtual texture, and a set of resources (SRVs, samplers, CBVs) that are bound to it. This “bind per instance” idea is a slightly different model than is used by the other shader stages, but it seems logical, since there isn’t really any single pipeline stage named “TS”.

When a TS instance is created, the driver would allocate address space for its texels according to its dimensions. Since a TS instance is just an unusual kind of shader resource, we could create shader resource views of it, and stick these in descriptor tables just like we do now for ordinary textures. There would need to be some kind of barrier API to invalidate any cached texels whenever one of a TS’s resources gets modified, and it would probably work in a manner similar to the resource barriers that we already have.

Shaders which invoke procedural textures must understand that they’re doing so, but they don’t actually need to know which ones yet. We can compile calling shaders independently by defining a series of opaque “Function Slots”, and requiring the calling shader to specify one of them at sampling time, like so:

TexelShader g_Func0 : register(f0);
TexelShader g_Func1 : register(f1);
ProceduralTexture g_Texture0;
ProceduralTexture g_Texture1;
ProceduralTexture g_Texture2;
...
g_Texture0.Sample( g_Func0, sampler, uv );
g_Texture1.Sample( g_Func0, sampler, uv );
g_Texture2.Sample( g_Func1, sampler, uv );
....
g_Texture1.Sample( g_Func1, sampler, uv ); // illegal!
....

We could also tie the TexelShader to the ProceduralTexture object, as below, but that would make texture indexing a lot more problematic.

TexelShader g_Func0 : register(f1);
ProceduralTexture g_Texture0(g_Func0);

The expectation is that we wouldn’t actually generate final code for the calling shaders yet. Rather, the driver would generate whatever its internal IR representation is and store that off until PSO creation time. When we create a PSO, we’ll be required to specify a TS to bind with each of the function slots. Not the individual instances, mind you, just the functions. At PSO time, the driver can take all of the code, link everything together, allocate registers, generate any required spills, and so forth. If we liked, we could also allow the API to bind a NULL TS to a function slot, in order to revert back to ordinary texturing.

TS Programming Model

Now that we have figured out how texel shaders might be implemented, let’s figure out how they might be written. This is the correct order in which to do these things.

What we have here is basically a subroutine that’s being launched to fill cache lines with blocks of pixel data. I think that it would make the most sense to express this more like a compute shader (the “gang-of-threads” model), instead of like a pixel shader (the “I’m pretending it’s a CPU but it’s really not” model). Enabling multiple threads to coordinate on one block allows block-level computations to be factored out and possibly executed more efficiently. The programmer would express the texel shader in terms of a single thread, which happens to know which NxN block it belongs to. Block-invariant computations could easily be identified by the compiler and optimized accordingly.

The inputs to a texel shader would be:

  • The location of its NxN block in the virtual texture (x,y,z coords and mip level).
  • The indices of the texels from the NxN block, spread across SIMD lanes
  • The dimensions of the virtual texture (as specified at the API level)
  • Resource bindings (CBVs,samplers,textures)

The texel shader could do arbitrary math, do buffer and texture reads, and sample other textures. The output of the texel shader would be an NxN block of texels, which it would write to the cache using whatever mechanism works best for a particular piece of hardware. We’d probably need to heavily restrict the range of block sizes that a shader could declare, in order to ensure that everything was suitably aligned. Limiting to square blocks of reasonable size (1,2,4,8,16) would be prudent.

This would be a kind of “Compute Shader Lite”. We probably can’t use GroupSharedMemory for communication since everybody steals it to stage random graphics data, but we might be able to replace it with simple message passing primitives implemented using cross-lane operations (like GCN’s DS_SWIZZLE instruction). Limiting to one wave per block solves our communication problem, but replaces it with a utilization problem. A 4×4 block is only going to utilize 25% of an AMD wave and 50% of an Nvidia warp, and that’s bad, but there’s no reason we need to assume one block per wave or one pixel per lane. If we express the shader in terms of blocks, and not threads, then multiple independent blocks could be packed into the same wave, or a single wave could iterate over a big block and emit it piecemeal.

It would probably be necessary for the texel shader to declare its output format (RGBA8, RG8, etc) ahead of time. In order to reduce the shader’s output bandwidth, the shader should be required to emit the packed texels directly instead of relying on fixed-function hardware to do format conversion (as is done today for pixel shaders).

It would be useful to support emission of standard compressed blocks (e.g. DXT blocks) directly from the texel shader in packed form. Algorithms capable of providing DXT output could thus avoid the cost of decoding it first, and hardware which presently stores DXT blocks compressed in cache could continue to do so where possible. The motivating examples here are something like sparse storage of DXT blocks, or on-the-fly decoding of crunch streams. I include an example of the former below.

It would also be a good idea to add hardware support for decoding DXT blocks in registers using the existing decode circuits, and broadcasting the resulting pixels into the proper SIMD lanes for further processing. This would allow texel shader algorithms to use BC blocks as intermediate storage and efficiently decode and process the texels before writing them out.

Case Studies

Here are some examples I’ve cooked up, written in an imaginary TS language that I made up as I went along. None of this code has been debugged and it is probably mostly crap; it is intended for illustration only.

Here’s the “gradient map” idea from above:

Texture2D<uint8> GreyScale;      // Greyscale image: stored as BC4 blocks
Texture2D<uint32> Gradients;    // Gradient image (RGBA8 colors)
 
[dimension(2d)]
[blocksize(1)]    // run on single texels.  Let driver pack it however it will
[format(R8G8B8A8)]
void main( uint2 texel_indices  : SV_TSTexelIndices,
           uint  mip            : SV_TSMipLevel,
           out uint32 TEXEL_OUT : SV_TSOutput )
{
    uint8 g   = GreyScale.Load( uint3( texel_indices, mip ) );
    TEXEL_OUT = Gradients.Load( uint3( g, 0, 0 ) );   // index the 1 x N ramp with the greyscale value
}

Here’s a Quadtree texture that stores BC4 or DXT1 blocks at the leaves. This might make a great representation for mask images, which can be sparse and contain lots of homogeneous areas. For instance, this one, lifted from here:

[Figure: leaf]

If we allow the texel shader to directly emit BC4 blocks, then this entire shader can be implemented on the GCN scalar unit, which would leave the vector pipe available for some other concurrent wave. Paradoxically, in this case, launching one wave per 4×4 block might be the ideal solution.

Buffer<uint32> g_Tree;
 
bool IsLeaf(uint32 node) { return node&0x80000000; }
uint FirstChild( uint32 node ) { return (node&0x7fffffff); }
 
 
[dimension(2d)]
[blocksize(4)]  // Shader is run on virtual 4x4 blocks, but emits one BC block
[format(BC4)]
void main( uint2 block_coords   : SV_TSBlockIndices,
           uint2 texel_indices  : SV_TSTexelIndices,
           uint2 dims           : SV_TSDimensions,
           uint  mip_index      : SV_TSMipLevel,
           out uint64 BLOCK_OUT : SV_TSOutput )
{
      uint level_width = (dims.x>>mip_index) >> 2; // width of this mip in 4x4 blocks (texture assumed square)
      uint x = block_coords.x;
      uint y = block_coords.y;
      uint node = g_Tree[mip_index];  // Root nodes of each mip are all together
 
      // walk down the tree until we find a leaf node
      while( !IsLeaf(node) )
      {
           level_width >>= 1;      // width of each child, in blocks
           uint child_idx=0;
 
           if( x >= level_width )  // NOTE: Can and should flatten these, I'm lazy
           {
                 x -= level_width;
                 child_idx++;
           }
           if( y >= level_width )
           {
                 y -= level_width;
                 child_idx+=2;
           }
 
           node = g_Tree[child_idx + FirstChild(node)];
      } 
 
      // On a leaf node, the child pointer gives the start of its BC4 block
      //     Fetch the block and write it out
      uint block = FirstChild(node);
      uint64 lo = g_Tree[block];
      uint64 hi = g_Tree[block+1];
      BLOCK_OUT = (hi<<32) | lo;
}

One more: YCC DXT implemented using BC4 luma and BC5 chroma. Sub-sampled chroma are left as an exercise to the reader.

Buffer<uint64> g_BCBlocks;
[dimension(2d)]
[blocksize(4)]
[format(R8G8B8A8)]
void main( uint2 block_coords         : SV_TSBlockIndices,
           uint2 texel_indices        : SV_TSTexelIndices,
           uint2 dims                 : SV_TSDimensions,
           uint  mip_index            : SV_TSMipLevel,
           out uint32 BLOCK_OUT[4][4] : SV_TSOutput )   // one 64-byte line
{
      // Figure out where this mip starts
      //     Using Marc Olano's trick:  http://gaim.umbc.edu/2010/05/27/mip-size/
      //     Div by 3 can be skipped since we're packing 3 BC's per logical block
 
      uint2 level_dims = (dims/4) >> mip_index; 
      uint block_id = ((dims.x/4)*(dims.y/4) - level_dims.x*level_dims.y)*4;
 
      // offset from start of level to our block
      block_id += 3*(block_coords.y*level_dims.x + block_coords.x);
 
      // load three consecutive BC4 blocks.  
      //  We can do that in a texel shader, and it might save us some bandwidth
      uint64 Y_block    = g_BCBlocks[block_id];
      uint64 Cr_block  = g_BCBlocks[block_id+1];
      uint64 Cb_block = g_BCBlocks[block_id+2];
 
      // Everything above this line is block-level computation.  Texel-level computation starts below
 
      // Decode the BC blocks.  I'm assuming here a HW instruction that takes a block, decodes it, and
      //       broadcasts the pixels across the SIMD lanes.   
      uint2 texel_offsets = texel_indices - (block_coords*4);   // position within the 4x4 block
      uint8 Y  = BC4Extract( Y_block,  texel_offsets );
      uint8 Cr = BC4Extract( Cr_block, texel_offsets );
      uint8 Cb = BC4Extract( Cb_block, texel_offsets );
 
      uint32 pix = 0; // convert YCC to packed RGB here (boring, omitted for brevity)
 
      // write out converted texel
      BLOCK_OUT[texel_offsets.y][texel_offsets.x] = pix;
}

2 Comments

  1. Kyle

    I have been dreaming about this for a while. It would be the only way to do forward parallax mapping. Right now we have to guess and check with reverse parallax texel positions. A texel shader could rebuild the texture with proper parallax and normals, then just call a simple normal shader for texturing.

    One thing that would be needed is recursive texel block render calls, so you don’t have to render all the texels on a virtual texture, but a block could recursively call neighbor blocks to be processed. This would make the shader much more efficient, because it wouldn’t have to render parts of the “virtual” texture that are not mapped to visible geometry (occlusion), but would still be able to process parts of the source textures that might get deformed into the visible texture.

    This could be done by defining two types of texel shaders. A forward texel shader runs for each texel on an existing (source) texture and dynamically generates a new (destination) virtual texture. A reverse texel shader would run for each texel on a virtual texture and dynamically read from existing textures.

    Also, this would make reprojection trivial.

  2. Simon

    Interesting…

    Don’t forget that you have to store very little state per shader. If you run out of resources for a TS, you can just throw away a PS and restart it later.
