Comments on Future Directions for Compute in Graphics

Andrew Lauritzen’s Open Problems talk, as expected, has shaken me from my blogging slumber. I figured his talk would be inspiring enough to do a post on, and he doesn’t disappoint. I think he’s laid out a pretty compelling picture of where hardware and APIs ought to be headed. This post will be a collection of random ideas that come to mind while pondering the slides.

Visibility Buffers and Shading

The presentation starts with a look at visibility buffers for deferred shading (see here and here). This was brought up as an example of how static resource allocation is limiting, which we’ll get to, but I’m going to go on a bit of a tangent to talk about shading itself.

The idea behind these visibility buffers is to render out some identifying information per pixel (material ID, primitive ID, etc), then sort the pixels by material prior to shading. There seems to be a lot of interest in this approach because it tries to fix some of the hang-ups with deferred rendering (G-buffer size and lack of material diversity), while retaining the increased efficiency for lighting and shading. It comes with some acknowledged drawbacks:

  • Need to cache the animated/transformed verts somehow (or re-compute during material passes).
  • No gradients (gotta do triangle setup manually, per pixel, and all your texture fetches are slower because they’re gradient fetches)
  • Need a pass to bin pixels before dispatching shaders
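
The binning pass in that last bullet is essentially a counting sort on material ID. Here is a minimal CPU-side sketch of it, assuming the visibility buffer has already been reduced to one material ID per pixel. The names `MaterialBins` and `bin_pixels_by_material` are made up for illustration, and a real implementation would run as a compute pass rather than a serial loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Result of binning: pixel indices grouped by material, plus where each group starts.
struct MaterialBins {
    std::vector<uint32_t> offsets; // pixels[offsets[m] .. offsets[m+1]) belong to material m
    std::vector<uint32_t> pixels;  // pixel indices, grouped by material
};

// Counting sort of pixel indices by material ID (hypothetical helper).
MaterialBins bin_pixels_by_material(const std::vector<uint32_t>& material_ids,
                                    uint32_t material_count) {
    MaterialBins bins;
    bins.offsets.assign(material_count + 1, 0);
    bins.pixels.resize(material_ids.size());

    // Pass 1: histogram of pixels per material, stored one slot to the right.
    for (uint32_t id : material_ids) {
        assert(id < material_count);
        ++bins.offsets[id + 1];
    }
    // Pass 2: running sum turns the counts into start offsets.
    for (uint32_t m = 0; m < material_count; ++m)
        bins.offsets[m + 1] += bins.offsets[m];
    // Pass 3: scatter each pixel index into its material's range.
    std::vector<uint32_t> cursor(bins.offsets.begin(), bins.offsets.end() - 1);
    for (uint32_t p = 0; p < material_ids.size(); ++p)
        bins.pixels[cursor[material_ids[p]]++] = p;
    return bins;
}
```

Each material's range can then be shaded with one dispatch sized to `offsets[m + 1] - offsets[m]` pixels.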

If we’re so frustrated by deferred’s limits that we’re seriously talking about sorting all the pixels, then it might be time to back up a bit and take another look at forward, because it’s becoming trendy again, and it does this whole sort-by-material-and-then-shade thing really well.

Forward does the occupancy binning at drawcall granularity, which has a smaller n than pixel granularity. It lets us avoid saving off VS results by using the existing VS->Setup->PS data path. It gives us access to hardware triangle setup, it gives us MSAA if we want, which in turn gives us a form of free adaptive super-sampling, and it does it all without big G-buffers.

There are some corresponding problems with forward, of course:

  • Quad efficiency: Solvable in hardware with quad free shading, which also requires manual derivatives, but you’ve already offered to do that : )
  • Redundant VS work on back-facing geo: Solvable in hardware with position only shading.
  • Overdraw: Solvable with Z prime, or in hardware through TBDR.
  • Missing depth information for SSR, SSAO: Also solvable with Z prime, or by putting these algorithms in object space where they belong.

Forward may be a losing proposition on current hardware, but if this talk is about things we want to make the hardware guys do for us, then getting them to fix forward rendering is certainly on the table.

Getting back on topic: The problem that visibility buffers encounter is that you need big uber-shaders for all your materials, and uber-shaders have this annoying property of forcing every control flow path to run at worst-case occupancy. Tomas presented a neat way around this, which was to bin the shaders by occupancy. Andrew points out that this is intractable across vendors, but we can easily build a dynamic dispatch API to fix this. Here’s the recipe:

Define Functions as a first class API object. Make an API for obtaining function addresses and passing them to shaders, perhaps using a descriptor-like model. Make an API to query which occupancy bucket a particular caller/callee pair needs to go in, another one to set the occupancy at dispatch time, and a stipulation that you’ll be TDRing if you screw up and allow a shader to execute a path that’s in the wrong occupancy bucket.
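
To make the recipe a little more concrete, here is a rough sketch of the bookkeeping it implies. Everything in it is hypothetical: no shipping API exposes `ShaderFunction`, `occupancy_bucket`, or `dispatch_is_safe`. It also assumes a simplified GCN-style rule (256 VGPRs per lane, at most 10 waves per SIMD) and that a callee's registers stack on top of the caller's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical first-class function object: all the app knows is its register cost.
struct ShaderFunction {
    uint32_t vgprs; // registers the function body needs, as reported by the compiler
};

// Simplified GCN-style occupancy rule (assumption: ignores SGPR and LDS limits).
uint32_t waves_per_simd(uint32_t vgprs) {
    assert(vgprs > 0);
    return std::min<uint32_t>(10, 256 / vgprs);
}

// "Which occupancy bucket does this caller/callee pair need?"
// Assumption: the callee's registers are live on top of the caller's.
uint32_t occupancy_bucket(const ShaderFunction& caller, const ShaderFunction& callee) {
    return waves_per_simd(caller.vgprs + callee.vgprs);
}

// The check the app has to get right before setting occupancy at dispatch time.
// Returning false here models the "you'll be TDRing" case.
bool dispatch_is_safe(uint32_t dispatch_waves, const ShaderFunction& caller,
                      const std::vector<ShaderFunction>& possible_callees) {
    for (const ShaderFunction& callee : possible_callees)
        if (dispatch_waves > occupancy_bucket(caller, callee))
            return false;
    return true;
}
```

Note that the app has to enumerate every callee a dispatch might reach, which is exactly the per-pair bookkeeping that makes this awkward.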

I worry that by the time we’re done, this feature is probably going to look more like the ill-fated DX11 class linkage API than C++. We’d need to do some manual setup for each potential caller/callee relationship, and we’d have to book-keep all of this on the app side. I don’t see how it’d be practical to support indirect calls at arbitrary nesting depth, and polymorphic class hierarchies, without making a giant mess of it. There’s a risk that it will turn into yet another expensive, narrowly scoped feature (like the UAV counters).

Improving Occupancy

Moving on, there was some discussion of ways we can remove some of the barriers to good GPU occupancy. One of the possibilities was to use the cache and spill more. As Andrew notes, this probably won’t pan out. The numbers really don’t work. If I’m at 50% occupancy on GCN and I want to get to full occupancy, I need to spill 24 registers (6KB per wave), and if I have 40 waves in a CU doing this, then even with Nvidia’s shiny new 128K L1$, I’m screwed.
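
The arithmetic behind those numbers is easy to check. This is a small sketch, assuming 64 lanes per wave and 4 bytes per register per lane; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kLanesPerWave = 64;    // GCN wavefront width
constexpr uint32_t kBytesPerRegister = 4; // one 32-bit VGPR, per lane

// Bytes one wave has to spill to give up the given number of VGPRs.
constexpr uint32_t spill_bytes_per_wave(uint32_t spilled_vgprs) {
    return spilled_vgprs * kLanesPerWave * kBytesPerRegister;
}

// Total spill footprint when every resident wave in a CU spills the same amount.
constexpr uint32_t spill_bytes_per_cu(uint32_t spilled_vgprs, uint32_t waves) {
    return spill_bytes_per_wave(spilled_vgprs) * waves;
}

// True if the spilled registers of all resident waves fit in a cache of the given size.
bool spills_fit(uint32_t spilled_vgprs, uint32_t waves, uint32_t cache_bytes) {
    assert(waves > 0);
    return spill_bytes_per_cu(spilled_vgprs, waves) <= cache_bytes;
}

static_assert(spill_bytes_per_wave(24) == 6 * 1024, "24 VGPRs is 6KB per wave");
static_assert(spill_bytes_per_cu(24, 40) == 240 * 1024, "40 waves need 240KB");
```

So 40 waves spilling 24 registers each want 240KB, which is nearly double a 128K L1, before counting anything else that wants to live in the cache.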

Another possibility is allowing hardware to run at higher thread count but with a narrower SIMD width. I’ve seen repeatedly in my bloggings that doing less work per wave can sometimes be more efficient. It makes Intel’s GS implementation very compelling, and I’ve seen it improve shader performance in counter-intuitive ways (see here and here). Andrew asked whether the SIMD width abstraction does more harm than good, and while I more or less agree, I should point out that allowing implementations to vary width is perhaps an under-utilized source of good.

Dynamic Resource Allocation Is Closer Than It Seems

We might be able to sneak out some version of dynamic register allocation by looking at variation in resource needs over the lifetime of a shader. The real working set of a shader is not a fixed constant like the occupancy binning suggests it is. For example, a PCF filter might pull in a large number of texels and use a lot of registers at first, but as the filter is applied, data is consumed, and some of the registers might become unused later on. So, you can imagine cutting a shader up into pieces based on occupancy, going wider in some places and narrower in others, and arranging them in a DAG, as shown below (note that those wave counts are complete BS guesswork). I also like to link the GRAMPS paper in this blog whenever possible.

It actually seems feasible to implement this on today’s hardware with a minor change. You launch an N-wave group, where N is the occupancy of the most parallel node. Let’s call these logical waves. You go through the nodes in some fixed order. Upon reaching a node with occupancy M, you barrier the waves, adjust the GPR allocations, and let M physical waves go through. For M=N, physical and logical waves are one to one. For M < N, we have the physical waves iterate over the logical waves, and turn the rest of them off. The only thing that keeps this from being implementable today is the fact that we can't seem to write the VGPR_BASE and VGPR_SIZE registers from inside a wave. Red Team: Get to work guys. We expect a prototype by next Siggraph : )
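
Here is a sketch of the logical-to-physical mapping described above, as a CPU-side model only. The round-robin assignment is an assumption (any fixed partition would do), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For a node running at occupancy M inside an N-wave group, list which logical
// waves each of the M physical waves iterates over (round-robin, by assumption).
std::vector<std::vector<uint32_t>> assign_logical_waves(uint32_t n_logical,
                                                        uint32_t m_physical) {
    assert(m_physical > 0 && m_physical <= n_logical);
    std::vector<std::vector<uint32_t>> work(m_physical);
    for (uint32_t logical = 0; logical < n_logical; ++logical)
        work[logical % m_physical].push_back(logical);
    return work;
}

// Number of times the busiest physical wave has to run the node's code.
uint32_t passes_needed(uint32_t n_logical, uint32_t m_physical) {
    assert(m_physical > 0 && m_physical <= n_logical);
    return (n_logical + m_physical - 1) / m_physical;
}
```

With M = N every physical wave gets exactly one logical wave. With M < N the remaining N - M waves sit idle at the barrier while the active ones loop.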

We also need a place to store data which flows between nodes. LDS might suffice, although it’s a bit too small if we want to try and use all 40 wave slots. We could also imagine setting aside another portion of the register file which is not re-allocated at the barriers. This is probably more invasive, because you now have two different register spaces leaking into the ISA, but there might be a way to make it work.

Whether or not this is useful depends on the “occupancy profile” of your average shader. I have no data, but it would be neat to go get it. If most shaders generate lots of live values and hold onto them over their entire lifetime, then this is not going to help anything, but if there are periods where the register footprint is small, then spikes, then falls back off, then cutting them up could allow better occupancy during the less constrained periods. We might be able to resurrect some old-school multi-pass partitioning research to automate the shader cutting.
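
One way to picture an occupancy profile: take a trace of live register counts over the shader, then compare the occupancy the whole shader gets (set by its peak) with what each piece would get if it were cut at chosen points. This is a sketch under a simplified GCN-style rule (256 VGPRs per lane, at most 10 waves per SIMD, no allocation granularity), and the trace values in the usage note are invented, not measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified occupancy rule (assumption: ignores SGPR and LDS limits).
uint32_t waves_per_simd(uint32_t vgprs) {
    assert(vgprs > 0);
    return std::min<uint32_t>(10, 256 / vgprs);
}

// Occupancy of the uncut shader: limited by its peak register count.
uint32_t whole_shader_occupancy(const std::vector<uint32_t>& live_vgprs) {
    assert(!live_vgprs.empty());
    return waves_per_simd(*std::max_element(live_vgprs.begin(), live_vgprs.end()));
}

// Occupancy of each piece when the trace is cut before the given indices.
std::vector<uint32_t> segment_occupancy(const std::vector<uint32_t>& live_vgprs,
                                        const std::vector<std::size_t>& cuts) {
    std::vector<uint32_t> result;
    std::size_t begin = 0;
    auto close_segment = [&](std::size_t end) {
        assert(begin < end && end <= live_vgprs.size());
        uint32_t peak = *std::max_element(live_vgprs.begin() + begin,
                                          live_vgprs.begin() + end);
        result.push_back(waves_per_simd(peak));
        begin = end;
    };
    for (std::size_t cut : cuts) close_segment(cut);
    close_segment(live_vgprs.size());
    return result;
}
```

For a made-up trace like `{16, 20, 64, 60, 24, 12}`, the uncut shader is pinned at the occupancy of its 64-register spike, while cutting before indices 2 and 4 lets the first and last pieces run at the cap.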

Dynamic Parallelism Seems Further Off

The problem with what I just laid out is that it ties up waves which might be better used executing some low-pressure thing that just happens to fit in the leftover space. What we’re really after is proper nested and dynamic parallelism, the ability to launch and retire waves on the fly, dynamically shuffling threads between waves, recursion, forking, spooning, and all the other fun stuff. We could have the appearance of this today if we want, but if it’s really just syntactic sugar around indirect dispatch, then it doesn’t actually buy us much.

It’s the right long-term goal, for sure, but I don’t think we can count on any of these things materializing until we manage to make the hardware more elastic. Phase one is for us to try and convince the hardware guys that they should stop naming hardware blocks after API shader stages, and implement the whole thing on top of a general framework. In the long run, a general purpose machine is better for them than a giant black box where each driver gets its own special set of knobs.

Language Basics

From a programming point of view, it would be fantastic if I could just take my C++ code and execute it directly on the GPU. If I could write all my code the same way using the same constructs. That would be a really elegant, convenient way to program. There was a time not too long ago when we made very significant hardware and software changes in order to provide an elegant, convenient way to program, and the result was the geometry shader : )

I see great potential here for shooting ourselves in the foot. I agree that shader programming is deficient and awkward in many ways compared to C++, but I think that most of the important issues can be fixed in the toolchain. I’d like to see changes in how the shader stages are defined, but we don’t need to make any huge shifts in how individual stages are authored.

We definitely need things like a proper type system, bytes, halfs, shorts, pants, and struct layout compatibility between CPU/GPU. I’m on the fence about whether we need real pointers. I don’t mind them, but if the compiler guys have any objections I’ll quickly change my mind. We also need static polymorphism of some kind (templates or lambdas or something). Separate compilation and linking would be nice to have, but I have to point out that HLSL has had linking for a while now, and it seems it’s not getting used much.

I want to raise a big red flag about physically separate functions, where chunks of code are compiled once and reused at execution time. The potential benefits from real (physical) code re-use are reduced code size, recursion, and the ability to implement C++. Are these benefits worth the effort and risk? Is code size a serious problem for anybody right now? Below is a small example to illustrate how performance can go wrong:

// It's object-oriented.  How modern :)
class TextureLayer : implements ITextureLayer
{
   Texture2D<float4> tx;
   sampler sampler;
   float2 uv_scale;
   float4 multiplier;
   virtual float4 Sample( float2 uv );
};

float4 TextureLayer::Sample( float2 uv )
{
   return this->tx.Sample( this->sampler, uv*this->uv_scale )*this->multiplier;
   // compiles to:
   //     descriptor load
   //     muls
   //     WAIT
   //     sample
   //     WAIT
   //     muls
   //     return
}

// Assume 3 or 4 other TextureLayer subclasses that do polymorphic things
//    (procedurals, adapters, and stuff).
//  Note that "virtual" has nothing to do with the problem.
// The same issue occurs with separate compilation and static dispatch.
//  I'm just trying to show a plausible code structure which has this problem.

//*********** Other translation unit *************
#include "TextureLayers.h"
ITextureLayer *l1, *l2, *l3;
float4 main( float2 uv )
{
   // Want:  3 fetches, pipelined, with one wait, followed by some mads
   //  Get:  3 fetches, not pipelined, 3 waits, muls and adds as separate instructions, and also some jumping
   return l1->Sample(uv) + l2->Sample(uv) + l3->Sample(uv);
}

I see great potential for death by 1024 cuts. Modularity, without link time optimization or JIT, is at odds with efficient code. We can mitigate this risk with very disciplined code organization, the same way we do with C++, but it’s easy to get sloppy here, particularly if you want to allow artists and game designers to touch the shaders, or if we’re inclined to OOP all over the CPU code and decide to write the shaders in the same way.
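
The want/get gap in the shader example can be put into a toy cost model. This is only a sketch: the `Costs` fields and any numbers fed into them are invented, fetch latencies are assumed to overlap perfectly once inlined, and the descriptor load is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Invented cycle costs; plug in whatever is plausible for the target hardware.
struct Costs {
    uint32_t fetch_latency; // cycles stalled waiting on one texture fetch
    uint32_t alu;           // cycles for one mul, add, or mad
    uint32_t call_overhead; // cycles of jumping around per physical call
};

// Separately compiled Sample(): every call scales uv, stalls for its own
// fetch, multiplies, and returns. The adds then happen as separate ops.
uint32_t cycles_separate_calls(uint32_t layers, Costs c) {
    assert(layers > 0);
    uint32_t per_call = c.call_overhead + c.alu + c.fetch_latency + c.alu;
    return layers * per_call + (layers - 1) * c.alu;
}

// Fully inlined: all uv scales are done up front, the fetches are in flight
// together behind a single wait, then one mul plus (layers - 1) mads.
uint32_t cycles_inlined(uint32_t layers, Costs c) {
    assert(layers > 0);
    return layers * c.alu + c.fetch_latency + layers * c.alu;
}
```

In this model the separate-call cost grows with `layers * fetch_latency`, while the inlined version pays the latency once, so the gap widens as more layers are stacked.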

We tolerate the slop in CPU-land because we have out of order cores to help us, and because we have soooo much code up there that we really do need to worry about how big it gets. We also just don’t care sometimes, because a large fraction of the code isn’t that critical, and we’d rather it be modular and flexible because the scales tilt that way. For the parts that are critical, we tend to want to dial back the abstraction and encapsulation.

The situation in GPU-land is different. There is no massive 90% of code whose performance doesn’t matter. We’re trying to do gigaflops worth of work on a 16ms budget, so it all matters, and I think we might be taking for granted just how much the inline everything model has been protecting us from ourselves.

I think the right near-term strategy is to assume fully inlined kernels and have the languages and APIs embrace it. We can have the appearance of separate compilation at the source level, but keep linking and JIT at the API and hardware. We should also do everything we can to push a more generalized, reconfigurable graphics pipeline, because, while we may not be ready for full Cilk, we’re getting closer.