The following are my own personal thoughts and opinions, and are not shared, endorsed, or sanctioned by anybody in particular (especially my employer).
Here is an example of something that’s becoming increasingly common. Using the tessellator to turn a coarse mesh into a very fine sub-pixel mesh. Hardware tessellation has been a key driver for hardware design lately, and has encouraged IHVs to finally parallelize their triangle rasterization.
I can understand the desire to do this. After all, we want to add lots and lots of detail to our scenes, and a displacement map is a fairly efficient data structure (compared to a gigantic mesh), and its soooo easy to use. But there’s a caveat: These triangles are itty bitty, and pixel shaders on itty bitty triangles are very inefficient. On today’s hardware, there is a price to be paid for using triangles that small.
See here for a great analysis of part of the problem (rasterizer granularity). Another important factor is poor pixel shader utilization. Every triangle which is ever rendered always shades at least four pixels, in a 2×2 quad, so that the hardware can use differencing to compute mip levels. For a great primer on this, see Fabien’s excellent series.
A while ago, Stephen Hill wrote this post, which illustrates the problem quite nicely. As triangles get smaller and smaller, more and more quads are underutilized. Here is the pie-chart image from Stephen’s post:
To illustrate the point, let’s take Stephen’s pie chart and calculate how many pixels per quad we shade, on average. Eyeballing the pie chart, I’m going to estimate that about 40% of quads have 1 active pixel, 40% have 2, 8% have 3, and 12% have 4. Average pixels per quad is: 0.4*1 + 0.4*2 + 0.08*3 + 0.12*4 = 1.92 pixels per quad, meaning that 52% of the pixel shading work is wasted. That’s lots and lots of flops. It is probably not hard to construct cases in which the number of wasted pixels is even higher.
It’s worth nothing, off the bat, that deferred shading does not suffer very much from this problem. However, deferred brings other drawbacks with it. For forward rendering scenarios, if we could somehow spend those cycles on useful work, we can recover quite a bit of performance in cases like this.
If we’re going to tessellate so much that our wireframes look solid, one option is for shader writers to push the pixel shading work into the vertex/domain shader. This is vaguely similar to the way Reyes renderers operate, but not quite. This could be implemented on today’s hardware, provided we can find a way to provide our shader with UV derivatives, but it would result in a ton of wasted work shading lots of back-facing and occluded vertices.
Another possibility is quad fragment merging. QFM is an interesting idea, but it has limitations. Merging arbitrary fragments into a quad is liable to cause artifacts in cases where one part of a mesh occludes another part. QFM essentially works around this problem by limiting merging to triangles which are edge adjacent (among other things).
Getting Rid of Quads
There is another, much simpler solution. We could design the hardware so that pixels are tightly packed into SIMD lanes, allowing multiple pixels from multiple triangles to be packed into a single warp/wavefront.
I’m hardly an expert in this, but it seems to me that this wouldn’t be very hard to build (much simpler than QFM). We’d just need a stage in between rasterizer and PS which receives the quad stream, tosses out the helper pixels, and buffers up the rest until it accumulates enough of them to fill a warp/wave. If you want the old 2×2 quad behavior, just don’t drop the helper pixels. This might add a bit of latency to the pipeline, but with a complex enough shader this latency is hidden.
There is, of course, a drawback. If we got rid of the quads, we would no longer be able to use differencing within a quad to obtain gradients, and we really need gradients.
Do we Need Quads?
I think that we need to be willing to think outside the box here. The main reason we are using quads in the first place is to be able to obtain gradients via differencing so that mip level calculation can be performed. It’s also occasionally useful to hand these derivatives to the shader for it to do interesting things with them, but mip level calculation is still the primary motivation.
There are a few categories of texture fetches that I think cover a large majority of cases:
1. Unmippped fetches: Shadow maps, Scene depth tricks (fog-volumes, soft-particles), Small lookup tables
2. Interpolated UVs: Atlased diffuse/spec/normal textures.
3. Functions of Interpolants: UV tiling/animation, cube mapping, screenspace refraction/distortion tricks
4. Dependent Fetches E.G. Spatially varying UV animation.
For point 1, nothing needs to change. Point 2 can be dealt with by enabling the rasterizer feed the gradients to the shader, which could then pass them to the texture unit by using gradient fetch instructions. The rasterizer is doing interpolation, and thus has everything it needs to figure out the derivatives of the interpolants. The cost of each fetch might increase, since fetches with explicit gradients typically cost more than the implicit variety, but in the tiny triangle case this penalty would be offset by fewer fetch instructions executed.
I haven’t tried this, but it might even be possible to use the existing pull model support to derive gradients which are accurate enough for filtering. This part might not even need a hardware change.
Point 3 is trickier, but it can also be dealt with for simple cases. For example, if we are simply translating the UV coordinate in our shader, the gradient is unchanged. If we are scaling UVs, we can scale gradients. If we are using a reflection vector, we can use ray differentials, which the raytracing folks have been doing for years. For more interesting cases, we might be able to get our tools to help us out. It is a little known fact that the HLSL compiler can do symbolic differentiation (and has been able to do so for quite some time, apparently).
Dependent fetches are probably too hard, and if we really need ddx/ddy, then we’re just plain out of luck. But in this case we can always just fall back to using 2×2 quads. The hardware still needs to be able to do this for compatibility, and so we’d still have access to everything we had before.
This feature would be very easy to expose. There could be a gradient free shading bit in the rasterizer state, which can only be turned on with a gradient free shader. It is a simple matter for the compiler to identify shaders that are gradient free. Developers could turn on this bit whenever they knew they had a gradient free shader and expected to have lots of tiny triangles. Runtime would validate gradient-freeness if it liked. This is a very non-invasive feature (the best kind). So non-invasive that IHVs could selectively flip it on in their drivers for legacy titles, if they were so inclined.