Beating The GPU At Its Own Game (By Stacking the Deck)

Let’s begin with a little lecture on parallel rendering. Broadly speaking, there are three basic approaches to designing a parallel rasterization architecture. They differ mainly in how they approach the problem of going wide on rendering while preserving the strictly ordered fragment processing that APIs require. You can, and should, read about all of them in the seminal paper here.

Naturally, there are lots of hybrids and variations and “in-betweens”, but these three categories capture the most relevant ideas, and they’ve all been implemented by somebody, somewhere.

Parallel Rendering Taxonomy

Sort-First: The screen is subdivided amongst processors, and primitives are distributed to rasterizers in advance. This is the kind of thing you might expect to see done for multi-GPU setups (e.g. for display walls or split-frame rendering). Normally the work distribution would be implemented at the application level using something coarse-grained like AABB or bounding sphere overlap.

Sort-Middle: Work distribution happens after geometry processing and before pixel processing. Modern GPUs are basically all sort-middle architectures of one sort or another. Primitives are rasterized in order, either fully serially or by parallel screen-partitioned rasterizers, and pixel work is farmed out to multiple threads, with synchronization mechanisms at the back end to ensure that blending happens in the order it should.

Sort-Last: Primitives are transformed and rasterized in parallel, and the rasterizers transmit fragments to screen-partitioned “compositing processors” whose job it is to Z-test, blend, and get the pixels in the right order. In the most extreme variant, known as sort-last image, independent processors each rasterize a subset of the primitives, producing complete color and Z buffers, which are then re-combined in image space as a post-process. PixelFlow is perhaps the best-known example.

Software Rasterization On GPUs

There has been a bit of research recently into using GPU compute to do rasterization, the best example of which is probably Laine and Karras.

Laine and Karras, and their predecessors, have been trying to replicate the graphics pipeline in its entirety. I’ve always been kind of skeptical of this type of research, since the results are generally measured in multiples of how much slower it goes. It’s always struck me as an interesting way to explore the boundaries of GPU compute, but never something that would be directly useful.

But I started thinking about depth-only rendering. This is a much simpler problem than the general case, for a variety of reasons:

  1. It’s order-independent. As long as we can resolve depth comparisons, we can rasterize in whatever order we like.
  2. There is only one output image, not two. We don’t need to keep color in sync with depth.
  3. Not much per-pixel work. Just depth interpolation. Shader engines are just sitting there waiting to be used.

What if we implemented this using a sort-last approach, where compute threads rasterize lots of triangles at once and fire atomic operations into the Z buffer? How would this compare to a very fast but serialized hardware rasterizer? We can write such a shader if we do some gymnastics. D3D doesn’t support atomics on float data, but luckily, non-negative IEEE floats order the same way whether you compare them as floats or as integers, so we can re-interpret depth to uint and use InterlockedMin(), and everything will be fine.
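Concretely, the depth write ends up being something like this (a minimal sketch; the buffer name and the addressing scheme are placeholders of mine, not necessarily what the real shader does):

    // Depth stored as uint in a raw UAV, cleared to asuint(1.0f) (far plane).
    RWByteAddressBuffer gDepthBuffer;   // assumed resource name

    void WriteDepth(uint2 pixel, uint screenWidth, float z)
    {
        // For non-negative IEEE-754 floats, the bit patterns order the same
        // way as the values, so an integer min is a valid depth test.
        uint address = 4 * (pixel.y * screenWidth + pixel.x);
        gDepthBuffer.InterlockedMin(address, asuint(z));
    }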

The shader is here. Full test app available here.
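For the curious, here is roughly what the per-thread loop looks like, stripped down (a simplified sketch with made-up resource names, not the actual shader linked above; it assumes positions have already been transformed to screen space and, as discussed below, does no clipping or snapping):

    // A stripped-down one-triangle-per-thread depth rasterizer (sketch only).
    RWByteAddressBuffer gDepthBuffer;          // uint depth per pixel, cleared to asuint(1.0f)
    StructuredBuffer<float2> gScreenPos;       // post-viewport-transform xy, 3 per triangle
    StructuredBuffer<float>  gDepth;           // post-projection z in [0,1], 3 per triangle

    cbuffer Constants
    {
        uint gTriangleCount;
        uint gScreenWidth;
        uint gScreenHeight;
    };

    // Signed area of the parallelogram (b-a) x (p-a): positive when p is to
    // the left of the directed edge a->b.
    float EdgeFunction(float2 a, float2 b, float2 p)
    {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    [numthreads(64, 1, 1)]
    void main(uint3 tid : SV_DispatchThreadID)
    {
        if (tid.x >= gTriangleCount)
            return;

        // One triangle per thread.
        float2 v0 = gScreenPos[3 * tid.x + 0];
        float2 v1 = gScreenPos[3 * tid.x + 1];
        float2 v2 = gScreenPos[3 * tid.x + 2];
        float  z0 = gDepth[3 * tid.x + 0];
        float  z1 = gDepth[3 * tid.x + 1];
        float  z2 = gDepth[3 * tid.x + 2];

        float area = EdgeFunction(v0, v1, v2);
        if (area <= 0)
            return;                            // back-facing (under the assumed winding) or degenerate

        // Walk the screen-clamped bounding box.
        float2 screenMax = float2(gScreenWidth, gScreenHeight) - 1;
        uint2 bbMin = uint2(clamp(floor(min(v0, min(v1, v2))), 0, screenMax));
        uint2 bbMax = uint2(clamp(ceil (max(v0, max(v1, v2))), 0, screenMax));

        for (uint y = bbMin.y; y <= bbMax.y; y++)
        {
            for (uint x = bbMin.x; x <= bbMax.x; x++)
            {
                // Floating-point coverage test at the pixel center (hence the pinholes).
                float2 p  = float2(x, y) + 0.5f;
                float  w0 = EdgeFunction(v1, v2, p);
                float  w1 = EdgeFunction(v2, v0, p);
                float  w2 = EdgeFunction(v0, v1, p);
                if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                {
                    // Barycentric depth interpolation, then the integer-min depth test.
                    float z = (w0 * z0 + w1 * z1 + w2 * z2) / area;
                    gDepthBuffer.InterlockedMin(4 * (y * gScreenWidth + x), asuint(z));
                }
            }
        }
    }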

My rasterizer has numerous problems:

  1. It doesn’t do any clipping. It will explode if primitives land behind the eye.
  2. It doesn’t do any sub-pixel snapping, and it uses floating-point coverage tests, which means it’ll have pinholes.
  3. It has no vertex re-use cache, and no easy way to add one, so it’s probably going to do about 2-3x more transform work than the fixed-function path.
  4. There’s lots of low-hanging optimization fruit in the code still: Spurious float->int conversions, a viewport transform that I could fold into the view-projection transform, and so on.
  5. Rasterizing one triangle per thread will lead to utilization hell if there is significant variation in triangle size.

None of the stuff in point 4 seems to matter much. The atomic operations seem to be what hurts the most.

Points 1 and 2 are probably ok, since this is only a test. You couldn’t do a production rasterizer this way, of course, but I can always test with friendly data. I haven’t taken a stab at a fixed-point implementation but I suspect it won’t add enough overhead to skew my results too badly.
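For reference, the standard fix for point 2 (which I haven’t implemented) is to snap screen-space positions to a fixed sub-pixel grid before edge setup, so the coverage tests can be done with exact integer arithmetic. A hypothetical helper might look like this, with the 8 bits of sub-pixel precision being just a plausible guess:

    // Hypothetical: snap to a fixed-point grid with 8 fractional bits so that
    // edge tests become exact integer arithmetic and adjacent triangles can't
    // leave pinholes between them.
    int2 SnapToGrid(float2 screenPos)
    {
        return int2(round(screenPos * 256.0f));
    }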

Lack of clipping is a problem, but Olano and Greer fixed that for us.

Since I’m choosing friendly data, point 5 is also not so bad, for now. This problem might be fixable with some clever programming (e.g. stuff triangles in groupshared memory and bucket by size). I’ll leave that for “future work”.
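To make that idea concrete, here is one way the bucketing could look (an untested sketch, using the same made-up resource names as the earlier one; the 64-triangle batch size and the 1024-pixel “big triangle” threshold are arbitrary guesses):

    #define GROUP_SIZE 64

    RWByteAddressBuffer gDepthBuffer;          // uint depth, cleared to asuint(1.0f)
    StructuredBuffer<float2> gScreenPos;       // screen-space xy, 3 per triangle
    StructuredBuffer<float>  gDepth;           // depth in [0,1], 3 per triangle
    cbuffer Constants { uint gTriangleCount; uint gScreenWidth; uint gScreenHeight; };

    groupshared uint gsBigList[GROUP_SIZE];    // indices of large triangles in this batch
    groupshared uint gsBigCount;

    float EdgeFunction(float2 a, float2 b, float2 p)
    {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    // Rasterize every 'stride'-th pixel of the triangle's bounding box, starting at 'first'.
    void RasterizeStrided(uint tri, uint first, uint stride)
    {
        float2 v0 = gScreenPos[3 * tri + 0], v1 = gScreenPos[3 * tri + 1], v2 = gScreenPos[3 * tri + 2];
        float  z0 = gDepth[3 * tri + 0],     z1 = gDepth[3 * tri + 1],     z2 = gDepth[3 * tri + 2];
        float area = EdgeFunction(v0, v1, v2);
        if (area <= 0)
            return;
        float2 screenMax = float2(gScreenWidth, gScreenHeight) - 1;
        uint2 bbMin  = uint2(clamp(floor(min(v0, min(v1, v2))), 0, screenMax));
        uint2 bbMax  = uint2(clamp(ceil (max(v0, max(v1, v2))), 0, screenMax));
        uint2 bbSize = bbMax - bbMin + 1;
        for (uint i = first; i < bbSize.x * bbSize.y; i += stride)
        {
            uint2  pix = bbMin + uint2(i % bbSize.x, i / bbSize.x);
            float2 p   = float2(pix) + 0.5f;
            float w0 = EdgeFunction(v1, v2, p), w1 = EdgeFunction(v2, v0, p), w2 = EdgeFunction(v0, v1, p);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                gDepthBuffer.InterlockedMin(4 * (pix.y * gScreenWidth + pix.x),
                                            asuint((w0 * z0 + w1 * z1 + w2 * z2) / area));
        }
    }

    [numthreads(GROUP_SIZE, 1, 1)]
    void main(uint3 tid : SV_DispatchThreadID, uint gtid : SV_GroupIndex)
    {
        if (gtid == 0)
            gsBigCount = 0;
        GroupMemoryBarrierWithGroupSync();

        if (tid.x < gTriangleCount)
        {
            // Classify by bounding-box area: small triangles are handled one per
            // thread as before, large ones go into the groupshared list.
            float2 v0 = gScreenPos[3 * tid.x + 0], v1 = gScreenPos[3 * tid.x + 1], v2 = gScreenPos[3 * tid.x + 2];
            float2 bb = max(v0, max(v1, v2)) - min(v0, min(v1, v2));
            if (bb.x * bb.y <= 1024.0f)
            {
                RasterizeStrided(tid.x, 0, 1);
            }
            else
            {
                uint slot;
                InterlockedAdd(gsBigCount, 1, slot);
                gsBigList[slot] = tid.x;
            }
        }
        GroupMemoryBarrierWithGroupSync();

        // All 64 threads cooperate on each large triangle, striding over its pixels.
        for (uint i = 0; i < gsBigCount; i++)
            RasterizeStrided(gsBigList[i], gtid, GROUP_SIZE);
    }

The obvious catch is that each group still waits on its largest triangle, so something smarter than per-group binning would probably be needed in practice.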

Results

I ran a bunch of experiments on my laptop, which has an i3-4010U (Haswell GT1 GPU). If other people want to post results from other GPUs, I’ll be happy to collate them. I rendered Utah teapots in windowed mode at 1366×745 pixels with gradually increasing triangle counts. After producing a depth buffer using either the fixed-function rasterizer or the compute shader, I use a quad to copy depth into the backbuffer so we can see it. I do this copy for both the compute and non-compute tests, so it hurts them both equally.

Let’s see how badly this sucks:

[Graph: depth rendering time, fixed-function rasterizer vs. compute shader, as triangle count increases]

Not nearly as badly as I expected. In fact, throw enough triangles in here and I’m… winning? That’s not supposed to happen.

Rambling Discussion

OK, so maybe it’s possible to outperform hardware rasterization, at least on this chip, with the right primitive distribution and no color buffer. What to make of this result? Does this mean that HW rasterization is on the way out, and that compute will take its place? I think not, for the following reasons:

  1. Every shader op spent on rasterization is a shader op that can’t be used for, you know, shading.
  2. This is not very power-efficient.
  3. This is easy to do, but a pain in the neck to do well.
  4. The hardware is still winning most of the time.
  5. Adding a color buffer and shading would be an enormous can of worms.

I could imagine somebody trying this on consoles to eke out more perf for shadow maps, but most will probably prefer to do something interesting with async compute instead, since this will allow the HW rasterizer and shader pipes to run in tandem.

What I do think this shows is that there might be value in looking at alternative rasterization modes in the hardware for order-independent cases. In my tests, I’m comparing 1120 really slow rasterizers (10 EUs × 7 threads × SIMD16) against one really fast one. The slow ones are ugly software-based upstarts. The really fast one has an entire ASIC and memory sub-system designed around its every need. Given the same degree of hardware support, a few dozen really fast ones might be really compelling for workloads with the right characteristics (low per-pixel load, high primitive counts, unordered). Most of our real-time rendering workloads do not fit this description, but there are some which do.

2 Comments

  1. Nice post, and very interesting results! Here are the results that I gathered from my GTX 970, running at 1920×1080 resolution:

    69696 tris. Rasterizer: 0.209952 ms – Compute: 1.157376 ms – Slowdown: 5.512574
    107584 tris. Rasterizer: 0.258944 ms – Compute: 0.215232 ms – Slowdown: 0.831191
    166464 tris. Rasterizer: 0.436512 ms – Compute: 0.216640 ms – Slowdown: 0.496298
    270400 tris. Rasterizer: 0.472768 ms – Compute: 0.178272 ms – Slowdown: 0.377081
    419904 tris. Rasterizer: 0.369088 ms – Compute: 0.233472 ms – Slowdown: 0.632565
    529984 tris. Rasterizer: 0.572672 ms – Compute: 0.211776 ms – Slowdown: 0.369803
    652864 tris. Rasterizer: 0.685184 ms – Compute: 0.273088 ms – Slowdown: 0.398562
    788544 tris. Rasterizer: 0.618720 ms – Compute: 0.285312 ms – Slowdown: 0.461133
    937024 tris. Rasterizer: 0.780448 ms – Compute: 0.274592 ms – Slowdown: 0.351839
    1098304 tris. Rasterizer: 0.372032 ms – Compute: 0.310464 ms – Slowdown: 0.834509
    1272384 tris. Rasterizer: 1.235520 ms – Compute: 0.316640 ms – Slowdown: 0.256281
    1459264 tris. Rasterizer: 2.139776 ms – Compute: 0.521824 ms – Slowdown: 0.243869

    I actually modified the code to run all of the tests sequentially, and to use GPU timestamp queries to isolate the cost of the depth buffer generation. I also tried removing the check in the shader before the InterlockedMin, and got these results:

    18496 tris. Rasterizer: 0.246592 ms – Compute: 0.268224 ms – Slowdown: 1.087724
    69696 tris. Rasterizer: 0.189568 ms – Compute: 0.190016 ms – Slowdown: 1.002363
    107584 tris. Rasterizer: 0.246336 ms – Compute: 0.183968 ms – Slowdown: 0.746817
    166464 tris. Rasterizer: 0.368128 ms – Compute: 0.167168 ms – Slowdown: 0.454103
    270400 tris. Rasterizer: 0.541184 ms – Compute: 0.167424 ms – Slowdown: 0.309366
    419904 tris. Rasterizer: 0.385408 ms – Compute: 0.243168 ms – Slowdown: 0.630937
    529984 tris. Rasterizer: 0.592096 ms – Compute: 0.250112 ms – Slowdown: 0.422418
    652864 tris. Rasterizer: 0.670912 ms – Compute: 0.289024 ms – Slowdown: 0.430793
    788544 tris. Rasterizer: 0.813376 ms – Compute: 0.302112 ms – Slowdown: 0.371430
    937024 tris. Rasterizer: 0.830560 ms – Compute: 0.349408 ms – Slowdown: 0.420690
    1098304 tris. Rasterizer: 0.483872 ms – Compute: 0.381568 ms – Slowdown: 0.788572
    1272384 tris. Rasterizer: 1.777664 ms – Compute: 0.369056 ms – Slowdown: 0.207607
    1459264 tris. Rasterizer: 2.842208 ms – Compute: 0.459744 ms – Slowdown: 0.161756
