Let’s begin with a little lecture on parallel rendering. Broadly speaking, there are three basic approaches to designing a parallel rasterization architecture. They differ mainly in how they approach the problem of going wide on rendering while preserving the strictly ordered fragment processing that APIs require. You can, and should, read about all of them in the seminal paper here.
Naturally, there are lots of hybrids and variations and “in-betweens”, but these three categories in themselves capture most of the most relevant ideas, and they’ve all been implemented by somebody, somewhere.
Parallel Rendering Taxonomy
Sort-First: The screen is subdivided amongst processors, and primitives are distributed to rasterizers in advance. This is the kind of thing you might expect to see done for multi-GPU setups (e.g. for display walls or split-frame rendering). Normally the work distribution would be implemented at the application level using something coarse-grained like AABB or bounding sphere overlap.
Sort-Middle: Work-distribution happens after geometry processing and before pixel processing. Modern GPUs are basically all sort-middle architectures of one sort or another. Primitives are rasterized in order, either fully serially or by parallel screen-partitioned rasterizers, and pixel work is farmed out to multiple threads, with synchronization mechanisms at the back to ensure that the blending happens in the way it should.
Sort-Last: Primitives are transformed and rasterized in parallel, and the rasterizers transmit fragments to screen-partitioned “compositing processors” whose job it is to Z-test and blend and get the pixels in the right order. In the most extreme variant, known as Sort-Last-Image, independent processors each rasterize subsets of the primitives, producing complete color and z buffers, which are then re-combined in image space as a post-process. PixelFlow is perhaps the best known example.
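To make the sort-first case concrete, here is a rough CPU-side sketch of the coarse work distribution described above: each item's screen-space bounding box is tested against a grid of screen regions, one region per rasterizer. The names (`Bounds2D`, `assignToRegions`) and the equal-sized grid are my own assumptions for illustration, not something taken from a particular system.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Screen-space bounding box of one object or primitive batch, in pixels.
struct Bounds2D { float minX, minY, maxX, maxY; };

// Hypothetical helper: splits a width x height screen into tilesX x tilesY
// equal regions and returns, for each region, the indices of the items whose
// bounds overlap it. An item that straddles a boundary goes to every region
// it touches, which is the duplicated work that sort-first has to accept.
std::vector<std::vector<int>> assignToRegions(const std::vector<Bounds2D>& items,
                                              int width, int height,
                                              int tilesX, int tilesY) {
    assert(width > 0 && height > 0 && tilesX > 0 && tilesY > 0);
    std::vector<std::vector<int>> regions(std::size_t(tilesX) * tilesY);
    const float tileW = float(width) / tilesX;
    const float tileH = float(height) / tilesY;
    for (int i = 0; i < int(items.size()); ++i) {
        const Bounds2D& b = items[i];
        // Skip anything that is entirely off screen.
        if (b.maxX < 0 || b.maxY < 0 || b.minX >= width || b.minY >= height) continue;
        // Clamp in float first so huge coordinates cannot overflow the int cast.
        const int x0 = int(std::clamp(b.minX / tileW, 0.0f, float(tilesX - 1)));
        const int x1 = int(std::clamp(b.maxX / tileW, 0.0f, float(tilesX - 1)));
        const int y0 = int(std::clamp(b.minY / tileH, 0.0f, float(tilesY - 1)));
        const int y1 = int(std::clamp(b.maxY / tileH, 0.0f, float(tilesY - 1)));
        for (int ty = y0; ty <= y1; ++ty)
            for (int tx = x0; tx <= x1; ++tx)
                regions[std::size_t(ty) * tilesX + tx].push_back(i);
    }
    return regions;
}
```

With a 200×100 screen split into two side-by-side regions, a box spanning x = 90..110 lands in both lists, while a box entirely left of x = 100 lands only in the first.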
Software Rasterization On GPUs
There has been a bit of research recently into using GPU-compute to do rasterization, the best example of which is probably Laine and Karras.
Laine and Karras, and their predecessors, have been trying to replicate the full graphics pipeline in its entirety. I’ve always been kind of skeptical of this type of research, since the results are generally measured in multiples of how much slower it goes. It’s always struck me as an interesting way to explore the boundaries of GPU-compute, but never something that would be directly useful.
But I started thinking about depth-only rendering. This is a much simpler problem than the general case, for a variety of reasons:
- It’s order-independent. As long as we can resolve depth comparisons, we can rasterize in whatever order we like
- There is only one output image, not two. We don’t need to keep color in sync with depth
- Not much per-pixel work. Just depth interpolation. Shader engines are just sitting there waiting to be used
Suppose we implemented this using a sort-last approach, where compute threads rasterize lots of triangles at once and fire atomic operations into the Z buffer. How would this compare to a very fast but serialized hardware rasterizer? We can implement this shader if we do some gymnastics. D3D doesn’t support atomics on float data, but luckily, non-negative IEEE floats sort in the same order as their bit patterns do when read as unsigned integers, so we can re-interpret depth as uint and use InterlockedMin(), and everything will be fine.
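Here is a minimal CPU-side sketch of that idea in C++, not the actual shader. `std::atomic` with a compare-exchange loop stands in for InterlockedMin(), and each call to `rasterizeTriangle` plays the role of one compute thread. The type and function names, the pixel-center sampling, and the clear value of 1.0 are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// For non-negative IEEE-754 floats the raw bit patterns sort in the same
// order as the values, so an unsigned integer min is also a float min.
inline uint32_t depthToBits(float z) {
    assert(!std::signbit(z));  // the trick only holds for non-negative depths
    uint32_t bits;
    std::memcpy(&bits, &z, sizeof bits);
    return bits;
}

inline float bitsToDepth(uint32_t bits) {
    float z;
    std::memcpy(&z, &bits, sizeof z);
    return z;
}

// CPU stand-in for InterlockedMin(): keep the smaller of stored and new depth.
inline void atomicMinDepth(std::atomic<uint32_t>& slot, float z) {
    const uint32_t mine = depthToBits(z);
    uint32_t seen = slot.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads 'seen'; retry only while we still win.
    while (mine < seen &&
           !slot.compare_exchange_weak(seen, mine, std::memory_order_relaxed)) {}
}

struct Vec3 { float x, y, z; };  // x, y in pixels; z is depth in [0, 1]

struct DepthBuffer {
    int width, height;
    std::vector<std::atomic<uint32_t>> bits;
    DepthBuffer(int w, int h) : width(w), height(h), bits(std::size_t(w) * h) {
        for (auto& b : bits) b.store(depthToBits(1.0f));  // clear to the far plane
    }
    float at(int x, int y) const {
        return bitsToDepth(bits[std::size_t(y) * width + x].load());
    }
};

// Twice the signed area of triangle (a, b, p).
inline float edge(const Vec3& a, const Vec3& b, float px, float py) {
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

// One "thread" worth of work: float coverage tests, no clipping, no snapping.
void rasterizeTriangle(DepthBuffer& db, Vec3 v0, Vec3 v1, Vec3 v2) {
    float area = edge(v0, v1, v2.x, v2.y);
    if (area == 0.0f) return;                           // degenerate triangle
    if (area < 0.0f) { std::swap(v1, v2); area = -area; }  // normalize winding
    const int x0 = std::max(0, int(std::floor(std::min({v0.x, v1.x, v2.x}))));
    const int y0 = std::max(0, int(std::floor(std::min({v0.y, v1.y, v2.y}))));
    const int x1 = std::min(db.width - 1, int(std::ceil(std::max({v0.x, v1.x, v2.x}))));
    const int y1 = std::min(db.height - 1, int(std::ceil(std::max({v0.y, v1.y, v2.y}))));
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            const float px = x + 0.5f, py = y + 0.5f;   // sample at the pixel center
            const float w0 = edge(v1, v2, px, py);
            const float w1 = edge(v2, v0, px, py);
            const float w2 = edge(v0, v1, px, py);
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;  // outside
            const float z = (w0 * v0.z + w1 * v1.z + w2 * v2.z) / area;
            atomicMinDepth(db.bits[std::size_t(y) * db.width + x], z);
        }
    }
}
```

Because the only write is an atomic min, triangles can be submitted from any number of threads in any order and the buffer ends up with the same contents, which is exactly the order-independence that makes the depth-only case attractive.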
My rasterizer has numerous problems:
1. It doesn’t do any clipping. It will explode if primitives land behind the eye.
2. It doesn’t do any sub-pixel snapping, and it uses floating-point coverage tests, which means it’ll have pinholes.
3. It has no vertex re-use cache, and no easy way to add one, so it’s probably going to do about 2-3x more transform work than the fixed-function path.
4. There’s lots of low-hanging optimization fruit in the code still: spurious float->int conversions, a viewport transform that I could fold into the view-projection transform, and so on.
5. Rasterizing one triangle per thread will lead to utilization hell if there is significant variation in triangle size.
None of the stuff in point 4 seems to matter much. The atomic operations seem to be what hurts the most.
Points 1 and 2 are probably ok, since this is only a test. You couldn’t do a production rasterizer this way, of course, but I can always test with friendly data. I haven’t taken a stab at a fixed-point implementation but I suspect it won’t add enough overhead to skew my results too badly.
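For reference, the fixed-point variant mentioned above might look roughly like the sketch below. This is not code from my rasterizer; the 8 sub-pixel bits, the names, and the inclusive `>= 0` test are all assumptions. The point is that once vertices are snapped to an integer grid, the edge function is exact, so the two triangles sharing an edge always agree about which side a sample is on.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kSubPixelBits = 8;                       // assumption: 1/256-pixel grid
constexpr int64_t kOne = int64_t(1) << kSubPixelBits;  // one pixel in fixed point

struct Fixed2 { int64_t x, y; };

// Snap a floating-point screen position to the sub-pixel grid.
inline Fixed2 snap(float x, float y) {
    return { static_cast<int64_t>(std::llround(double(x) * double(kOne))),
             static_cast<int64_t>(std::llround(double(y) * double(kOne))) };
}

// Center of pixel (x, y) on the same grid.
inline Fixed2 pixelCenter(int x, int y) {
    return { x * kOne + kOne / 2, y * kOne + kOne / 2 };
}

// Twice the signed area of (a, b, p). Integer math, so there is no rounding:
// edgeFixed(a, b, p) is always exactly -edgeFixed(b, a, p).
inline int64_t edgeFixed(Fixed2 a, Fixed2 b, Fixed2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Coverage test for a triangle with positive signed area under edgeFixed.
// Samples exactly on an edge count as inside for both neighbors. That would
// double-blend color, but for a depth-only min it is harmless.
inline bool covers(Fixed2 a, Fixed2 b, Fixed2 c, Fixed2 p) {
    return edgeFixed(a, b, p) >= 0 &&
           edgeFixed(b, c, p) >= 0 &&
           edgeFixed(c, a, p) >= 0;
}
```

A rasterizer that also writes color would want a proper top-left fill rule instead of the inclusive test, so that a sample on a shared edge is claimed by exactly one triangle.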
Lack of clipping is a problem, but Olano and Greer fixed that for us.
Since I’m choosing friendly data, point 5 is also not so bad, for now. This problem might be fixable with some clever programming (e.g. stuff triangles in groupshared memory and bucket by size). I’ll leave that for “future work”.
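As a rough illustration of the bucketing idea, here is a CPU-side sketch that groups triangles by the log2 of their bounding-box pixel count, so that threads working through one bucket see similar amounts of work. On the GPU this would live in groupshared memory; the names, the bucket count, and the use of bounding-box area as the cost estimate are assumptions, and I haven’t measured whether it actually helps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct TriBounds { int minX, minY, maxX, maxY; };  // inclusive pixel bounds

// Bucket index = floor(log2(bounding-box pixel count)), clamped to the last
// bucket. Empty or single-pixel boxes land in bucket 0.
inline int sizeBucket(const TriBounds& t, int numBuckets) {
    const long long w = std::max(0, t.maxX - t.minX + 1);
    const long long h = std::max(0, t.maxY - t.minY + 1);
    long long pixels = w * h;
    int bucket = 0;
    while (pixels > 1 && bucket < numBuckets - 1) {
        pixels >>= 1;
        ++bucket;
    }
    return bucket;
}

// Returns one list of triangle indices per bucket, in submission order.
std::vector<std::vector<int>> bucketBySize(const std::vector<TriBounds>& tris,
                                           int numBuckets) {
    assert(numBuckets > 0);
    std::vector<std::vector<int>> buckets(static_cast<std::size_t>(numBuckets));
    for (int i = 0; i < int(tris.size()); ++i)
        buckets[static_cast<std::size_t>(sizeBucket(tris[i], numBuckets))].push_back(i);
    return buckets;
}
```

Small buckets could then be rasterized one triangle per thread as before, while the large ones could be split across several threads.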
I ran a bunch of experiments on my laptop, which has an i3-4010U (Haswell GT1 GPU). If other people want to post results from other GPUs I’ll be happy to collate them. I rendered Utah teapots in windowed mode at 1366×745 pixels with gradually increasing triangle counts. After producing a depth buffer using either the fixed-function rasterizer or the compute shader, I use a quad to copy depth into the backbuffer so we can see it. I do this copy for both the compute and non-compute tests, so it hurts them both equally.
Let’s see how badly this sucks:
Not nearly as badly as I expected. In fact, throw enough triangles in here and I’m… winning? That’s not supposed to happen.
Ok, so maybe it’s possible to out-perform hardware rasterization, at least on this chip, with the right primitive distribution and no color buffer. What to make of this result? Does this mean that HW rasterization is on the way out, and that compute will take its place? I think not, for the following reasons:
- Every shader op spent on rasterization is a shader op that can’t be used for, you know, shading.
- This is not very power-efficient.
- This is easy to do, but a pain in the neck to do well.
- The hardware is still winning most of the time.
- Adding a color buffer and shading would be an enormous can of worms.
I could imagine somebody trying this on consoles to eke out more perf for shadow maps, but most will probably prefer to do something interesting with async compute instead, since this will allow the HW rasterizer and shader pipes to run in tandem.
What I do think this shows is that there might be value in looking at alternative rasterization modes in the hardware for order-independent cases. In my tests, I’m comparing 1120 really slow rasterizers (10 EUs x 7 threads x SIMD16) against one really fast one. The slow ones are ugly software-based upstarts. The really fast one has an entire ASIC and memory sub-system designed around its every need. Given the same degree of hardware support, a few dozen really fast ones might be really compelling for workloads with the right characteristics (low per-pixel load, high primitive counts, unordered). Most of our real-time rendering workloads do not fit this description, but there are some which do.