If you’re familiar with my blog, you know I’ve lately taken a fancy to doing abominable things to my Intel GPU’s OpenGL driver.
In my last post, I emphasized thread-level GPU programming, and showed that significant gains are possible. In this post, I’m going to look at what happens if I pretend that my Haswell GT2 GPU is really a 20-core CPU with 7-way hyperthreading, a big register file, and a really awesome SIMD instruction set, and then write ray-tracers with it. We will see that the “wrong” way is sometimes better than the obvious way.
Experimental Setup
The benchmark for this post will be constructing a photon map with 10M photons in a variation of the BART Kitchen scene. For simplicity’s sake, I use only diffuse surfaces. This generates a large number of extremely incoherent rays, and is a pretty nice worst-case scenario. The acceleration structure used for all experiments is a simple binary BVH. This is the same experiment I used a while back when I did my AVX Optimized CPU tracer, so I’ll borrow that benchmark for the GPU tracer.
The CPU version of this test can be found here.
My test code can be found in my HAXWell github repo.
All graphs show ray rates in KRays/s on an Intel Core i3-4010U, averaged over three runs. I only use one of my two cores for these tests, but I’ve also included extrapolated two-core results to give an idea of what the CPU’s limits are.
Disclaimer: I Can’t Feed The Beast
I opted to implement this by treating the GPU as a raytracing co-processor. I have a CPU thread which generates a queue of rays and dispatches a compute job for every 4096 rays that are pushed onto the queue. After “priming” the queue with a large number of photon rays, the CPU thread starts reading rays back, and either generating secondary rays or starting new photon paths and pushing the new rays onto the queue. The idea is that we can hopefully pipeline this thing deep enough that the GPU and CPU can run in tandem: the GPU does the raytracing, and the CPU does shader evaluation and the like. This is an ideal structure if we want to build a renderer with a plugin architecture, where arbitrary user code can feed rays to the raytracing core.
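To make the shape of this concrete, here is a minimal C++ sketch of that feeder loop. The Ray/Hit/Job types and the Dispatch/Readback/Shade helpers are hypothetical placeholders, not the actual HAXWell interfaces:

#include <deque>
#include <vector>

struct Ray { float O[3], D[3], tmax; };
struct Hit { int primID; float t, u, v; };
struct Job { /* GPU job handle, e.g. buffer + fence */ };

// Hypothetical helpers standing in for the GL plumbing
Job  DispatchTraceJob(const std::vector<Ray>& batch);   // upload rays, kick a compute job
bool JobFinished(const Job& j);
std::vector<Hit> ReadbackHits(const Job& j);
void ShadeAndRespawn(const std::vector<Ray>& rays, const std::vector<Hit>& hits,
                     std::deque<Ray>& queue);           // secondary rays or new photon paths

static const size_t BATCH = 4096;   // rays per compute dispatch

void FeedTheBeast(std::deque<Ray>& queue)
{
    std::deque<std::pair<std::vector<Ray>, Job>> inFlight;
    while (!queue.empty() || !inFlight.empty())
    {
        // Peel off a batch and kick a GPU job whenever enough rays are queued
        while (queue.size() >= BATCH)
        {
            std::vector<Ray> batch(queue.begin(), queue.begin() + BATCH);
            queue.erase(queue.begin(), queue.begin() + BATCH);
            inFlight.push_back({ batch, DispatchTraceJob(batch) });
        }
        // Drain the oldest finished job; shading either respawns a photon path
        // or pushes a secondary ray, which keeps the pipeline primed
        if (!inFlight.empty() && JobFinished(inFlight.front().second))
        {
            std::vector<Hit> hits = ReadbackHits(inFlight.front().second);
            ShadeAndRespawn(inFlight.front().first, hits, queue);
            inFlight.pop_front();
        }
    }
}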
Unfortunately, try though I might, I wasn’t able to achieve complete GPU utilization with this setup. No matter what method I use for the GPU trace, the GPU is finishing rays faster than I can launch new ones. The best I can do is roughly 60% GPU utilization. A GPUView run makes the situation quite obvious. I have little chunks of GPU work landing in the queue at regular intervals with bubbles in between.
This is largely the result of using OpenGL as my backdoor API. Profiling shows that the bulk of my time is being spent in OpenGL calls. There are ways around this problem, and it probably wouldn’t be an issue if I could do this with OpenCL or Vulkan, but it’s too much hassle for me to figure out how to feed custom ISA to an entirely different API. As much as I’d like to be able to demonstrate a complete, scalable implementation, it’s going to have to wait. It’s a solvable problem but one I can’t solve at the moment.
Instead, I’m simply going to measure the GPU duration of my ray tracing jobs and report which Kernel structure uses the GPU most efficiently, and what the theoretical ray rates are.
Baseline
To start with, we’ll look at doing this in GLSL, the way you might expect to. We’ll take a straightforward stack-based raytracer and shoot one ray per thread. We’ll use a thread-group size of 16, because that is allegedly Intel’s ideal SIMD width. The results are not great. They’re actually worse than what our CPU could do if I bothered to utilize the other cores. Here it is, compared to two different streaming CPU raytracers. Purple bars show what the blue bars would be if they scaled perfectly to both of my cores.
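For reference, the baseline kernel is just the textbook loop. Here is a rough C-style sketch of the per-ray, stack-based traversal that the GLSL shader performs (reusing the Ray and Hit structs from the feeder-loop sketch above); the node layout and the IntersectBox/IsLeaf/IntersectTriangles helpers are illustrative assumptions, not the actual shader:

struct BVHNode { float lower[3], upper[3]; unsigned firstChild; unsigned primCountAndLeafFlag; };

// Illustrative helpers, declared but not shown
bool IntersectBox(const Ray& r, const float lo[3], const float hi[3]);
bool IsLeaf(const BVHNode& n);
void IntersectTriangles(const Ray& r, Hit& h, const BVHNode& leaf);

void TraceRay(const BVHNode* nodes, Ray& ray, Hit& hit)
{
    unsigned stack[64];          // per-thread traversal stack in memory
    int sp = 0;
    stack[sp++] = 0;             // start at the root

    while (sp > 0)
    {
        const BVHNode& n = nodes[stack[--sp]];
        if (!IntersectBox(ray, n.lower, n.upper))
            continue;                               // missed: pop the next node
        if (IsLeaf(n))
            IntersectTriangles(ray, hit, n);        // test the leaf's triangles
        else
        {
            stack[sp++] = n.firstChild + 1;         // second child (visited later)
            stack[sp++] = n.firstChild;             // first child (visited next)
        }
    }
}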
As a quick test, let’s see what happens when we drop the thread group size in our GLSL shader from 16 down to 8. When we do this, we find that things go a little faster.
According to my instruction issue tests, SIMD8 has roughly half the compute throughput of SIMD16, at least for the common dual-issuable floating-point instructions. Yet we get a sizable performance gain. Why? There are two main factors. First, fewer rays per threadgroup means fewer rays in flight, which means there is more cache available to each ray. Second, SIMD8 instructions have lower latency, which means there is less of a relative performance loss when coherence goes to pot near the bottom of the BVH. In situations where lots of thread masking is expected, wider SIMD is a liability.
Going Beyond SPMD
Now let’s look at other ways of doing this.
In CPU raytracing, one of the classic techniques is “packet tracing”, whereby a group of rays (4 or 8) is traversed through the scene along the same path. At each node, intersection tests are performed on the entire group simultaneously using vector instructions (SSE or AVX), and if any ray in the group wants to proceed further, the entire group is dragged along with it. The goal is to amortize the cost of stack manipulation and node access across multiple rays. The results are generally very good when rays are coherent, and terrible when they are not. So let’s implement a simple 8-wide packet tracer in GEN assembly; for reference, the sketch below shows roughly what the per-node test looks like on the CPU.
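This is a generic 8-wide AVX slab test, not code from my tracer; it assumes the packet’s 8 ray origins and reciprocal directions are already laid out in SoA form:

#include <immintrin.h>

struct RayPacket8 {
    __m256 ox, oy, oz;          // 8 ray origins
    __m256 rdx, rdy, rdz;       // 8 reciprocal directions (1 / D)
    __m256 tmax;
};

// Returns a nonzero lane mask if ANY of the 8 rays hits the box; the whole
// packet is then dragged down that subtree together.
static inline int PacketHitsBox(const RayPacket8& p, const float lo[3], const float hi[3])
{
    __m256 t0x = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(lo[0]), p.ox), p.rdx);
    __m256 t1x = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(hi[0]), p.ox), p.rdx);
    __m256 t0y = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(lo[1]), p.oy), p.rdy);
    __m256 t1y = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(hi[1]), p.oy), p.rdy);
    __m256 t0z = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(lo[2]), p.oz), p.rdz);
    __m256 t1z = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(hi[2]), p.oz), p.rdz);

    // Per-lane slab interval: entry is the max of the per-axis mins, exit the min of the maxes
    __m256 tmin = _mm256_max_ps(_mm256_min_ps(t0x, t1x),
                  _mm256_max_ps(_mm256_min_ps(t0y, t1y), _mm256_min_ps(t0z, t1z)));
    __m256 tmax = _mm256_min_ps(_mm256_max_ps(t0x, t1x),
                  _mm256_min_ps(_mm256_max_ps(t0y, t1y), _mm256_max_ps(t0z, t1z)));
    tmax = _mm256_min_ps(tmax, p.tmax);

    __m256 hit = _mm256_and_ps(_mm256_cmp_ps(tmin, tmax, _CMP_LE_OQ),
                               _mm256_cmp_ps(tmax, _mm256_setzero_ps(), _CMP_GE_OQ));
    return _mm256_movemask_ps(hit);
}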
It turns out that there are some really interesting tricks we can play on GEN. The packet tracer differs from the GLSL tracer in a significant way: in the GLSL tracer, the SIMD lanes can take divergent paths through the tree; in the packet tracer, this does not happen. Because all the SIMD lanes in the packet tracer always take the same path, and because Intel gives each thread a full register set, we can afford to store our traversal stack entirely in registers and access it using register-indirect addressing. This is considerably cheaper than using memory.
But we can do something even more interesting than that: Normally the traversal stack contains pointers to the nodes that we need to visit. Each traversal step pushes the next two nodes onto a stack and the next loop iteration reads the top one. We have enough register space that instead of storing node addresses, we can just store whole nodes. So, instead of doing our loads at the start of traversal, we can do them at the end, which allows us to hide a bit more latency.
Here is the skeleton of our raytracer, glossing over the details. The important instructions are written out in my HAXWell assembly syntax. All of my hand-written kernels follow this structure, differing only in the way the traversal and intersection tests are written.
// load first node into working reg
send DwordLoad8(Nodes), node.u, node_address.u
// begin traversal loop
traversal:
// intersection test against node in working reg
// if ray missed, goto pop_stack
// if this is a leaf, goto visit_leaf
// start reading near node into working register for next traversal step
send DwordLoad8(Nodes), node.u, node_address.u
// start reading far node into stack head for some future traversal step
// and increment stack pointer (stored in address register a0)
send DwordLoad8(Nodes), Stack[a0.0].u, node_address.u
add(1) a0.us0, a0.us0, 32
jmp traversal
visit_leaf:
// intersection tests here.
pop_stack:
// check if we've reached the bottom of the stack
cmpeq(1) (f0.0) tmp0.u, a0.us0, 0
// decrement stack ptr before the jump,
// if jump is taken we don't care about underflow
add(1) a0.us0, a0.us0, -32
jmpif(f0.0) finished
// copy next node from stack into working reg for traversal test
mov(8) node.u, Stack[a0.0].u
jmp traversal
finished:
// Store hit info
end
Each of our traversal steps, if it finds a hit, will read the next two nodes. The next node to be visited is fetched directly into the register that our intersection test reads it from. The other, we fetch into the next register on the traversal stack, using indirect addressing on the send instruction. Since the messages are asynchronous, the issuing of the second fetch is able to hide some of the latency of the first fetch. Because we adopt the standard trick of storing sibling nodes in the same cache line, there is (hopefully) no bandwidth penalty from the second fetch. In fact, it’s likely that we come out ahead in bandwidth since we don’t need to worry about the sibling node being evicted and re-fetched later.
Let’s see how an 8-wide packet tracer does:
Compared to GLSL, the results aren’t stellar. Part of the reason is that I couldn’t figure out a nice, cheap way to do ordered traversal for the packet tracer. Another reason is that, as I’ve said before, my GEN assembly code sucks, because I’m not as good at this as the compiler is. The raytracer is very heavy on flow control, and all of my ‘jmp’ macros expand into adds on the program counter, and this is probably the wrong way to do uniform control flow on GEN. Still, the fact that we’re at parity means that there’s probably more performance to be gained from thread-level programming if we had an actual compiler generating this code for us.
And Now For Something Completely Different
Now let’s do something that everyone tells us not to. Instead of trying to use data parallelism, let’s use instruction-level parallelism. We’ll trace one ray per hardware thread, and vectorize our intersection tests as best we can. We use the same stack and traversal strategy as before, but our ray-box test turns from this:
sub(8) tmp0.f, node.f0<0,1,0>, ray_Ox.f
sub(8) tmp1.f, node.f4<0,1,0>, ray_Ox.f
sub(8) tmp2.f, node.f1<0,1,0>, ray_Oy.f
sub(8) tmp3.f, node.f5<0,1,0>, ray_Oy.f
sub(8) tmp4.f, node.f2<0,1,0>, ray_Oz.f
sub(8) tmp5.f, node.f6<0,1,0>, ray_Oz.f
mul(16) tmp0.f, tmp0.f, ray_inv_Dx.f
mul(16) tmp2.f, tmp2.f, ray_inv_Dy.f
mul(16) tmp4.f, tmp4.f, ray_inv_Dz.f
min(8) tmin_x.f, tmp0.f, tmp1.f
max(8) tmax_x.f, tmp0.f, tmp1.f
min(8) tmin_y.f, tmp2.f, tmp3.f
max(8) tmax_y.f, tmp2.f, tmp3.f
min(8) tmin_z.f, tmp4.f, tmp5.f
max(8) tmax_z.f, tmp4.f, tmp5.f
min(8) tmax.f, tmax_x.f, tmax_y.f
max(8) tmin.f, tmin_x.f, tmin_y.f
min(8) tmax.f, tmax.f, tmax_z.f
max(8) tmin.f, tmin.f, tmin_z.f
min(8) tmax.f, tmax.f, ray_tmax.f
cmple(8) (f0.0) tmp0.f, tmin.f, tmax.f
cmpge(8) (f0.1) tmp1.f, tmax.f, 0.0f
and(8) tmp0.u, tmp0.u, tmp1.u
cmpgt(8) (f0.0) null.u, tmp0.u, 0
jmpif(!f0.0) pop_stack
into this:
sub(8) tmp0.f, node.f, ray_O.f
mul(8) tmp1.f, tmp0.f, ray_invD.f
min(1) tmax.f, tmp1.f4, tmp1.f5
max(1) tmin.f, tmp1.f0, tmp1.f1
min(1) tmax.f, tmax.f, tmp1.f6
max(1) tmin.f, tmin.f, tmp1.f2
min(1) tmax.f, tmax.f, ray_data_rcp.f3
cmpgt(1) (f0.0) tmp0.f, tmin.f, tmax.f
cmplt(1) (f0.1) tmp1.f, tmax.f, 0
// if ray missed, goto pop_stack
jmpif(f0.0) pop_stack
jmpif(f0.1) pop_stack
Our triangle intersection test also gets quite interesting, utilizing dot-product instructions and SIMD4x2 instructions with lane swizzling.
On the face of it, this code sucks. Its ALU utilization is abysmal, and the number of instructions per ray is considerably higher. However, in return for poor utilization, we get lower latency, and we’re not going to lose anything near the bottom of the tree when ray divergence kicks in, because divergence simply doesn’t apply anymore.
Let’s see how it does:
Here is a strange result. Tracing one ray per Warp/Wave/Thread on a GPU is not sucking nearly as much as it’s supposed to. We can really start pulling ahead if we do the same thing, but vectorize 8 ray/triangle intersection tests instead of looping over individual ones. If we also tweak our SAH parameter a bit to bias towards a shallower tree, we end up with a marked improvement. We’re at 25% more perf than the GLSL tracer, and starting to show significant gains over a threaded CPU implementation.
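To make the one-ray-versus-many-triangles idea concrete, here is a sketch of the CPU analogue with AVX: one ray tested against 8 triangles at once with Moller-Trumbore, triangles stored in SoA form. This is an illustration of the idea, not the GEN kernel, and it omits the degenerate-determinant check:

#include <immintrin.h>

struct Tri8 {                    // 8 triangles, struct-of-arrays
    __m256 v0x, v0y, v0z;        // vertex 0
    __m256 e1x, e1y, e1z;        // edge v1 - v0
    __m256 e2x, e2y, e2z;        // edge v2 - v0
};

// Returns a lane mask of triangles hit in front of the ray, and the 8 t values.
static inline int IntersectRay8Tris(const float O[3], const float D[3],
                                    const Tri8& tri, __m256* tOut)
{
    __m256 ox = _mm256_set1_ps(O[0]), oy = _mm256_set1_ps(O[1]), oz = _mm256_set1_ps(O[2]);
    __m256 dx = _mm256_set1_ps(D[0]), dy = _mm256_set1_ps(D[1]), dz = _mm256_set1_ps(D[2]);

    // p = D x e2
    __m256 px = _mm256_sub_ps(_mm256_mul_ps(dy, tri.e2z), _mm256_mul_ps(dz, tri.e2y));
    __m256 py = _mm256_sub_ps(_mm256_mul_ps(dz, tri.e2x), _mm256_mul_ps(dx, tri.e2z));
    __m256 pz = _mm256_sub_ps(_mm256_mul_ps(dx, tri.e2y), _mm256_mul_ps(dy, tri.e2x));

    // det = e1 . p  (degenerate det not handled in this sketch)
    __m256 det = _mm256_add_ps(_mm256_mul_ps(tri.e1x, px),
                 _mm256_add_ps(_mm256_mul_ps(tri.e1y, py), _mm256_mul_ps(tri.e1z, pz)));
    __m256 invDet = _mm256_div_ps(_mm256_set1_ps(1.0f), det);

    // s = O - v0,  u = (s . p) * invDet
    __m256 sx = _mm256_sub_ps(ox, tri.v0x), sy = _mm256_sub_ps(oy, tri.v0y), sz = _mm256_sub_ps(oz, tri.v0z);
    __m256 u = _mm256_mul_ps(invDet,
               _mm256_add_ps(_mm256_mul_ps(sx, px),
               _mm256_add_ps(_mm256_mul_ps(sy, py), _mm256_mul_ps(sz, pz))));

    // q = s x e1,  v = (D . q) * invDet,  t = (e2 . q) * invDet
    __m256 qx = _mm256_sub_ps(_mm256_mul_ps(sy, tri.e1z), _mm256_mul_ps(sz, tri.e1y));
    __m256 qy = _mm256_sub_ps(_mm256_mul_ps(sz, tri.e1x), _mm256_mul_ps(sx, tri.e1z));
    __m256 qz = _mm256_sub_ps(_mm256_mul_ps(sx, tri.e1y), _mm256_mul_ps(sy, tri.e1x));
    __m256 v = _mm256_mul_ps(invDet,
               _mm256_add_ps(_mm256_mul_ps(dx, qx),
               _mm256_add_ps(_mm256_mul_ps(dy, qy), _mm256_mul_ps(dz, qz))));
    __m256 t = _mm256_mul_ps(invDet,
               _mm256_add_ps(_mm256_mul_ps(tri.e2x, qx),
               _mm256_add_ps(_mm256_mul_ps(tri.e2y, qy), _mm256_mul_ps(tri.e2z, qz))));

    // hit if u >= 0, v >= 0, u + v <= 1, t > 0
    __m256 zero = _mm256_setzero_ps();
    __m256 hit = _mm256_and_ps(_mm256_cmp_ps(u, zero, _CMP_GE_OQ),
                               _mm256_cmp_ps(v, zero, _CMP_GE_OQ));
    hit = _mm256_and_ps(hit, _mm256_cmp_ps(_mm256_add_ps(u, v), _mm256_set1_ps(1.0f), _CMP_LE_OQ));
    hit = _mm256_and_ps(hit, _mm256_cmp_ps(t, zero, _CMP_GT_OQ));

    *tOut = t;
    return _mm256_movemask_ps(hit);
}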
I gave up trying to do a 16x vectorization, because I’ve finally started running out of registers and I don’t feel like hand-allocating them. I also started working on a QBVH version of this thing, but it’s getting to be a real pain in the neck to write all of this complicated code directly in assembly. Maybe I’ll pick this back up when I’m less burnt out, or maybe we’ll just have to wait until somebody builds me a compiler to see how much better we can get. Either way, we are now ahead of the throughput we could expect from the CPU, and that with terrible hand-written assembly code. Remember, this is running on a low-end, low-voltage laptop chip. Our rays-per-watt number is probably looking really good.
Closing Thoughts
Warp-level programming has the potential to be a useful GPU programming model, and it warrants serious consideration. We can potentially do some very powerful things with it, but we need better tools before we can make the attempt.
As a thought experiment, consider the same kind of implementation on GCN. A small (~32 entry) traversal stack could fit into SGPRs, and be manipulated using the S_MOVREL instructions. 32 extra SGPRs might hurt occupancy a little bit but should be affordable. Single-ray intersection tests could be implemented on the vector units, albeit very inefficiently, but the utilization matters less here than the latency, and the node access is probably going to swamp everything anyway. Once a leaf is reached, intersection testing could be unrolled 64 times instead of 8 or 16. An unroll factor of 64 means shallower trees, meaning less performance lost on the inefficient traversal code.
The idea is sound and, in light of the results I’m getting, might be worth trying. You cannot possibly get a GPU language compiler to generate this code for you, but the required hardware is all there. We just need the software to catch up. Maybe this will help.
Another great post on the idiosyncratic GEN architecture!
Warp-centric programming is a great fit for GEN given its 7 isolated threads and register files per EU.
But don’t forget that there are a couple architectural gotchas working against writing purely warp-centric code:
1. A shared memory allocation for a subslice work group must be at least 4KB and is increased in 1KB steps. You can still perform very warp-centric programming but you have to be a little creative in your workgroup sizing if you want to use all 64KB of local/shared memory across all 56 (Gen8+) subslice hardware threads.
2. There are only 16 hardware synchronization barriers per subslice. If your 8-lane warps are truly isolated then this won’t matter but if your SIMD8 warps are coordinating then this is something to be aware of.
Semi-related: note that Intel just released an OpenCL extension that will force a kernel to be compiled to a desired warp “width” (8/16/32). This makes it possible to reason about how a CTA will map onto a subslice. Knowing the kernel’s SIMD width also makes it easier to reason about the shuffle extensions.
I always thought that single-ray-multiple triangles would be the way to go.
Thus 1 ray vs 8 triangles on AVX2.
And if you have a low-poly world, heck, just blast through all the triangles, so you have no spatial index that is a drag to build and update all the time. No problem with diverging paths.
I’ve done some experiments with single-ray-vs-8-voxels too, to photon map a voxel world.
I tried GLSL, OpenCL and AVX2 with this:
http://thelittleengineerthatcould.blogspot.ca/2015/01/rocket-engine-illumination.html
This is really thought-provoking. I love bending hardware in ways its designers never envisioned, so this fascinates me. I am curious how I can apply these methods to AI algorithms, and though I have countless hours of ASM programming experience, a C compiler would be appreciated to help postpone the burnout you mentioned. Thanks for sharing, and now I must check out your other posts as well for more thought-gems like this.