When I was writing my renegade disassembler, I spent a great deal of time taking apart the instruction encodings for GCN, and wrote a lot of code for manipulating and inspecting instructions. It occurred to me as I was doing this that I might be able to repurpose that code for something a little more interesting.
One of the problems with static shader analysis is that it can be difficult to judge exactly what the bottlenecks will be, particularly when control flow is taken into account. One can always test the effects of an optimization by measuring real workloads, but this requires implementing the entire shader in a working application. Even if you already have a working application available, implementing a simple shader change in the context of a full application can sometimes be very time consuming. It involves touching not only the shader, but also the application side object which sets up and feeds that shader. It can even involve modifying asset pipelines and toolchains.
There is value in being able to experiment quickly with different permutations of the same algorithm without having to constantly set up complete working implementations. Suppose, for example, we’re contemplating moving some calculation or other into a LUT. If we can demonstrate that, under conservative assumptions, the extra ALU ops we do now are always swamped by the latency of the existing fetches, then we can rule out that optimization without going to all the trouble of implementing and measuring. The ability to “trivial reject” like this can be a significant time saver.
Pyramid now has a “Scrutinizer” feature, which can simulate shader execution on GCN, in order to more easily identify bottlenecks. The remainder of this post will function as a brief tutorial and documentation. The “Scrutinizer” is in a very rough, work-in-progress state, and I welcome feedback.
Of course, no post about Pyramid would be complete without some ranting about how I wish the rest of the IHVs would expose their shader ISAs. Seriously guys, I’ll do this kind of thing for you too if you’ll just make it possible.
Using the Scrutinizer
First, compile an HLSL vertex or pixel shader, select ‘AMDDXX’ from the backends dropdown, then click the ‘Scrutinize’ button. You will be greeted by the window shown below:
In the top-right corner is a representation of the shader’s control flow graph. Scrutinizer will do some control-flow analysis and attempt to identify loops in the shader. You can click on the various control flow nodes to highlight the corresponding instructions.
There are various pieces of information that you can provide to help guide the simulation:
- You can specify a texel format and, if applicable, a filter type for each buffer and texture instruction. The format and filter type will be used to figure out the minimum latency for the fetch.
- For vertex shaders, you can specify a vertex cache hit rate (verts/tri). For pixel shaders, you can specify an average triangle size in pixels. Scrutinizer will use these to figure out how frequently new wavefronts can be issued.
- You can supply an iteration count for each loop. If there are branches inside of a given loop, you can specify whether or not they are to be taken. More elaborate controls (e.g. take this branch on odd numbered iterations) might be feasible, but I didn’t want to complicate things too much. I’m open to suggestions.
Once you’re all done, click the ‘simulate’ button.
Scrutinizer will first traverse the control flow graph to figure out the sequence of instructions that the simulated shader should execute, taking into account the branch/loop information that you provided. Then, it will feed the resulting instruction sequence to a simulator which will track the execution of simulated wavefronts in a simulated GCN CU, and report some statistics. As of this writing, the statistics are as follows:
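The traversal step can be illustrated with a small Python sketch. It assumes the control flow has already been reduced to a tree of basic blocks, loops, and branches; the `Block`, `Loop`, `Branch`, and `flatten` names are hypothetical and are not taken from Pyramid's source, which works on a real control flow graph.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Block:
    """Straight-line run of instructions (hypothetical representation)."""
    instructions: List[str]


@dataclass
class Loop:
    """Loop body plus the user-supplied iteration count."""
    body: List["Node"]
    iterations: int


@dataclass
class Branch:
    """Branch with a user-supplied taken/not-taken decision."""
    taken: bool
    if_taken: List["Node"]
    if_not_taken: List["Node"] = field(default_factory=list)


Node = Union[Block, Loop, Branch]


def flatten(nodes: List[Node]) -> List[str]:
    """Expand the tree into the linear instruction sequence to simulate."""
    trace: List[str] = []
    for node in nodes:
        if isinstance(node, Block):
            trace.extend(node.instructions)
        elif isinstance(node, Loop):
            # Repeat the body once per requested iteration.
            for _ in range(node.iterations):
                trace.extend(flatten(node.body))
        elif isinstance(node, Branch):
            # Follow only the side the user selected.
            side = node.if_taken if node.taken else node.if_not_taken
            trace.extend(flatten(side))
        else:
            raise TypeError(f"unknown node type: {type(node).__name__}")
    return trace


shader = [
    Block(["s_mov"]),
    Loop([Block(["v_add"]), Branch(False, [Block(["image_sample"])])], 3),
    Block(["exp"]),
]
trace = flatten(shader)
```

The resulting flat list is what a wavefront simulator would then consume, one instruction at a time.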
- Clocks per wavefront:
How long each wavefront took on average from start to finish.
Rough estimate of how many verts or pixels you can expect to execute with this shader.
Fraction of time that various instruction types were being executed (VALU, VMEM, and so on). These numbers can give you a sense of where the bottlenecks are in a given shader. If one number is near 100% and the others are not, you know what types of optimizations you might want to try. Note that the simulator does not (cannot) simulate cache misses, so a high ALU utilization can sometimes be misleading. A high VMEM utilization, on the other hand, always indicates a fetch bottleneck.
- Starve rate:
Sometimes the simulated CU will take less time to execute a wavefront than it does to receive a new one. This happens quite often with vertex shaders. The ‘starve rate’ is the fraction of time during which the simulated CU was completely empty. If there is a non-zero starve rate, then it’s possible that the GPU will be underutilized. This kind of thing happened to the CoD team in their subdiv pipeline. Factoring the wavefront issue rate into the simulation can help identify these sorts of problems.
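For intuition, here is a back-of-the-envelope Python version of the starve rate. It assumes, much more crudely than the simulator does, that waves arrive at a fixed interval and that every wave takes the same number of clocks; the function name and parameters are made up for illustration.

```python
def starve_rate(wave_interval: float, clocks_per_wave: float) -> float:
    """Fraction of time the CU sits empty under a fixed-interval, fixed-cost model."""
    if wave_interval <= 0 or clocks_per_wave < 0:
        raise ValueError("interval must be positive and wave cost non-negative")
    if clocks_per_wave >= wave_interval:
        # A new wave always arrives before the previous one retires.
        return 0.0
    # The CU is busy for clocks_per_wave out of every wave_interval clocks.
    return 1.0 - clocks_per_wave / wave_interval
```

With a new wave every 640 clocks and a 160-clock shader, this model leaves the CU empty three quarters of the time.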
- Stall rate:
GCN allows memory access instructions to run asynchronously with ALU instructions. Shaders are required to use the S_WAITCNT instruction to synchronize memory access with dependent ALU operations. The “stall rate” is the fraction of simulated clocks during which the CU could not issue any instructions due to S_WAITCNT dependencies. A non-zero stall rate may be a good indicator of an export or fetch bottleneck. You can also see the stall rate for each individual S_WAITCNT instruction. It will be displayed next to the instruction after simulation. Note that because multiple waves are in flight at a time, the sum of the individual S_WAITCNT stall rates may exceed the overall stall rate.
GCN is the best documented GPU architecture around, but it’s still surprisingly difficult to paint a robust picture of what goes on. I had to make assumptions as I went, which I’ll try to list here. RED TEAM: If any of your architects read my blog, I would love to get clarification on any or all of this.
There are a few simplifying assumptions we make right off the bat.
Assumption 1: No Cache Misses
This is likely to generate some eye rolling, but it is the only reasonable assumption that we can make given what we’re trying to do. Trying to simulate a real cache using anything other than measured data is probably going to give us nonsensical results.
What this means is that any ALU bottlenecks that we identify are suspect. If the ALU/MEM imbalance is severe, then it’s likely that there is a real ALU bottleneck, but if it’s marginal, then the odds are good that your shader is not really and truly ALU bound. You need to be aware of the simulator’s limitations when interpreting its results. Eventually, I’d like to see if I can come up with a way to identify the “latency potential” of a given shader (that is, how many clocks of memory latency it can tolerate before it stalls).
Assumption 2: Coherent flow control
We don’t attempt to model divergent branches between threads, primarily to keep things simple. This is not always true of real workloads, but it can give us a starting point. Handling divergence would be a useful extension, but if we went down that route, we’d have to know what happens to masked out waves for VMEM or export instructions. It’s possible that a masked wave might have a lower latency for these operations, but I’ve not found any docs on this.
Assumption 3: No DS instructions
These aren’t implemented yet. When they are, I’ll probably need to make some sort of assumptions about bank conflicts.
Assumption 4: Vertex fetch
If you’ve been playing around with Pyramid, you may have noticed that every AMD vertex shader begins with an S_SWAPPC instruction. This is used to invoke a “fetch shader”, whose job it is to fetch vertex attributes. The actual code for the fetch shader depends on what you specify in your input layout. Since we can’t know exactly what the driver does for this, we just synthesize a fake “fetch shader” based on the number of input vertex elements. This is done by inserting some buffer read instructions at the front of the simulated instruction stream.
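A minimal Python sketch of the fake fetch shader idea follows. It assumes one buffer read per input vertex element, which is a guess at the simplest possible mapping, and the `fetch_op` placeholder string stands in for whatever buffer read instruction is actually synthesized.

```python
from typing import List


def prepend_fake_fetch_shader(
    instructions: List[str],
    num_vertex_elements: int,
    fetch_op: str = "BUFFER_LOAD (placeholder)",
) -> List[str]:
    """Put one placeholder buffer read per vertex element in front of the stream."""
    if num_vertex_elements < 0:
        raise ValueError("element count cannot be negative")
    # Assumption: one read per element; the real driver may merge or split fetches.
    return [fetch_op] * num_vertex_elements + list(instructions)
```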
Simulating Wave Dispatch
We use simple models to determine how quickly wavefronts can be issued to our simulated CU.
Somebody was kind enough to provide some details from the classified console documentation. So we know that the VGT can queue up 1 new vert per clock, and can test up to 3 indices for vertex re-use.
Assuming a post transform cache hit rate of A verts per tri, we need 64/A triangles to collect a new wave. At one triangle (3 indices) per clock that takes 64/A clocks, but the 64 new verts also need at least 64 clocks at 1 vert per clock, so we can issue a new wave every max( 64, 64/A ) clocks. We assume round-robin scheduling amongst N CUs, meaning that our single simulated CU gets a new wave every N*max(64,64/A) clocks.
The vertex cache hit rate depends on the layout of your index buffer, and can range between 0.5 and 3. A value of 1 is fairly typical for optimized meshes.
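The vertex dispatch model can be written down in a few lines of Python. This sketch takes the slower of the two VGT limits as the pace setter: 64 new verts cannot be queued in fewer than 64 clocks at 1 vert per clock, and 64/A triangles cannot be tested in fewer than 64/A clocks at 3 indices per clock. The function and parameter names are illustrative only.

```python
def vs_wave_interval(verts_per_tri: float, num_cus: int, wave_size: int = 64) -> float:
    """Clocks between new vertex waves arriving at one CU (round-robin over num_cus)."""
    if verts_per_tri <= 0 or num_cus <= 0:
        raise ValueError("verts_per_tri and num_cus must be positive")
    clocks_for_verts = wave_size                 # 1 new vert queued per clock
    clocks_for_tris = wave_size / verts_per_tri  # 1 triangle (3 indices) tested per clock
    # The slower of the two limits sets the wave rate; each CU sees every Nth wave.
    return num_cus * max(clocks_for_verts, clocks_for_tris)
```

For example, with 10 CUs a hit rate of 1 vert per tri gives a new wave every 640 clocks per CU, while 0.5 verts per tri doubles that to 1280.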
Similarly, we know from various online sources that the rasterizer stamps out 16 pixels (4 2×2 quads) per clock from one triangle. The dispatch rate for PS, then, depends on the average triangle size. Let’s assume that the rasterizer is smart and can pack 2×2 quads across triangles into one wavefront. We’ll launch a new wave as soon as we’ve accumulated 16 quads. Given on average p pixels/tri, this takes 16 / max( 1, min( 4, ceil(p/4) ) ) clocks.
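The pixel wave formula translates directly into Python. This is only a transcription of the expression above; the function name and the keyword parameters are illustrative.

```python
import math


def ps_wave_interval(
    pixels_per_tri: float,
    quads_per_wave: int = 16,
    max_quads_per_clock: int = 4,
) -> float:
    """Clocks needed to accumulate one pixel wave for a given average triangle size."""
    # Quads rasterized per clock: at least 1, at most 4, and no more than the triangle has.
    quads_per_clock = max(1, min(max_quads_per_clock, math.ceil(pixels_per_tri / 4)))
    return quads_per_wave / quads_per_clock
```

Single-pixel triangles take 16 clocks per wave, 8-pixel triangles take 8, and anything needing four or more quads saturates at 4 clocks.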
Simulating instruction Issue
The instruction issue for GCN is pretty well documented, so it’s straightforward to simulate the instruction flow. There are a few grey areas which I had to make assumptions about, which I’ll describe below.
There is some confusion on the internet over what the co-issue categories are. Layla says one thing. Steven Hodes says another. I’m assuming that on its turn, each SIMD can issue any or all of the following, on separate waves:
- 1 scalar op (ALU, mem, or branch)
- 1 VALU
- 1 VMEM
- 1 LDS
- 1 GDS/Export
- As many “free” scalars (NOP, S_WAITCNT, etc.) as it can find
- Instruction arbitration is always oldest wave first. In other words, if multiple waves are able to issue the same instruction type, the one that started executing first will issue first. This tends to cause a single wave to surge ahead until it blocks on a memory access, which seems like the logical thing to do. This is probably the right arbitration rule since it ensures that whatever wave is scheduled after it is likely to have a long sequence of instructions to issue before it too is blocked.
- ALU instructions take 4 clocks, except for transcendentals and others which take longer. The details are pretty well documented, although it isn’t quite clear whether some of the divide assist ops count as “transcendental” or not.
- I assume VMEM instructions are fully in order, and that up to 15 can be in flight per wave. According to the ISA docs, each wave has a 4 bit counter which tracks the number of in-flight VMEM instructions. I have seen the driver issue more than 16 VMEMs at a time before waiting on them, and it’s not documented what happens in that case. We’ll assume that the hardware throttles them, that is, a wave will not issue a VMEM until it has fewer than 15 in flight.
- Load instructions run at a rate of 16 DWORDs/clk. Sampling instructions run at up to 4 texels per clock, regardless of channel count. This is based on the fact that there are 4 “texture filter units” per CU, and 16 “load/store units”. Since you only get 16 addresses/clk, there isn’t much benefit to using fewer channels (except for fat formats). Filtered fetches are modified by filter type and texel format, as has been well documented on the internet.
- I’ve ignored any extra cost for 3D textures, arrays, and gradients, because I’ve yet to find public documentation for how much worse it gets.
- Scalar memory instructions return up to 4 DWORDs/clock, and at most one SMEM instruction is retired per clock. Since I’m ignoring cache misses, simulating SMEM instructions in order is probably a good approximation to what actually goes on. The ISA docs state that the scalar memory instructions can return out of order, but beyond that it isn’t clear how they execute. I’m going to assume that the HW is not clever enough to retire SMEMs from multiple waves in a single clock, but I could be wrong.
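The co-issue and arbitration rules in the list above can be sketched in Python as a single issue turn for one SIMD. The category names, the `(start_clock, category)` tuple layout, and the rule that a wave issues at most one instruction per turn are assumptions of this sketch, not documented hardware behavior.

```python
from typing import List, Tuple

# One issue slot per category per turn; "FREE" scalars are not slot limited.
SLOT_CATEGORIES = ("SCALAR", "VALU", "VMEM", "LDS", "GDS_EXPORT")


def arbitrate(ready_waves: List[Tuple[int, str]]) -> List[int]:
    """Pick which ready waves issue this turn, oldest wave first.

    ready_waves holds (start_clock, category of next instruction) for each
    wave that is able to issue. Returns the indices of the waves that issue.
    """
    used_slots = set()
    issued: List[int] = []
    # Oldest wave first: a smaller start clock means the wave began earlier.
    for index in sorted(range(len(ready_waves)), key=lambda i: ready_waves[i][0]):
        category = ready_waves[index][1]
        if category == "FREE":
            issued.append(index)  # NOP, S_WAITCNT and friends always go
        elif category in SLOT_CATEGORIES:
            if category not in used_slots:
                used_slots.add(category)
                issued.append(index)
        else:
            raise ValueError(f"unknown category: {category}")
    return issued


waves = [(30, "VALU"), (10, "VALU"), (20, "VMEM"), (40, "FREE"), (50, "FREE")]
winners = arbitrate(waves)
```

Here the wave that started at clock 10 wins the VALU slot over the one that started at clock 30, while the VMEM wave and both free scalars issue alongside it.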
The only thing I know with any certainty is that the export rate is 16 64-bit pixels/clk, or 8 128-bit pixels/clk as stated by Miss Packman. This means that a full wave exports in 4 or 8 clocks, respectively. I’m assuming that the export rate is the same whether it’s to the parameter cache (VS) or the render backends (PS), though it’s possible that the two have different bandwidths.
Since I’m only trying to simulate a single CU, I have a problem when it comes to modeling the effects of an export instruction. The targets of an export instruction, the ROPs or the param cache, are shared amongst all the CUs. I have no idea what the arbitration rules are for this, but even if I did, I’d have to simulate a full GPU slice in order to model them accurately.
For the moment, I’ve opted to approximate this by assuming that exports are round-robin amongst the CUs, and also assuming that the export pipeline is being completely swamped by the other CUs, so that our simulated CU always has to wait for the worst-case amount of time. This means that the latency of an export instruction will be 4*N, where N is the number of CUs. This will probably over-estimate the cost of an export, but at least this way if the simulator tells us we’re NOT export limited, we can be pretty confident that it’s true.
I can do better than that by simulating multiple CUs, but before I do that it’d be nice to know how close this model is to being correct.
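The export arithmetic above is small enough to spell out in Python. The function names are illustrative, and the 4-clock default simply mirrors the 4*N figure quoted above.

```python
# Export rates quoted above, in pixels per clock, keyed by pixel size in bits.
EXPORT_PIXELS_PER_CLOCK = {64: 16, 128: 8}


def export_clocks_per_wave(bits_per_pixel: int, wave_size: int = 64) -> int:
    """Clocks to export one full wave at the quoted rates."""
    return wave_size // EXPORT_PIXELS_PER_CLOCK[bits_per_pixel]


def worst_case_export_latency(num_cus: int, clocks_per_export: int = 4) -> int:
    """Pessimistic latency: every other CU exports ahead of us, round-robin."""
    if num_cus <= 0:
        raise ValueError("num_cus must be positive")
    return clocks_per_export * num_cus
```

With 10 CUs this model charges 40 clocks per export instruction.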