Texel Shader Discussion – The Burning Basis Vector

I said I wanted to spark discussion, and I succeeded 🙂 Timothy Lottes blogged back at me again. He writes:

This made up example is just to point out that adding a TS stage serves as an amplifier on the amount of latency a texture miss can take to service. Instead of pulling in memory, the shader waiting on the miss now waits on another program instead. Assuming TS dumps results to L2 (which auto-backs to memory),

Dump out arguments for TS shading request
Schedule new TS job
Wait until machine has resources available to run scheduled work (free wave)
Wait for shader execution to finish filling in the missing cache lines
Send back an ack that the TS job is finished
etc

I’m suggesting that the calling wave be re-used to service the TS misses (if any), so instead of waiting for scheduling and execution, it can jump into a TS and do the execution itself. You actually can’t schedule a new wave and wait unless you’ve explicitly held back some registers for the TS, because if the caller is allowed to max out the machine, the new wave can never launch and you deadlock. Borrowing the calling wave to service the miss will remove most of the launch/signaling latency from the equation and also prevent that problem.

Any attempt to save out wave state and later restore (for wave which needs to sleep for many 1000’s or maybe many 10000’s of cycles for a TS shader to service a miss), is likely to use more bandwidth to save/restore than is used to fetch textures in the shader itself running without a TS stage. Ultimately suggesting it would be better to service what would be expected TS misses long before a shader runs, instead of preemptively attempting to service while a shader is running…

Not if we’re re-directing calling wave. In that case regs stay where they are, and you only need to spill values live across TS, and then only if they’re in volatile regs and not easily re-computable. Compiler has a great deal of wiggle-room here, but this does depend on TS complexity. Complicated TS with huge reg footprint called in the middle of a shader will still be painful.

4 clocks per op * 256 operations * 5 waves interleaved + 4 batches of fetch * 384 cycles of latency
= 6.6 thousand cycles of run-time
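
For reference, the arithmetic in that estimate checks out. A quick sketch, where the variable names are mine and the numbers are Tim's:

```python
# Numbers come from the quoted estimate; the names are illustrative only.
clocks_per_op = 4
ops = 256
waves_interleaved = 5
fetch_batches = 4
fetch_latency = 384  # cycles of latency per batch of fetches

alu_cycles = clocks_per_op * ops * waves_interleaved  # 5120
fetch_cycles = fetch_batches * fetch_latency          # 1536
total_cycles = alu_cycles + fetch_cycles              # 6656, i.e. about 6.6 thousand

print(total_cycles)
```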

That seems like a heavyweight example. Let’s work through the gradient mapping one. It’s an image load, an extremely well cached buffer load, and then a buffer store back to L1. The buffer load and store add 4 cycles each to whatever the latency is on the image load. Might also have some extra addressing math and such in there, but we’re talking percentages, not orders of magnitude. There is a hit relative to an ordinary texture, and I never said there wouldn’t be (at least, not on purpose). What I’m saying is that this hit can be made small enough that it’s worth paying for in order to gain something else.
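
To make "percentages, not orders of magnitude" concrete, here is a rough CPU-side sketch of the gradient mapping TS and its latency. The function names, the made-up gradient ramp, and the choice of 384 cycles for the image load (borrowed from Tim's estimate) are all assumptions for illustration, not measurements.

```python
# Hypothetical gradient-mapping texel shader (TS), modeled on the CPU.
# index_line: one line of 8-bit indices (the image load).
# gradient:   a 256-entry RGBA8 table (the well-cached buffer load).
def gradient_map_ts(index_line, gradient):
    # The returned list stands in for the buffer store back to L1.
    return [gradient[i] for i in index_line]


# Latency model from the text: the buffer load and the store each add
# 4 cycles on top of whatever the image load costs.
def ts_latency(image_latency, buffer_load=4, buffer_store=4):
    return image_latency + buffer_load + buffer_store


gradient = [(v, v // 2, 255 - v, 255) for v in range(256)]  # made-up ramp
shaded = gradient_map_ts([0, 128, 255], gradient)

image_latency = 384  # assumption: reuses the fetch latency from the quote
overhead = ts_latency(image_latency) / image_latency - 1.0
print(shaded[1], f"{overhead:.1%}")
```

Under those assumptions the TS costs a couple of percent over a plain fetch. Extra addressing math would push that up, but not by an order of magnitude.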

Unique virtual shading cache shaded in the same 8×8 texel tiles one might imagine for TS shaders, but in this case async shaded in CS instead. With background mechanism which is actively pruning and expanding the tree structure of the cache based on the needs of view visibility. Each 8×8 tile with a high precision {scale, translation, quaternion}, paired with a compressed texture feeding a 3D object space displacement, providing texel world space position for rigid bodies or pre-fabs

I think we have two distinct classes of use case for this thing, and they might not warrant exactly the same solution. One is decoupled shading, where we wanna run our shader and then filter the result. The other is being able to invent smaller texture encodings and filter these directly as part of a shading pipeline (forward, decoupled, or whatever). What I’ve laid out does not work well for the first case, unless you’re filtering something expensive and low-frequency that you can magnify and re-use across pixels.

As expected, @AndrewLauritzen took me to task on Twitter. 🙂 I’ll try not to quote him too far out of context.

To play ball though, another problem not mentioned in the blog is that samplers are often pure queues
i.e. even if you have a thread/registers for the TS by some mechanism, it can’t use the sampler easily
.. and if you take a while to run a texel shader (w/o sampling), you’ll completely trash your perf.

The pure queue is mentioned. I try to address the re-entrant sampler there as well. This may not fit Gen as well as it does GCN, since you have more EUs/sampler. Even if you need to disallow sampling in TS, loads-only still leaves a lot of useful options open.

@JoshuaBarczak @JJcoolkl @SebAaltonen I think you’re making bad assumptions about both the cost of mem access and how long things live in L1$

Data lives in L1 for about as long as it takes the sampler to filter an entire thread’s worth of texels, and not much longer. What I’m suggesting is a memory->compute->sampler stream, via cache, instead of a memory->sampler stream for the “light stuff”.

For the heavyweight “decoupled computation” apps, L1 is probably the wrong place, because you don’t want every single ordinary fetch to kick the results out. Higher cache levels for that stuff would make more sense, but the basic idea of “compute what you need”, is still worthwhile, especially in complex scenes where “figure out what you need” is not cheap.

Right but “light stuff” pretty much means “no sampling” which is inexpensive already.

“Light stuff” means “read some data from memory and do pre-filtering transformations”. Of course this will take longer than the memory latency, but the point is how much longer? Is it so much longer that existing compute can’t cover it? If not, what else do we gain? If the alternatives are decode first (mem footprint, dev complexity, popping/instability) or filter manually (slow), then slightly worse perf and lower footprint can be a worthwhile trade-off, especially when residency management is on the app’s plate.

For scheduling, using calling thread is most workable but doesn’t handle de/amplification nicely.

Does it need to? You don’t want to amplify too much or you start shading stuff you don’t use. One thread/cacheline seems workable even for those of you with tiny SIMDs. Y’all can do a 4×4 RGBA8 block in SIMD16 mode. I’ll throw you a bone and disallow output blocks larger than that 🙂 Probably makes sense to disallow TS output blocks larger than one cacheline regardless. Poor GCN is stuck speculating to fill waves but there are probably heuristics to get decent results (e.g. pull in 3 blocks on closest corner).
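
The sizing argument can be sketched in a few lines. The 64-byte cacheline matches the numbers used elsewhere in this post; the helper names are mine, and one lane per texel is an assumption about how a TS would map work.

```python
CACHELINE_BYTES = 64  # matches the line size used elsewhere in the post


def block_bytes(width, height, bytes_per_texel):
    return width * height * bytes_per_texel


def blocks_to_fill(simd_width, width, height):
    """Blocks needed to fill a SIMD, assuming one lane per texel."""
    return max(1, simd_width // (width * height))


rgba8_4x4 = block_bytes(4, 4, 4)         # 64 bytes: exactly one cacheline
gen_blocks = blocks_to_fill(16, 4, 4)    # 1: a SIMD16 thread covers one block
gcn_blocks = blocks_to_fill(64, 4, 4)    # 4: the missed block plus 3 speculative

print(rgba8_4x4 <= CACHELINE_BYTES, gen_blocks, gcn_blocks)
```

That is where "pull in 3 blocks on closest corner" comes from: a 64-wide wave has room for the block that missed plus three neighbours.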

Basically the hardware proposal is hugely complex to get decent efficiency and it’s not even clear…
… that you’d end up faster than doing it in user land/coarse grained. I suspect you wouldn’t.

For the “heavyweight” cases it’s not entirely clear, but the idea would seem to warrant further study.

For “lightweight”, it might not end up faster, but it’ll end up simpler, smaller, and/or better looking, and I’d be willing to spend a portion of my “faster” budget for that. Simpler is good for me and my dev team, and the other stuff is good for my artists and for my end-users.

UPDATE: Tim’s second reply here

Basically the PS stage shader gets recompiled to include conditional TS execution. This would roughly look like:

(1.) Do some special IMAGE_LOADS which set bit on miss in a wave bitmask stored in a pair of SGPRs.
(2.) Do standard independent ALU work to help hide latency.
(3.) Do S_WAITCNT to wait for IMAGE_LOADS to return.
(4.) Check if bitmask in SGPR is non-zero, if so do wave coherent branch to TS execution (this needs to activate inactive lanes).

Continuing with TS execution,

(5.) Loop while bitmask is non-zero.
(6.) Find first one bit.
(7.) Start TS shader wave-wide corresponding to the lane with the one bit.
(8.) Use the TEX return {x,y} to get an 8×8 tile coordinate to re-generate and {z} for mip level.
(9.) Do TS work and write results directly back into L1.
(10.) When (5.) ends, re-issue IMAGE_LOADS.
(11.) Do S_WAITCNT to wait for loads to return.
(12.) For any new invocations which didn’t pass before, save off successful results to other registers.
(13.) Check again if bitmask in SGPR is non-zero, if so go back to (5.).
(14.) Branch back to PS execution (which needs to disable inactive lanes).
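
As I read it, Tim's loop can be modeled roughly like this. It is a CPU-side sketch only: `run_ps_with_embedded_ts`, `shade_tile`, and the dict standing in for L1 are hypothetical, steps (2.), (3.) and (11.) are not modeled, nothing is ever evicted, and skipping a tile that an earlier iteration already produced is my assumption rather than something Tim specified.

```python
def lowest_set_bit(mask):
    """Index of the lowest set bit; models step (6.)."""
    return (mask & -mask).bit_length() - 1


def run_ps_with_embedded_ts(lane_tiles, l1, shade_tile):
    """Model steps (1.)-(14.) for one wave.

    lane_tiles: per-lane (x, y, mip) tile coordinate that the lane samples.
    l1:         dict standing in for L1, mapping tile -> shaded data.
    shade_tile: hypothetical TS entry point, run wave-wide per missing tile.
    """
    results = [None] * len(lane_tiles)
    done = [False] * len(lane_tiles)
    ts_runs = 0
    while True:
        # (1.)/(10.) issue the loads; each miss sets that lane's mask bit.
        miss_mask = 0
        for lane, tile in enumerate(lane_tiles):
            if done[lane]:
                continue
            if tile in l1:
                results[lane], done[lane] = l1[tile], True  # (12.)
            else:
                miss_mask |= 1 << lane
        # (4.)/(13.) no misses left: branch back to the PS, step (14.).
        if miss_mask == 0:
            return results, ts_runs
        # (5.) service the misses one lane at a time.
        while miss_mask:
            lane = lowest_set_bit(miss_mask)   # (6.)
            tile = lane_tiles[lane]            # (8.)
            if tile not in l1:                 # assumption: skip duplicates
                l1[tile] = shade_tile(tile)    # (7.), (9.)
                ts_runs += 1
            miss_mask &= ~(1 << lane)


lane_tiles = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (2, 3, 1)]
l1 = {(1, 0, 0): "resident"}
results, ts_runs = run_ps_with_embedded_ts(
    lane_tiles, l1, lambda tile: ("shaded", tile))
print(ts_runs, results)
```

With four lanes, one resident tile, and two lanes sharing a missing tile, the TS runs twice and a single re-issue of the loads completes the wave.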

What you describe suggests that the texture unit is much smarter than I expected it to be. VMEM ops are ordered, so my mental model of the texture unit is a serial pipeline processing UV quads. For each quad it calcs some addresses, fetches some lines from L1 (one at a time) and filters. Expensive filtering repeats this from 1-N times. It’s possible that my mental model is wrong in which case a lot of what I wrote is bogus, but if Tex unit is already smarter then the smarts can be repurposed for better TS scheduling (e.g. packing missed lines into waves).

I had it more like this:

In the caller:

   1.  Do magic IMAGE_LOADs.  Start sending UVs
   2.  ALU or whatever.
   3.  s_waitcnt
   4...N-1.  one or more jumps into TS to produce lines
   N.  filtered results are now in dest reg. resume

Meanwhile, texture unit is doing this:

   for each UV quad from caller wave
        for each required line
             if(missed && is_ts_fetch)
                 verify calling wave has reached s_waitcnt 
                 send tile location to calling wave
                 signal wave to jump into TS
             block until not missed anymore
             filter texels and send data to VGPRs
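
The texture unit loop above can be mocked up on the CPU to show the ordering. Everything here is a stand-in under my own assumptions: `caller_runs_ts` represents "signal the calling wave and wait for it to write the line", every fetch is treated as a TS fetch, and the "filter" is just an average.

```python
def texture_unit(quads, l1, caller_runs_ts, ts_log):
    """In-order model of the loop above.

    quads:          quads in issue order; each is a list of line ids it needs.
    l1:             dict standing in for L1, mapping line id -> texel values.
    caller_runs_ts: hypothetical stand-in for jumping the calling wave into
                    the TS and blocking until the line has been produced.
    ts_log:         records which lines triggered a TS jump.
    """
    filtered = []
    for quad in quads:                    # quads are processed strictly in order
        texels = []
        for line in quad:                 # lines are fetched one at a time
            if line not in l1:            # missed && is_ts_fetch
                ts_log.append(line)
                l1[line] = caller_runs_ts(line)
            texels.extend(l1[line])       # the line is present by this point
        filtered.append(sum(texels) / len(texels))  # stand-in for filtering
    return filtered


l1 = {"a": [1.0, 1.0]}
ts_log = []
out = texture_unit([["a", "b"], ["b", "c"]], l1, lambda line: [3.0, 3.0], ts_log)
print(out, ts_log)
```

Line "b" is shaded once for the first quad and is simply a hit for the second, which is the behaviour strict ordering buys within one wave.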

This addresses a lot of your points:

(A.) Step (10.) has no post load ALU before S_WAITCNT, so it hides less of its own latency (even though it will hit in the cache).

The “filter texels” part can happen in parallel with a subsequent TS miss. No need for load on other end if texture unit knows how to push results. See original post for thinking on VMEM ops from TS.

(C.) Need a hardware change to ensure TS results are in pinned cache lines until after the first access finishes serving a given invocation. This way the IMAGE_LOAD in (10.) is ensured a hit to have some guaranteed forward progress. There is a real problem that 8×8 tiles generated early in the (5.) loop might normally be evicted by the time all the data was generated.

Not so bad if everything is ordered, and if filter can read lines as soon as they arrive. Can evict after that.

(E.) The TS embedded in PS option would lead to some radically extreme cases like multiple waves missing on the same 8×8 tiles, and possibly attempting to regenerate same tiles in parallel.

Shouldn’t happen with strict ordering. What might happen is different CUs concurrently shading the same lines, unless writes also go in L2. L2 means less re-shade, but higher TS launch delay.

(F.) The TS embedded in PS option would result in extreme variation in PS execution time. Causing a requirement for more buffering in the in-order ROP fixed function pipeline.

That depends on how variable the miss rate and TS execution time are. That one is a problem I hadn’t considered, but variability in PS is hardly a new problem? All you need for that is flow control.

(D.) Attempt to do random access at 64-bit/texel (aka fetch from 64 8×8 tiles) which all miss. That’s 64*8*8*8 bytes (32KB) or double the size of the L1 cache.

With serialized quads limit becomes 4 at a time, so at least it all fits. Bad locality and big blocks is to TS as heavy amplification is to GS. 8x8x64bit is 8 lines of 8 texels each. Getting the feeling that “shade fixed size blocks” should be replaced with “shade a line” or possibly “shade N lines”. Up to programmer/compiler to map lines to Wavefront lanes. Harder to use, but better.
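
The footprint numbers in this exchange are easy to verify. In this sketch the 16KB L1 is what Tim's "double the size of the L1 cache" implies, and the 64-byte line is what "8 lines of 8 texels each" implies; the names are mine.

```python
CACHELINE_BYTES = 64
L1_BYTES = 16 * 1024  # implied by "32KB ... double the size of the L1 cache"


def tile_bytes(width=8, height=8, bytes_per_texel=8):
    return width * height * bytes_per_texel


worst_case = 64 * tile_bytes()        # every lane misses a different tile
serialized = 4 * tile_bytes()         # one quad at a time: at most 4 tiles
lines_per_tile = tile_bytes() // CACHELINE_BYTES
texels_per_line = CACHELINE_BYTES // 8

print(worst_case, serialized, lines_per_tile, texels_per_line)
```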

@AndrewLauritzen Tweeted a similar question:

But how do you handle the divergence? You want to invoke TS in cache-line blocks. But calling thread may hit 1->N cache lines and require that many invocations.

Without launching more threads, the only option is to shade lines serially. GCN can (maybe) pack them into one wave, but Gen is stuck looping. After shading a line you’d have to rely on spatial locality to avoid that cost for the other taps in a quad. Same strategy as with memory reads (as I understand it), but latency is longer.
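
A small sketch of the "1->N" problem, assuming one cacheline holds a 4×4 texel block as in the RGBA8 case above. The function name and the tap coordinates are made up for illustration.

```python
def lines_touched(taps, block_w=4, block_h=4):
    """Distinct cachelines (as block coordinates) hit by a list of texel taps."""
    return {(x // block_w, y // block_h) for x, y in taps}


# A bilinear footprint in the middle of a block stays inside one line...
interior = lines_touched([(1, 1), (2, 1), (1, 2), (2, 2)])
# ...while one straddling a block corner needs four, shaded serially.
corner = lines_touched([(3, 3), (4, 3), (3, 4), (4, 4)])

print(len(interior), len(corner))
```

The number of serial TS invocations for a fetch is the size of that set, so the cost depends on how often footprints straddle line boundaries and how much neighbouring taps then get to reuse.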

Tim also says:

(B.) Need to assume texture can miss at the point where the wave has already peak register usage in the shader, this implies a shader needs that plus the TS needs in terms of total VGPR usage. Given the frequency of PS work which is VGPR pressure limited without any TS pass compiled in, this is quite scary. Cannot afford to save out the PS registers. Also cannot afford to make hardware to dynamically allocate registers at run-time just for TS (deadlock issues, worst case problem of too many waves attempting to run TS code paths at same time, etc). So VGPR usage would be a real problem with TS embedded in PS.

@rygorous says the same:

Wait, how are you gonna reuse the calling wave for the TS? The TS might need more registers than your wave does!

A TS/PS pair only needs more registers if a TS call happens at “peak GPR”. I don’t have a clear sense of how frequently that happens, or whether it can be fixed by tweaking compiler’s scheduling heuristics when TS calls are used. This definitely favors small short TS over big complicated TS. Complicated TS + Complicated PS + Thrashing = Perfect Storm
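
Put as a rough model: the pair needs the larger of the PS peak and whatever is live at the call plus the TS footprint. The register counts below are invented, and the 256-entry VGPR file with a 10-wave cap is my assumption about GCN, so treat the occupancy numbers as illustrative.

```python
def vgprs_for_pair(ps_peak, ps_live_at_call, ts_vgprs):
    """VGPRs a PS needs when a TS runs on the same wave at the call site."""
    return max(ps_peak, ps_live_at_call + ts_vgprs)


def waves_per_simd(vgprs, vgpr_file=256, max_waves=10):
    # Assumption: GCN-like limits, ignoring allocation granularity.
    return min(max_waves, vgpr_file // vgprs)


off_peak = vgprs_for_pair(ps_peak=48, ps_live_at_call=20, ts_vgprs=16)
at_peak = vgprs_for_pair(ps_peak=48, ps_live_at_call=48, ts_vgprs=16)

print(off_peak, waves_per_simd(off_peak))  # TS fits under the PS peak
print(at_peak, waves_per_simd(at_peak))    # TS call at peak costs a wave
```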

Tim closes with:

…. Any time anyone builds a GPU based API which has a “return”, or a “join”, this is an immediate red flag for me. GPUs instead require “fire and forget” solutions, things that are “stackless” and things which look more like message passing where a job never waits for the return. The message needs to include something which triggers the thing which ultimately consumes the data, or the consumer is pre-scheduled to run and waits on some kind of signal which blocks launch.

I hear that, but for this exercise we’ve got compute on both ends of the pipe with hardware in the middle.

An alternative might be to add ways of making shader-based filtering easier/cheaper. For example, given a UV and descriptors, calculate the texel locations and weights for bilinear/trilinear etc. This adds register pressure, and I’m not sure how aniso would work without some kind of implicit looping. It’s also not very work-efficient because you’re decoding every texel all the time.
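
For the bilinear case, the helper I have in mind would hand back something like the following. This is a CPU sketch with clamp addressing and texel centers at half-integers; `bilinear_footprint`, `sample_bilinear`, and `decode_texel` are hypothetical names, not an existing API.

```python
import math


def bilinear_footprint(u, v, width, height):
    """Return [((x, y), weight), ...] for the four bilinear taps."""
    x = u * width - 0.5   # texel centers sit at half-integer coordinates
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def clamp(i, n):
        return max(0, min(n - 1, i))

    taps = []
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            coord = (clamp(x0 + dx, width), clamp(y0 + dy, height))
            taps.append((coord, wx * wy))
    return taps


def sample_bilinear(u, v, width, height, decode_texel):
    # The shader decodes every tap itself, every time it samples.
    return sum(w * decode_texel(x, y)
               for (x, y), w in bilinear_footprint(u, v, width, height))


taps = bilinear_footprint(0.5, 0.5, 4, 4)
value = sample_bilinear(0.5, 0.5, 4, 4, lambda x, y: float(x))
print(taps, value)
```

Four coordinates and four weights per fetch is where the register pressure comes from, and the `decode_texel` call per tap is the work-inefficiency: nothing decoded here is shared with neighbouring pixels.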

The only other option is to split the kernel. Multi-passing is one form of that, but it’s got its limits. A GRAMPS-like pipeline, where the shader is cut up into tiny micro-kernels, is another form, and TS could be made to fit that nicely, but that seems like it comes with even more HW changes.

2 Comments

woods

October 30, 2015 at 7:05 am

Thanks for the GRAMPS-link. That would be quite ideal compared to present API mess (now if they would only resuscitate Larrabee…)

SomeGuy

October 29, 2015 at 10:04 am

Hey. What about situation when for example whole full screen pass is accessing the same texel? Should TS be called for each thread individually or just for the first one and other wavefronts will be stalled? Also I assume each thread in that scenario would write to the same location in L1 so wouldn’t that cause race condition for thousands of threads?
