You Compiled This, Driver. Trust Me….

I was incredibly inspired by Tomasz Stachowiak’s hack of GCN. It made me wish I owned a GCN machine of my own to start trying insane things with. On the other hand, I dooo own an Intel Haswell GPU, and they dooo publish some really dense documentation that’s allegedly enough to write a complete driver with.

And I dooo have experience figuring out how driver people do things.

And it is (was) the holiday break and I dooo (diiiiiid) have a bunch of free time……..

So, here’s the goal:

  1. Obtain a compiled blob of a trivial compute shader using glGetProgramBinary
  2. Splice in our own ISA.
  3. Change the thread counts.
  4. Send it back via glProgramBinary
  5. Do fun things

Here’s some code.

I chose to enter through OpenGL, rather than OpenCL, because I know it a little better, and because it should make it easier to use my hacked up shaders in tandem with graphics without having to deal with inter-op.

Since I only want to figure out how to launch compute shaders, I don’t necessarily need to reverse engineer the entire blob structure. While it might be fun to hook into pixel shaders, it would also be considerably more painful. So I decided to be content with running compute jobs on shader storage buffers. Once I figure out how to patch the ISA, all the resource binding stuff will just work. I can just compile a GLSL shader that uses every resource I might want, throw away all of the driver’s ISA, and replace it with my own. The driver will happily bind all of the shader storage buffers or textures or whatever else it thinks it needs, so if I want to use some resources, I can just use OpenGL to allocate and bind the resources for me.

I might have used DX12, but I haven’t bothered to install Windows 10 yet, and don’t want to just yet because it means a new driver and possibly new blobs. DX12 is probably going to be harder because I’d have to reverse engineer two different pieces of blob written by two different companies (Microsoft and Intel), but it’s more likely to be what you’d use if you were crazy enough to do this in production. Rest assured, people WILL do the same things there before long.

I don’t think anybody’s foolhardy enough to ship a product that does this, but it’s a prospect which ought to be giving Microsoft and the IHVs nightmares. So, DON’T DO THIS IN SHIPPING PRODUCTS! SERIOUSLY! DON’T! If even one game does this, the IHVs and OSVs will freak out, remove pre-compiled binaries from future drivers, and never let us have them again, and you’ll be the one responsible for ruining all the fun. You don’t think they’ll be able to remove it after you ship it? There are ways…. If they have to, they will hash your ISA code and replace every shader they find with backported HLSL that was written by an army of interns. I would love to see forward-compatible GPU ISAs, but it’s got to be something that IHVs roll out on their own terms.

The Blob

Getting a blob is easy enough, but upon inspection, there’s nothing particularly obvious in the blob format. Instead of using ELF like AMD did, they appear to use some ad-hoc format that’s generated by memcpying a bunch of driver data structures around. Let’s start by seeing if we can find the shader instructions. That way, if nothing else, this whole boondoggle will enable me to disassemble GLSL shaders, which is kind of useful.

Fortunately, we know that this blob contains GEN7 instructions, and we know what those look like, because Intel was kind enough to tell us. We also know that all Intel shader threads must end with a SEND instruction which sets a particular “end of thread” bit, because, again, Intel told us.

We can start by searching the blob, byte by byte, for a bit sequence that matches this pattern. Having found the end of the shader, we can then walk backwards until we stop seeing valid instructions. Here’s some pseudo-code:

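// ptr is a byte pointer into the blob.
// Native instructions are 16 bytes; compacted ones are 8.
while( ptr < end && !is_send_eot_instruction(ptr) )
    ++ptr;
end_of_isa = ptr+16;
for(;;)
{
    if( is_legal_native_instruction(ptr-16) )
        ptr = ptr-16;
    else if( is_legal_compact_instruction(ptr-8) )
        ptr = ptr-8;
    else
        break;
}
start_of_isa = ptr;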

So, having written a trivial GLSL shader and run this search on my blob, I find that the ISA starts at offset....1185. Wait, that can't be right.... No, yes, it really is 1185, at least for this particular blob. It turns out it varies, because there's some text embedded in the front part. Ok, so the driver guys didn't bother to dword align their stuff. That's... interesting, but not really a problem.
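
To make the first half of that search concrete, here is a minimal sketch of what the end-of-thread test and the forward scan might look like. The function names are made up for illustration, and the bit positions are assumptions based on my reading of the public GEN7 docs (opcode in bits 6:0 of the first dword, SEND as opcode 0x31, EOT as bit 127), so treat them as something to verify rather than as fact. A bit test this loose will produce false positives, which is why the backwards walk over valid instructions is still needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// ASSUMPTIONS (not confirmed by the article): a native GEN7 instruction is
// 16 bytes, the opcode sits in bits 6:0 of the first dword, SEND is opcode
// 0x31, and the end-of-thread flag is bit 127 (top bit of the last dword).
const uint32_t kOpcodeSend = 0x31;

// Read a dword from an arbitrarily aligned address (little-endian host assumed).
uint32_t ReadDword(const uint8_t* p)
{
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// True if the 16 bytes at p look like a SEND with the EOT bit set.
bool IsSendEot(const uint8_t* p)
{
    const uint32_t dw0 = ReadDword(p);
    const uint32_t dw3 = ReadDword(p + 12);
    return (dw0 & 0x7F) == kOpcodeSend && (dw3 >> 31) == 1;
}

// Scan byte by byte for the first candidate SEND+EOT.
// Returns its byte offset, or `size` if no candidate is found.
size_t FindSendEot(const uint8_t* blob, size_t size)
{
    assert(blob != nullptr || size == 0);
    for (size_t i = 0; i + 16 <= size; ++i)
        if (IsSendEot(blob + i))
            return i;
    return size;
}
```

The scan deliberately steps one byte at a time, since the ISA is not guaranteed to sit at an aligned offset.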

CURBE

Before we go any further, the reader may require a quick primer on how thread IDs are handled on this architecture. GPGPU in this hardware is an extension of the multi-media pipeline. To launch compute threads, the driver configures the media pipe to launch the right number of thread groups. Each thread only knows its threadgroup number. The local IDs for a particular thread within a group are passed directly using a constant URB entry (CURBE). URB here stands for unified return buffer, which, as I understand it, is basically a block of memory that the hardware uses to pass data around between threads in all sorts of different ways. Each hardware thread which is launched as part of a GPGPU command is assigned a fixed number of CURBEs which are pre-loaded into particular registers at launch time.

Now, Intel hardware supports all sorts of different SIMD modes (SIMD8, SIMD16, and so on). These all boil down to different instructions which a thread can execute. Whether a particular compute kernel is SIMD8 or SIMD16 really only matters to the compiler. The compiler chooses which instructions to use, and assigns the appropriate number of CURBEs accordingly. CURBE is allocated in 256-bit chunks, so a SIMD8 thread will have 3 CURBEs (8 DWORDs each for X,Y, and Z), and a SIMD16 thread will have 6. The GL driver uses SIMD16 mode whenever the thread group size is 8 or more. Otherwise it uses SIMD8. As far as I have seen, there is really no difference between these modes apart from the value of the execution mask when the threads are launched.
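
The sizing rules in the last two paragraphs boil down to a few lines of arithmetic. The sketch below is illustrative only: the struct and function names are hypothetical, and rounding the hardware thread count up to a whole number of threads is my assumption rather than something confirmed above.

```cpp
#include <cassert>
#include <cstdint>

struct DispatchShape
{
    uint32_t simdWidth;        // 8 or 16 lanes per hardware thread
    uint32_t curbesPerThread;  // 256-bit CURBE entries holding X, Y, Z IDs
    uint32_t curbeDwords;      // the same amount expressed in DWORDs
    uint32_t threadsPerGroup;  // hardware threads needed for one GL workgroup
};

// Mirrors the driver behaviour described above: SIMD16 once the workgroup
// has 8 or more invocations, SIMD8 otherwise.
DispatchShape ShapeForGroup(uint32_t x, uint32_t y, uint32_t z)
{
    const uint32_t invocations = x * y * z;
    assert(invocations > 0);

    DispatchShape s;
    s.simdWidth = (invocations >= 8) ? 16 : 8;
    // One DWORD per lane per axis, and each CURBE holds 8 DWORDs.
    s.curbesPerThread = 3 * (s.simdWidth / 8);
    s.curbeDwords = s.curbesPerThread * 8;
    // ASSUMPTION: round up, so a partially filled thread still gets launched.
    s.threadsPerGroup = (invocations + s.simdWidth - 1) / s.simdWidth;
    return s;
}
```

For example, a 16x16x1 workgroup comes out as SIMD16 with 6 CURBEs per thread and 16 hardware threads per group.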

Peeking at the area of the blob after the ISA, we eventually notice a pattern. The ISA is always zero-padded to 64 bytes. 512 bytes later is the CURBE data, which is in multiples of 24 or 48 bytes depending on the SIMD width, and contains a telltale pattern of sequential integers. The area in between is a bit of a mystery. The docs say that the EU hardware will attempt to prefetch up to 128 bytes following the end of a program, because the instruction issue hasn't figured out that it's time to stop yet. It could be that this area is just the safety net for instruction prefetch, but if so then it's 4x as large as it needs to be. It's also possible that the driver is reserving additional memory here that's used for some mysterious purpose.

Pre-ISA Fields

At this point, looking back at the area before the ISA starts to tell us things. By varying instruction count, thread group dimensions, and CURBE usage on the same program, and diffing the pre-ISA portions of the blob, we can eventually start to see where things are:

At isa-4 we find the length of the zero-padded ISA block. The same number appears again at isa-128.
At isa-40 is the byte length of the CURBE data. There is no CURBE unless the shader uses a thread's invocation ID.
At isa-32 is the SIMD mode. This is 0 for SIMD8, 1 for SIMD16, which matches the values used in the various dispatch commands that the driver constructs.
At isa-24 is the number of threads in the GL workgroup, the product of the x, y, and z dimensions.
At isa-100 and isa-700, we have the number of hardware threads to be launched per threadgroup, based on the SIMD mode.
At isa-104 and isa-702, we have the number of 256-bit CURBE entries per hardware thread.
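
Here is a rough sketch of how the nearer of those fields could be read and patched, given the byte offset of the first ISA instruction. The struct and helper names are invented for illustration, and treating every field as a 32-bit little-endian integer is an assumption. The copies near isa-700 are left out because their widths are less clear. Since the ISA offset can be odd, the fields are accessed with memcpy rather than through a cast pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte distances back from the first ISA byte, as listed above.
// ASSUMPTION: each field is a 32-bit little-endian integer.
struct PreIsaFields
{
    uint32_t isaLength;        // isa-4
    uint32_t groupInvocations; // isa-24
    uint32_t simdMode;         // isa-32 (0 = SIMD8, 1 = SIMD16)
    uint32_t curbeBytes;       // isa-40
    uint32_t threadsPerGroup;  // isa-100
    uint32_t curbesPerThread;  // isa-104
};

// Read the field that starts `back` bytes before the ISA.
uint32_t ReadField(const uint8_t* blob, size_t isaOffset, size_t back)
{
    assert(back >= 4 && back <= isaOffset);
    uint32_t v;
    std::memcpy(&v, blob + isaOffset - back, sizeof(v)); // unaligned-safe
    return v;
}

// Overwrite the field that starts `back` bytes before the ISA.
void WriteField(uint8_t* blob, size_t isaOffset, size_t back, uint32_t v)
{
    assert(back >= 4 && back <= isaOffset);
    std::memcpy(blob + isaOffset - back, &v, sizeof(v));
}

PreIsaFields ReadPreIsaFields(const uint8_t* blob, size_t isaOffset)
{
    PreIsaFields f;
    f.isaLength        = ReadField(blob, isaOffset, 4);
    f.groupInvocations = ReadField(blob, isaOffset, 24);
    f.simdMode         = ReadField(blob, isaOffset, 32);
    f.curbeBytes       = ReadField(blob, isaOffset, 40);
    f.threadsPerGroup  = ReadField(blob, isaOffset, 100);
    f.curbesPerThread  = ReadField(blob, isaOffset, 104);
    return f;
}
```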

I suspect that the area around isa-700 contains the media interface descriptor, because it has the right values in the right relative positions.

There is enough information here to take an existing blob, splice in new ISA and CURBE data, and launch thread groups in whatever configuration we like. We can change the number of HW threads to launch per thread group, we can override the CURBE data, we can change the amount of CURBE data, and we can change the SIMD mode if we need to. I haven't figured out barriers and shared local memory yet, but I don't expect they'll be that complicated.

Sneaking Past The Guards

The next step is to make sure that we're able to modify blobs and trick the driver into accepting them. We start with an existing blob, change an add instruction to something else, send that back, and....yep... failure. The driver is doing a CRC check of some sort. To get around it, we can take two shaders that differ by a constant value (e.g. i+2 vs i+1) and diff the resulting blobs to see what's changed. Eventually we notice that the last 8 bytes of the blob seem to vary.
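
The diffing step is simple enough to automate. A minimal sketch, with a hypothetical helper name, that reports every run of bytes that differs between two blobs:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns inclusive [first, last] byte ranges where the two blobs differ.
// Only the common length is compared; a size mismatch is not reported.
std::vector<std::pair<size_t, size_t>>
DiffBlobs(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b)
{
    std::vector<std::pair<size_t, size_t>> ranges;
    const size_t n = a.size() < b.size() ? a.size() : b.size();
    size_t i = 0;
    while (i < n)
    {
        if (a[i] == b[i])
        {
            ++i;
            continue;
        }
        // Start of a differing run: extend it as far as it goes.
        const size_t first = i;
        while (i < n && a[i] != b[i])
            ++i;
        ranges.push_back(std::make_pair(first, i - 1));
    }
    return ranges;
}
```

Run on two blobs compiled from shaders that differ by a single constant, one would expect a small range inside the ISA and another covering the tail of the blob.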

Now that we have a pretty good idea where the hash is stored, we need to figure out what the hash function is. Let's try the trick we learned from Tomasz: Pass an enormous blob length, crash the hash check, then peek at the disassembly around the crash site. That worked.

Here is the area around the call into the hash function:

....
0F79B543  mov         byte ptr [ebp-4],1  
0F79B547  cmp         ebx,0Fh  
0F79B54A  jb          0F79B629  
0F79B550  lea         edx,[ebx-8]  
0F79B553  shr         edx,2  
0F79B556  mov         ecx,esi  
0F79B558  call        0F6785B0  
0F79B55D  cmp         eax,dword ptr [esi+ebx-8]  
0F79B561  jne         0F79B629  
0F79B567  cmp         edx,dword ptr [esi+ebx-4]  
0F79B56B  jne         0F79B629  
0F79B571  push        7  
.....

This makes it pretty clear. The hash function is called, and the result is compared with the last two dwords in the blob. I tried the "google the magic numbers" trick and all I could find was that the magic numbers in the hash function appear in implementations of SHA-256. Maybe somebody took a SHA-256 hash and ripped off the rest of the bits to make blob loading go faster. Since I couldn't find the exact code for the hash function, I just took the disassembly and back-ported it to C:

void DriverHashFunction( 
    DWORD* pCRC, 
    const DWORD* pData, 
    DWORD nDwords )
{
    DWORD eax;
    DWORD esi = 0x428A2F98;
    DWORD edx = 0x71374491;
    DWORD edi = 0x0B5C0FBCF;
    DWORD ebx = nDwords;
    const DWORD* ecx = pData;
    while( ebx )
    {
        esi ^= *ecx;       //xor         esi,dword ptr [ecx]  
        eax  = edi;        //mov         eax,edi  
        esi -= edi;        //sub         esi,edi  
        eax >>= 0xD;       //shr         eax,0Dh  
        esi -= edx;        //sub         esi,edx  
        esi ^= eax;        //xor         esi,eax  
        edx -= edi;        //sub         edx,edi  
        edx -= esi;        //sub         edx,esi  
        eax = esi;         //mov         eax,esi  
        eax <<= 8;         //shl         eax,8  
        edx ^= eax;        //xor         edx,eax  
        edi -= edx;        //sub         edi,edx  
        edi -= esi;        //sub         edi,esi  
        eax = edx;         //mov         eax,edx  
        eax >>= 0xD;       //shr         eax,0Dh  
        edi ^= eax;        //xor         edi,eax  
        esi -= edi;        //sub         esi,edi  
        esi -= edx;        //sub         esi,edx  
        eax = edi;         //mov         eax,edi  
        eax >>= 0x0C;      //shr         eax,0Ch  
        esi ^= eax;        //xor         esi,eax  
        edx -= edi;        //sub         edx,edi  
        edx -= esi;        //sub         edx,esi  
        eax = esi;         //mov         eax,esi  
        eax <<= 0x10;      //shl         eax,10h  
        edx ^= eax;        //xor         edx,eax  
        edi -= edx;        //sub         edi,edx  
        edi -= esi;        //sub         edi,esi  
        eax = edx;         //mov         eax,edx  
        eax >>= 5;         //shr         eax,5  
        edi ^= eax;        //xor         edi,eax  
        esi -= edi;        //sub         esi,edi  
        eax = edi;         //mov         eax,edi  
        eax >>= 3;         //shr         eax,3  
        esi -= edx;        //sub         esi,edx  
        esi ^= eax;        //xor         esi,eax  
        edx -= edi;        //sub         edx,edi  
        eax = esi;         //mov         eax,esi  
        eax <<= 0x0A;      //shl         eax,0Ah  
        edx -= esi;        //sub         edx,esi  
        edx ^= eax;        //xor         edx,eax  
        edi -= edx;        //sub         edi,edx  
        eax = edx;         //mov         eax,edx  
        edi -= esi;        //sub         edi,esi  
        eax >>= 0x0F;      //shr         eax,0Fh  
        edi ^= eax;        //xor         edi,eax  
        ecx++;             //lea         ecx,[ecx+4]  
        ebx--;             //dec         ebx  
    }

    eax = edi;
    pCRC[0] = eax;
    pCRC[1] = edx;
}

Re-hashing our blobs with this code makes the driver accept them... almost. It turns out there's one other safety check. There's an embedded length field in the very top of the blob whose value changes with the blob size. Probably a "how many bytes come after this next part". This field is located at offset 13, which is... interesting, but no longer surprising. The driver seems to be verifying that this length field agrees with the length passed to glProgramBinary. We can deal with this during our blob-surgery by just looking at the difference between this field and the known length of our "template" blob, and adjusting the value in the new blob accordingly.
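
Putting the two checks together, the fix-up pass over a spliced blob might look something like the sketch below. The names are hypothetical, the hash is the same arithmetic as DriverHashFunction above written more compactly, and the length field is assumed to be a 32-bit little-endian value. The dword count deliberately rounds down, matching the `shr edx,2` in the disassembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same arithmetic as DriverHashFunction above (a = esi, b = edx, c = edi).
void HashDwords(const uint8_t* data, size_t nDwords, uint32_t out[2])
{
    uint32_t a = 0x428A2F98, b = 0x71374491, c = 0xB5C0FBCF;
    for (size_t i = 0; i < nDwords; ++i)
    {
        uint32_t w;
        std::memcpy(&w, data + 4 * i, 4); // blob data may be unaligned
        a ^= w;
        a -= c; a -= b; a ^= c >> 13;
        b -= c; b -= a; b ^= a << 8;
        c -= b; c -= a; c ^= b >> 13;
        a -= c; a -= b; a ^= c >> 12;
        b -= c; b -= a; b ^= a << 16;
        c -= b; c -= a; c ^= b >> 5;
        a -= c; a -= b; a ^= c >> 3;
        b -= c; b -= a; b ^= a << 10;
        c -= b; c -= a; c ^= b >> 15;
    }
    out[0] = c;
    out[1] = b;
}

// ASSUMPTION: the length field at offset 13 is a 32-bit little-endian value.
const size_t kLengthFieldOffset = 13;

// Fix up a spliced blob in place. `templateSize` is the size of the
// unmodified blob that the driver originally returned. Call this once per
// splice, since the length adjustment is relative.
void FinalizeBlob(std::vector<uint8_t>& blob, size_t templateSize)
{
    assert(blob.size() >= kLengthFieldOffset + 4 + 8);

    // Shift the embedded length by however much the blob grew or shrank.
    // Unsigned wrap-around makes this correct for a smaller blob too.
    uint32_t len;
    std::memcpy(&len, &blob[kLengthFieldOffset], 4);
    len += static_cast<uint32_t>(blob.size() - templateSize);
    std::memcpy(&blob[kLengthFieldOffset], &len, 4);

    // Hash everything except the trailing 8 bytes, then store the result there.
    uint32_t h[2];
    HashDwords(blob.data(), (blob.size() - 8) / 4, h);
    std::memcpy(&blob[blob.size() - 8], h, 8);
}
```

The length has to be patched before hashing, because the length field itself is part of the hashed region.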

That's it, we now have all we need to splice our own code into a program blob. Getting code is a whole other kettle of fish.

Building Our Own Shaders

The hardest part of this whole exercise turned out to be getting a handle on the instruction formats. I decided to sink the time in to build some C++ wrappers around the GEN instruction set. There's an 'encoder', which takes these little C++ things and builds GEN instructions, and a 'decoder' which does the reverse. There's also a partial disassembler.

This way, I can construct shader instructions using C++ objects, like so:

ops.push_back( 
     GEN::UnaryInstruction( 2, GEN::OP_MOV,
            GEN::DestOperand( GEN::DT_U32, 
                 GEN::RegisterRegion( 
                      GEN::DirectRegReference(GEN::REG_GPR,6,0),
                       2,2,1 ) ),
            GEN::SourceOperand( GEN::DT_U32, 
                 GEN::RegisterRegion(
                     GEN::DirectRegReference(GEN::REG_TIMESTAMP,0),
                       2,2,1))
            )
 );

That's... awful to work with, but good enough for now. Over time, if I don't get bored, I may try and write some more helpful tooling based on this.

But first, let's analyze the GPU.

Exploring My Machine

All of my experiments were done on an Intel Core i3-4010U (Intel® HD Graphics 4400), running Windows 8.1. My OpenGL strings are:

GL_RENDERER: Intel(R) HD Graphics 4400
GL_VERSION: 4.3.0 - Build 10.18.14.4080

Code is here

Timing EU Threads

Let's start by exploring how threads are spawned and retired. The EU execution environment tells each thread the ID of the EU that it runs on, as well as the index of its thread slot within that EU. There is also a timestamp register which can be used for high precision performance measurements. The docs state that the timestamp is "sourced from Cr clock". I don't know whether that means it counts clocks or some multiple of clocks, but it's close enough.

By using a shader like the following, we can get a picture of when our threads are starting and stopping:

mov(2)     r6.u<2,2,1>,   tm00.u<2,2,1>		# read thread start time
mov(8)     r5.u<8,8,1>,   sr00.u<8,8,1>		# read state reg for EUID
xor(8)     r4.u<8,8,1>,     r4.u<8,8,1>,     r4.u<8,8,1>	# clear write address reg
mov(2)    r5.u1<2,2,1>,     r6.u<2,2,1>		# copy start time into output 
mul(1)    r4.u2<1,1,1>,    r0.u1<1,1,1>,             120	# compute output address
mov(8)   r127.u<8,8,1>,     r0.u<8,8,1>		# setup payload for EOT
add(1)    r4.u2<1,1,1>,    r4.u2<1,1,1>,     r1.u<1,1,1>	# finish computing write addr
# 
# Code to be timed goes here
#
mov(2)    r5.u3<2,2,1>,   tm00.u<2,2,1>		# read finish time
send         null0.u,             r4.u			# write EUID and times
     desc=0x040a0238 dest=DP_DC0
     len=2  response=0
     OWordBlockWrite ctl=0x2 bind=0x38
send         null0.u,           r127.u			# send EOT message
          desc=0x02000010 dest=SPAWNER
          len=1  response=0
EOT
# Hey, Blue team, how do you like my Disassembly Syntax?

This isn't perfect. Some of those extra instructions could be squeezed out, and of course it's not possible to time anything up to and following the final write-out. The measured cost of this shader is 30 of whatever those things are in the timestamp register, but it's actually higher than that because it's impossible to time the last two messages.

One of the first things I noted was that the EU numbering is not sequential. The EU numbers in a sub-slice are:

0 8
1 9
2 10
3 11
4 12

This is the order in which threads in a group are sent to the EUs. Normally, Intel's diagrams always show their EUs arranged in pairs, and this is reflected in the numbering system. Bit 3 probably has something to do with the physical location of the EU in the slice. It looks like they intend to scale up to 16 in a slice but haven't made it there yet. I've found it helpful to generate a 'thread slot' which is a fractional number of the form: SubSlice*10 + EUID + (thread+0.5)/7. This lets us generate nice scatter plots where one unit is one EU, and none of the dots land on a gridline.
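
As code, that plotting coordinate is a one-liner. This is just the formula above transcribed directly, with a hypothetical function name:

```cpp
#include <cassert>
#include <cmath>

// Fractional plot coordinate: one unit per EU, with the seven thread slots
// of each EU spread across the interior of that unit so that no point
// lands exactly on an integer gridline.
double ThreadSlot(int subSlice, int euid, int thread)
{
    assert(thread >= 0 && thread < 7);
    return subSlice * 10 + euid + (thread + 0.5) / 7.0;
}
```

The half-slot offset is what keeps thread 0 off the gridline at the bottom of each EU's band.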

Also noteworthy is that the timestamps are not consistent between EUs, as illustrated in the graphs below:

eu_absolute_times

On some runs, I get timestamps that all start at zero and count up, which makes sense if the driver somehow resets the TS registers prior to dispatching compute. At other times, I get timestamps that are skewed by EU number. I have no idea what the explanation is for this. Whatever the cause, the effect is that the timestamp register can only be used for relative times local to a particular EU.
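
In practice that means rebasing each sample against its own EU before plotting or comparing anything. A minimal sketch, assuming a hypothetical sample record with the EU number and raw start/end timestamps, and ignoring counter wrap-around:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record written out by each hardware thread.
struct ThreadSample
{
    uint32_t eu;     // EU the thread ran on
    uint64_t start;  // raw timestamp at thread start
    uint64_t end;    // raw timestamp at thread end
};

// Rebase every sample against the earliest start time seen on the same EU,
// so that times are only ever compared within one EU.
void MakeEuRelative(std::vector<ThreadSample>& samples)
{
    std::map<uint32_t, uint64_t> earliest;
    for (const ThreadSample& s : samples)
    {
        auto it = earliest.find(s.eu);
        if (it == earliest.end() || s.start < it->second)
            earliest[s.eu] = s.start;
    }
    for (ThreadSample& s : samples)
    {
        const uint64_t base = earliest[s.eu];
        assert(s.end >= s.start); // wrap-around is not handled in this sketch
        s.start -= base;
        s.end -= base;
    }
}
```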

I tried using the message gateway timestamp, but it didn't seem to work (I only ever received zeros). That leaves me with two different ways to measure time: GL_TIME_ELAPSED, and EU-relative timestamp register readings. I'll use these two methods in my testing.

Testing Thread Dispatch Methods

Let's look at a few different ways of dispatching threads and see what we can see:

Our first strategy is to launch one HW thread per group, and N thread groups. In this case, we observe that the hardware alternates the threadgroup dispatch between successive slices. First one slice, then the other. Within each slice, the lowest available thread slot on the lightest loaded EU is selected (at least, that's what I'd assume, it could just be simple round-robin).

Our next strategy is to try and create very large thread groups. In this case, we see the opposite pattern. Each threadgroup is fully dispatched to one particular sub-slice. Thread dispatch alternates between the EUs in that sub-slice. Here we encounter a snag. There seems to be a hard limit of 64 hardware threads per threadgroup. When I try to use higher numbers it wraps around to lower ones. This means that it's not possible to launch a single threadgroup that can occupy all 70 thread slots in a given sub-slice.

Our last strategy is to dispatch 7 threads per group, in an attempt to encourage the hardware to fill all of the thread slots with multiple thread groups.

Short Threads

If we dispatch the timing shader I showed above, with no extra ops, we get the results below. Note that our shader is so short that the thread dispatch is not able to issue threads quickly enough to keep the machine full. It appears that dispatching large thread groups is the optimal choice when kernel latency is small. OpenGL claims that the durations are about even, but looking at the plots, I suspect that it lies. I don't have an explanation for that 60x70 case. I've run it numerous times, and the data show a clear load imbalance between the EUs. Perhaps it tries to schedule based on load and then breaks ties in some unbalanced fashion. Small kernels occur in a lot of places in graphics, so there might be a real performance issue in this.

short_threads_fat_groups

short_threads_single_groups

short_threads_7_groups

Long Threads

Now let's try inserting 1024 redundant mov instructions so that the machine can get filled up. Here we see that none of these strategies is able to saturate all of the thread slots. There will always be an idle one. Even when we dispatch single groups, the hardware is always capping the number of in-flight threads at 64. Single-groups does appear to have a slight edge in performance, presumably because the thread start/stop times are less aligned with one another and there are fewer bubbles. Note that I'm not doing anything in particular to cause this 64-thread limit. The reference shader into which I spliced my ISA did not use barriers or shared memory, so either the hardware cannot utilize all available thread slots by design, or else the driver is forgetting to set some bit that it ought to be.

long_threads_single_groups

long_threads_7_groups

long_threads_fat_groups

Finding the Instruction Cache Cliff

Next, let's look at how big our kernels can get before we start running into instruction cache overhead. It's important to identify this for the kinds of throughput tests we're going to try next. Graph is below. Inflection points occur at around 3000 instructions (48KB L1) and 8000 instructions (128KB L2). Note that this is best-case based on test kernels stuffed with ALU. I don't know whether either of the cache levels is shared with data. Note also that even though GEN supports a complicated compressed instruction format, I don't yet support it in my encoder. Using this will probably move the cliffs to roughly 6000 and 8000, but it will depend on the instruction mix.
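
The cache sizes quoted in parentheses follow from the instruction counts, assuming every instruction in the test kernels uses the native 16-byte encoding. A tiny sketch of that arithmetic, with hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: every instruction is in the native 16-byte encoding.
const size_t kNativeInstructionBytes = 16;

// Bytes of instruction memory occupied by a straight-line kernel.
size_t KernelBytes(size_t instructionCount)
{
    assert(instructionCount <= SIZE_MAX / kNativeInstructionBytes);
    return instructionCount * kNativeInstructionBytes;
}
```

So 3000 instructions come to 48,000 bytes and 8000 instructions to 128,000 bytes, which is where the rough 48KB and 128KB figures come from.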

icache_cliff

Instruction Issue

Now let's look at instruction issue rates. There are a few pertinent facts scattered about various pieces of Intel documentation. One is that the execution units are 4-wide and that SIMD8 execution is over 2 clocks. Another is that SIMD4 execution has the same latency as SIMD8, for some reason. Docs also state that only a few floating point instructions can be dual-issued (two threads can issue simultaneously), but integer and other ops cannot.

There are a few more things I'd like to know about: First off, I'd like to know what the deal is with destination dependencies. Intel has fields in their instruction format devoted to overriding the register scoreboard for destinations in order to allow instructions to issue faster. Presumably, this means that back to back writes to the same register are slow. Let's test this. We'll run a series of instructions of various types, and pipeline through varying numbers of registers. One reg means every instruction writes the same register, and N-reg means we alternate between N different ones. These tests are always adding the same register to itself and writing the result back. In the case of SIMD16, N indicates the number of register pairs. Results are for 2048 instructions executed in 7680 threads, averaged over 3 runs.

issue_rate_adds

With dual-issue, there's a definite hit when multiple instructions hit the same register back to back. For SIMD4/8, the magnitude of the hit is about 2x. This shows that register write-back and clearing of the scoreboard takes at least as long as execution does. For single-issue, we don't get a penalty, presumably because the hardware is able to cover the latency by switching threads. SIMD16 is about half the issue rate of SIMD8, but of course each issued instruction does more work.

Curiously, there is not as much benefit as one might expect from dual-issue, only 50% instead of the almost 2x we ought to be able to achieve. Why? I suspect that the answer lies in the register file read rate. According to the architecture whitepaper, the EU is only able to read 96 bytes/clock from GPRs. This means that it takes 1.5 cycles to read enough data to dual-issue a pair of instructions. On odd numbered cycles, the second instruction hasn't finished reading its operands, so only the first one goes off. Then, the next clock, we read 96 more bytes and can launch two instructions. That's..... interesting, if true. I could have it wrong.

What happens if we use MOV instructions, which only have one source?

issue_rate_movs

That's.... interesting. Not only do MOVs not issue as fast as they could, they issue at different rates depending on the data type, even if there's no conversion. There's a lesson here. If you're going to do a reg/reg mov on this architecture, make sure it's float typed.

What about that bit that disables the dependency check? The docs seem to think that you should do this often to avoid false dependencies.

issue_rate_noddchk

That's.... interesting. It has the side effect of forcing single-issue. I guess software instruction scheduling is better (where possible).

Extended math instructions contain no real surprises. sqrt, exp, log, rsqrt are all full rate. pow, trig, and fdiv are half rate. Integer divide is 1/16 rate. SIMD16 does not work for idiv, and it's also the first time I've seen SIMD4 behave differently than SIMD8.

extended_math

Block Reads

For our next test, let's look at how much time it takes to send a block read message to the data cache and receive the results. We'll set up a program which does a read, then does a series of N compute instructions before using the result of the read. We repeat that 32 times and measure the EU timestamp delta for that section of code.

By varying N, we can figure out about how much of an ALU/Fetch ratio we want to aim for. I tried the test three different ways: Reading the same address in every thread, reading different addresses in every thread, and reading different addresses with the "invalidate after read" bit set. Invalidate after read is supposed to evict cachelines from the "graphics L3" after they are read. It's intended for spill code, so that any registers that get spilled to scratch memory aren't written back to memory when they don't need to be.

I'm hoping that the read invalidate test will allow me to measure the cost of a Graphics L3 miss (or some approximation thereto). These misses are still going to hit the system L3. Ideally I'd also look at the cost of a miss all the way to DRAM, but I'm not sure how I could set that up. I guess I could loop over a very large dataset, but my encoder doesn't know how to handle loops yet.

block_loads

At full occupancy, it takes about 6 instructions to hide a cache hit, and about 11 or so to hide a "Graphics L3" miss. If the instructions are dual-issuable, you'll of course need more of them.

I'm puzzled by the slowdown that I seem to get when all the loads come from the same address. If you set the read-invalidate bit here, the times skyrocket. This result is a bit unsettling, because it means that a bunch of threads all reading the same address is a pathological slow case (instead of being fast like we'd expect). The same thing happens for scattered reads.

Scattered Reads

For our last experiment, let's have a look at scattered reads. These are the types of reads that the compiler will use when you do a buffer load from within a compute kernel. Every SIMD lane reads from a different address. Let's look at what happens when your thread addresses diverge. We'll do a series of scattered reads and vary the number of cachelines that each read operation touches. No real surprises here. The more divergent the read addresses are, the slower things go.

scattered_reads

Why Did I Go To All This Trouble?

There are a couple of reasons why this is more than just a pointless exercise in reverse engineering. This hardware contains a lot of goodies that the graphics APIs simply do not expose, and I'm wondering if I can exploit any of them to demonstrate cool stuff. I obviously can't do anything in a shipping product, but perhaps we'll find ways of using the hardware that aren't anticipated by current APIs and programming models.

There is a lot of unexposed functionality in this architecture. You can send messages back and forth between threads. Threads can spawn child threads and read/write their register files using message passing. I don't know whether all the right bits are set by the GL driver to enable this, but if it works, it might be fun to experiment with.

You can mix SIMD modes. Use SIMD4 instructions in one cycle and SIMD8/SIMD16 in another. You can use swizzles or register movs to do all sorts of cross-lane communication at very little cost. You can do warp-level programming, where each thread is 1 16-wide instruction stream instead of 16 1-wide streams. You can switch modes multiple times within the same thread if you like.

As some of my Intel friends like to point out, on this hardware, every thread has 4KB of register space. The register file in total is roughly the size of the cache. There's no "use fewer registers to boost occupancy", the occupancy is just there. There is also GPR indexing, and unlike on other architectures it is actually potentially useful. Hardware threads can implement small fixed-size data structures entirely in their register files without it totally sucking.

Small threadgroups with a low local memory footprint could be implemented using one hardware thread and a block of register space, where the one thread executes all N instances of a given group back to back. This can allow more groups to fit in the machine at once, at the cost of higher execution latency per group. Overall throughput might improve, depending on the kernel.

Forget SPMD programming for a second. What happens if we treat this thing like a 20-core in-order CPU with 7-way hyperthreading, limited dual-issue, and a 4-wide SIMD instruction set that's got arbitrary swizzles, write masks, efficient scatter/gather and other goodies that SSE and AVX don't? Maybe I need to go write another raytracer....