The latest Pyramid now contains an option to compile DX shaders directly against the AMD driver, and comes equipped with its own disassembler. The disassembler is modular enough that it should be able to be ripped out and re-used by anybody who needs one.
I accomplished this by tinkering on my own time and with my own resources. I received no inside help. This is obviously not something that is officially sanctioned, endorsed or supported by AMD. In fact, I suspect that it will irritate a few people, and I apologize if you’re one of those. It is not my intention to do harm, only to better equip myself and my users.
The purpose of this post is to describe exactly how I pulled this off. I have three objectives:
- To entertain the reader with my tale of nerdery.
- To demonstrate that this feat could have been (and was) accomplished using only publicly available tools and information.
- To provide some more incentive for the Blue and Green Teams to follow the Red Team’s example and provide something similar.
Green Team! Blue Team! Seriously guys, I will do the same for you. It is in your best interests for devs to study your hardware as much as they already study GCN. Two functions and a spec are all I need.
Step 0: The Motivation
A while back, I was working on some shaders at work, and was like, “I know, I’ll use Pyramid Shader Analyzer” to see the generated GCN assembly. That way I’ll be able to gauge how the shader runs on AMD cards, thus giving AMD an important tactical advantage over its competition (hint, hint).
So I pulled it down, and was disappointed to learn that it does not want to function on my Nvidia machine. As you may recall, Pyramid is a thin wrapper around AMD CodeXL. CodeXL does not like to run on a non-AMD system, but I thought I could trick it into running anyway by sticking a driver next to it.
As you’ve probably gathered, I love AMD, but IT gave me an Nvidia card, and Dell cases are extremely cramped and difficult to manipulate with my big, fat, programmer-fingers. I really don’t want to have to grab a screwdriver, get down on my knees, open up the case, swap the card, and reboot, just to do static analysis of a shader.
Nor do I want to be bothered to switch machines. I have a perfectly good CPU right here with the code I want to look at already on it. Up until now, CodeXL had behaved itself just fine if I just stuck the AMD driver next to it, but for some reason, the new version no longer plays ball. And even the old version, which runs fine on my Intel laptop, still refuses to run on my work machine.
I could, and did, spend some time trying to appease it, but my efforts were futile, so I gave up and went to work on something else. But when I got home, the whole episode bugged me.
To be clear, I’m not blaming them for this. They designed their software a certain way and I’ve no right to expect it to work otherwise, but I still really want an offline compiler. Rather than trying to coerce CodeXL to do what its authors did not intend, I decided to search for another solution to my conundrum. It turns out that AMD’s driver contains the answer.
Dependency Walker is a fantastic tool. Open up a DLL, and it will tell you the names of every function exported by that DLL. The AMD DX11 driver, atidxx32.dll, contains two very interesting and useful entrypoints: AmdDxGsaCompileShader and AmdDxGsaFreeCompiledShader.
Since I was having trouble doing things the easy way, I decided to try the hard way. I suppose I could have just asked somebody how to do this, but I am a geek, and a stubborn male geek at that. I decided instead to try it the REALLY hard way, and I succeeded.
Step 1: The Debugger
The first order of business is to figure out exactly what CodeXL is doing. We know it’s calling into these driver functions at some point, but we don’t know how, and we don’t know what else it might be doing. Are these compile and free functions independent, or is there some setup call I need to make in order to use them? A few hours with WinDBG and I was able to figure out that it really is as simple and elegant as it looks. CodeXL loads the driver, makes one call to ‘Compile’ and another call to ‘Free’, and nothing else.
Step 2: The Disassembler
So, here’s the compile function:
atidxx32!AmdDxGsaCompileShader:
53581b30 55 push ebp
53581b31 8bec mov ebp,esp
53581b33 56 push esi
53581b34 8b7508 mov esi,dword ptr [ebp+8]
53581b37 57 push edi
53581b38 85f6 test esi,esi
53581b3a 746d je atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b3c 837e0800 cmp dword ptr [esi+8],0
53581b40 7467 je atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b42 837e1000 cmp dword ptr [esi+10h],0
53581b46 7506 jne atidxx32!AmdDxGsaCompileShader+0x1e (53581b4e)
53581b48 837e1400 cmp dword ptr [esi+14h],0
53581b4c 755b jne atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b4e 8b7d0c mov edi,dword ptr [ebp+0Ch]
53581b51 85ff test edi,edi
53581b53 7454 je atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b55 833f0c cmp dword ptr [edi],0Ch
53581b58 754f jne atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b5a 8b5604 mov edx,dword ptr [esi+4]
53581b5d 8b0e mov ecx,dword ptr [esi]
53581b5f 53 push ebx
53581b60 e87b000000 call atidxx32!AmdDxGsaFreeCompiledShader+0x20 (53581be0)
53581b65 8bd8 mov ebx,eax
53581b67 85db test ebx,ebx
53581b69 7434 je atidxx32!AmdDxGsaCompileShader+0x6f (53581b9f)
53581b6b ff7614 push dword ptr [esi+14h]
53581b6e 8bcb mov ecx,ebx
53581b70 ff7610 push dword ptr [esi+10h]
53581b73 e8b8000000 call atidxx32!AmdDxGsaFreeCompiledShader+0x70 (53581c30)
53581b78 85c0 test eax,eax
53581b7a 7512 jne atidxx32!AmdDxGsaCompileShader+0x5e (53581b8e)
53581b7c 8d4708 lea eax,[edi+8]
53581b7f 50 push eax
53581b80 8d4704 lea eax,[edi+4]
53581b83 50 push eax
53581b84 ff7608 push dword ptr [esi+8]
53581b87 8bcb mov ecx,ebx
53581b89 e892010000 call atidxx32!AmdDxGsaFreeCompiledShader+0x160 (53581d20)
53581b8e 8bf0 mov esi,eax
53581b90 8b03 mov eax,dword ptr [ebx]
53581b92 6a01 push 1
53581b94 8bcb mov ecx,ebx
53581b96 ff10 call dword ptr [eax]
53581b98 5b pop ebx
53581b99 5f pop edi
53581b9a 8bc6 mov eax,esi
53581b9c 5e pop esi
53581b9d 5d pop ebp
53581b9e c3 ret
53581b9f 5b pop ebx
53581ba0 5f pop edi
53581ba1 b805400080 mov eax,80004005h
53581ba6 5e pop esi
53581ba7 5d pop ebp
53581ba8 c3 ret
53581ba9 5f pop edi
53581baa b857000780 mov eax,80070057h
53581baf 5e pop esi
53581bb0 5d pop ebp
53581bb1 c3 ret
Ok, that’s a bit daunting, but it’s pretty short and self-contained, and there is a TON of defensive programming going on. Driver people are cautious types. Lucky for me, defensive programming gives me a lot of information. Let’s go through the compile function bit by bit:
53581b30 55 push ebp
53581b31 8bec mov ebp,esp
53581b33 56 push esi
53581b34 8b7508 mov esi,dword ptr [ebp+8]
53581b37 57 push edi
53581b38 85f6 test esi,esi
53581b3a 746d je atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
So, first thing we do is load a dword from ebp+8, and test it for zero. If it’s zero, we jump to 53581ba9, which does this:
53581ba9 5f pop edi
53581baa b857000780 mov eax,80070057h
53581baf 5e pop esi
53581bb0 5d pop ebp
53581bb1 c3 ret
This is returning an error code, and the error code is one of the standard COM HRESULTs: E_INVALIDARG. This tells us that ebp+8, the first argument, is probably a pointer, and that they don’t want it to be null. Let’s move on:
53581b3c 837e0800 cmp dword ptr [esi+8],0
53581b40 7467 je atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b42 837e1000 cmp dword ptr [esi+10h],0
53581b46 7506 jne atidxx32!AmdDxGsaCompileShader+0x1e (53581b4e)
53581b48 837e1400 cmp dword ptr [esi+14h],0
53581b4c 755b jne atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b4e 8b7d0c mov edi,dword ptr [ebp+0Ch]
53581b51 85ff test edi,edi
53581b53 7454 je atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b55 833f0c cmp dword ptr [edi],0Ch
53581b58 754f jne atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
Lots of information here. There are three more null checks at esi+8, esi+16, and esi+20. That second branch is jumping around the third one, so we’re probably doing short-circuit evaluation. The other two both send us down the E_INVALIDARG path.
If we survive, we read another pointer from the stack (our second argument) and check that for null. If that passes, we verify that the dword it points to has the value 12, and we bail out if it doesn’t.
Here is the logic we’ve pulled out so far:
if( !esi ) return INVALID_ARG;
if( ![esi+8] ) return INVALID_ARG;
if( ![esi+16] && [esi+20] ) return INVALID_ARG;
if( !edi || *edi != 12 ) return INVALID_ARG;
That test for 12 is an important clue. It tells us that they’re using ye olde ‘first member of the struct is the struct size’ idiom, which Microsoft used to use all over the place in the Windows API. We now know that the pointer in edi is pointing to a 12-byte struct. Sweet! Let’s continue:
53581b5a 8b5604 mov edx,dword ptr [esi+4]
53581b5d 8b0e mov ecx,dword ptr [esi]
53581b5f 53 push ebx
53581b60 e87b000000 call atidxx32!AmdDxGsaFreeCompiledShader+0x20 (53581be0)
53581b65 8bd8 mov ebx,eax
53581b67 85db test ebx,ebx
53581b69 7434 je atidxx32!AmdDxGsaCompileShader+0x6f (53581b9f)
53581b6b ff7614 push dword ptr [esi+14h]
53581b6e 8bcb mov ecx,ebx
53581b70 ff7610 push dword ptr [esi+10h]
53581b73 e8b8000000 call atidxx32!AmdDxGsaFreeCompiledShader+0x70 (53581c30)
Here’s where the interesting stuff begins. We’re reading two dwords from esi and then calling… something. It’s not actually AmdDxGsaFreeCompiledShader, that’s just the disassembler guessing based on the only symbol information it’s got. Whatever this function is, it returns a value which we then check for null, and on a null, we jump to our other error case, which is E_FAIL:
53581b9f 5b pop ebx
53581ba0 5f pop edi
53581ba1 b805400080 mov eax,80004005h
53581ba6 5e pop esi
53581ba7 5d pop ebp
53581ba8 c3 ret
If we survive, we then push two more things and do another call. It’s customary for C++ compilers to put ‘this’ in ecx for a method call, so the next few instructions are apparently invoking a method on that thing we got back from the first call, which was probably a factory method of some sort. Moving on:
53581b78 85c0 test eax,eax
53581b7a 7512 jne atidxx32!AmdDxGsaCompileShader+0x5e (53581b8e)
53581b7c 8d4708 lea eax,[edi+8]
53581b7f 50 push eax
53581b80 8d4704 lea eax,[edi+4]
53581b83 50 push eax
53581b84 ff7608 push dword ptr [esi+8]
53581b87 8bcb mov ecx,ebx
53581b89 e892010000 call atidxx32!AmdDxGsaFreeCompiledShader+0x160 (53581d20)
We’ve got another method call here, which receives two pointers: edi+8 and edi+4. But we skip over it if the first one returned a non-zero value.
53581b8e 8bf0 mov esi,eax
53581b90 8b03 mov eax,dword ptr [ebx]
53581b92 6a01 push 1
53581b94 8bcb mov ecx,ebx
53581b96 ff10 call dword ptr [eax]
After that we have a virtual call on our ebx object with no apparent return value. Maybe a virtual destructor? But then why would the ‘1’ be there? Meh, it’s not important. Whatever it is, it happens right before we return. The return value is in esi, but it gets moved to eax because that’s where x86 integer return values go.
53581b98 5b pop ebx
53581b99 5f pop edi
53581b9a 8bc6 mov eax,esi
53581b9c 5e pop esi
53581b9d 5d pop ebp
53581b9e c3 ret
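As an aside, both error values seen in this function follow the standard Windows HRESULT bit layout (severity in bit 31, facility in bits 16 to 26, code in the low 16 bits), which is how they can be recognized at a glance in a disassembly. The snippet below is a small illustrative sketch of that decoding; the names `HresultFields` and `DecodeHresult` are made up for this example and do not come from the driver or from Pyramid.

```cpp
#include <cassert>
#include <cstdint>

// Fields of a 32-bit HRESULT, following the standard Windows layout:
// bit 31 = severity (1 means failure), bits 16..26 = facility, bits 0..15 = code.
struct HresultFields {
    bool     failed;
    uint32_t facility;
    uint32_t code;
};

// Hypothetical helper: splits a raw HRESULT value into its three fields.
inline HresultFields DecodeHresult(uint32_t hr) {
    HresultFields f;
    f.failed   = (hr >> 31) != 0;
    f.facility = (hr >> 16) & 0x7FFu;
    f.code     = hr & 0xFFFFu;
    return f;
}
```

Fed the two constants from the listing, 80070057h decodes to a failure in facility 7 (Win32) with code 0x57, which is the invalid-parameter error behind E_INVALIDARG, and 80004005h decodes to a failure in facility 0 with code 0x4005, which is E_FAIL.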
Piecing all of that together, we can say that the code looks something like this:
DWORD AmdDxGsaCompileShader( esi, edi )
{
    if( !esi ) return INVALID_ARG;
    if( !*(esi+8) ) return INVALID_ARG;
    if( !*(esi+16) && *(esi+20) ) return INVALID_ARG;
    if( !edi || *edi != 12 ) return INVALID_ARG;
    object* pThing = ConstructThing( *(esi), *(esi+4) );
    if( !pThing ) return UNSPECIFIED_ERROR;
    return_code = pThing->Method( *(esi+16), *(esi+20) );
    if( return_code == 0 )
        return_code = pThing->OtherMethod( &*(edi+8), &*(edi+4), *(esi+8) );
    pThing->VirtualMethod( 1 );
    return return_code;
}
We’ve also learned a lot about the contents of esi and edi. Both are pointers to structures. The edi one is 12 bytes long, and contains ’12’ as its first value.
struct edi_struct
{
    DWORD size_of_struct; // 12
    DWORD d0;
    DWORD d1;
};
The esi one is more complex:
struct esi_struct
{
    DWORD d0;  // passed to 'ConstructThing'
    DWORD d4;  // passed to 'ConstructThing'
    DWORD d8;  // a pointer passed to 'OtherMethod'
    DWORD d12; // not used here...
    DWORD d16; // passed to 'Method'. If this is null, d20 must be zero
    DWORD d20; // passed to 'Method'
};
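To make the inferred contract concrete, here is a minimal sketch that restates the two structs with fixed-width fields and mirrors the four argument checks recovered from the disassembly. It is not the driver's code: `ValidateArgs`, `kOk` and `kInvalidArg` are hypothetical names, and pointers are modelled as 32-bit integers only because the driver in question is a 32-bit DLL.

```cpp
#include <cassert>
#include <cstdint>

// Same layouts as the structs above, with fixed-width fields so the sizes
// can be checked at compile time on any platform.
struct esi_struct {
    uint32_t d0;   // passed to the factory call
    uint32_t d4;   // passed to the factory call
    uint32_t d8;   // pointer, must be non-null
    uint32_t d12;  // not touched by the checks
    uint32_t d16;  // pointer, may be null...
    uint32_t d20;  // ...but then this must be zero
};

struct edi_struct {
    uint32_t size_of_struct;  // must be 12
    uint32_t d0;
    uint32_t d1;
};

static_assert(sizeof(esi_struct) == 24, "six dwords");
static_assert(sizeof(edi_struct) == 12, "matches the value the driver tests for");

const uint32_t kOk         = 0;
const uint32_t kInvalidArg = 0x80070057u;  // E_INVALIDARG

// Hypothetical helper: reproduces the checks at the top of the compile function.
inline uint32_t ValidateArgs(const esi_struct* in, const edi_struct* out) {
    if (!in) return kInvalidArg;
    if (in->d8 == 0) return kInvalidArg;
    if (in->d16 == 0 && in->d20 != 0) return kInvalidArg;
    if (!out || out->size_of_struct != sizeof(edi_struct)) return kInvalidArg;
    return kOk;
}
```

Writing the checks out this way makes the one asymmetric rule easy to see: d16 is allowed to be null, but only when d20 is zero.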
And here is AmdDxGsaFreeCompiledShader:
atidxx32!AmdDxGsaFreeCompiledShader:
53581bc0 55 push ebp
53581bc1 8bec mov ebp,esp
53581bc3 ff7508 push dword ptr [ebp+8]
53581bc6 a19044dd53 mov eax,dword ptr [atidxx32!AmdDxExtCreate+0x3a95e0 (53dd4490)]
53581bcb 6a00 push 0
53581bcd ff7004 push dword ptr [eax+4]
53581bd0 ff156c60a453 call dword ptr [atidxx32!AmdDxExtCreate+0x1b1bc (53a4606c)]
53581bd6 5d pop ebp
53581bd7 c3 ret
It’s not really necessary to pick this one apart. The important things are that it doesn’t appear to return anything and appears to take only one argument. It should be easy to figure out what that argument is.
Step 3: A Shim DLL
At this point, we’ve learned about all we can by inspecting the code. We know enough about the function signatures to write a wrapper DLL that will let us see what’s being passed. All we have to do is make a DLL that exports these two functions, name it atidxx32.dll, and put it someplace where CodeXL will find it. We then rename the real atidxx32 to something like ‘real_driver.dll’, and have our shim DLL load it and forward the calls. Everything is done with pointers, so as long as we make sure to pass them through correctly, we can do whatever we like in the meantime. This lets us do things like logging and setting breakpoints. For example, we can do this:
#define CALL __cdecl // must be defined before the typedefs that use it

typedef DWORD (CALL *COMPILE_SHADER)( esi_struct*, edi_struct* );
typedef void (CALL *FREE_SHADER) (void*);

static COMPILE_SHADER g_pCompileShader = 0; // initialized by a DLLMain
static FREE_SHADER g_pFreeShader = 0;

__declspec(dllexport) DWORD CALL AmdDxGsaCompileShader( esi_struct* esi, edi_struct* edi )
{
    HRESULT h;
    Log("There is a stack variable at: 0x%08x\n",&h);
    Log("Before:\n");
    esi->Print();
    edi->Print();
    h = g_pCompileShader(esi,edi); // forward to the real driver
    Log("After:\n");
    esi->Print();
    edi->Print();
    return h;
}
Then we run CodeXL and have it compile a bunch of different shaders with a bunch of different asics. We get a whole bunch of results like this:
There is a stack variable at: 0x0028dc44
Before:
esi_struct at: 0x0028ded8 { 0x0000006e 0x00000029 0x003e7570 0x00000040 0x0028dec0 0x00000000 }
edi_struct at (0x0028decc): { 0x0000000c 0x00000000 0x00000000 }
After:
esi_struct at: 0x0028ded8 { 0x0000006e 0x00000029 0x003e7570 0x00000040 0x0028dec0 0x00000000 }
edi_struct at (0x0028decc): { 0x0000000c 0x0092ca70 0x000001b8 }
Now we look for patterns in the data. First off, CodeXL is always putting its esi/edi structs right next to one another. The edi_struct is obviously being filled with a pointer and size. That’s where our output is going. Furthermore, the pointer in the edi_struct is the one that’s eventually passed to FreeCompiledShader later on, so that part’s taken care of.
The first dword of the esi_struct appears to be identifying the asic family. It is always 110 for chips that the AMD disassembler thinks of as asic(SI). It is always 120 for asic(CI). By looping over a large space of possible values, I was able to figure out that 125, 130, and 135 are also accepted by the driver, but I’m not exactly sure what these are. One thing is certain: 130 and 135 are not using the same instruction coding as the others.
The second dword is also constant for a given asic, but changes for asics within the same asic family, and unlike the first one, the driver is perfectly happy to accept whatever nonsense I pass in (even zero IIRC). I haven’t gone looking for the meaning of this value yet; I figured it’s enough just to pass the values I saw.
The third and fourth are clearly a pointer and a size. That’s where our D3D bytecode is. The last two parameters are a pointer to somewhere on the stack, and a zero. That stack pointer points just 12 bytes before the start of the edi_struct. It might point to something interesting, but I’ve found that I can null it out and the driver is perfectly content, as long as that last dword remains zero. My guess is that CodeXL is just leaving it uninitialized (as well they should, since it doesn’t have to be).
The input still takes a little more work. We know where our bytecode pointer is, but it turns out that it doesn’t point to the same place we got back from the D3D compiler. The D3D bytecode format is documented in d3d11tokenizedprogramformat.hpp, which is part of the Windows Driver Kit, but what they don’t tell you is where in the blob to look for it. By adding a D3D blob hex dump to Pyramid, I was able to figure out that the blob is just a series of sections identified by magic FourCC codes. It turns out that I’m not the only person who has figured this out. See here, for example. The pointer that CodeXL passes to the driver is located at the beginning of a section named ‘SHDR’ or ‘SHEX’, depending on the shader model. It’s very easy to see this if you just look at an ASCII version of the blob in that hex view. The section headers stand out quite clearly.
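The lookup described above can be sketched in a few lines. This is an illustration under stated assumptions, not Pyramid's actual code: it does a plain byte-wise scan for the ‘SHDR’ or ‘SHEX’ tag and reads the little-endian DWORD that follows it as the size. `ShaderChunk` and `FindShaderChunk` are hypothetical names, and a scan like this could in principle match the same four bytes inside unrelated data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Result of the search. 'tag_offset' is where the FourCC starts in the blob,
// 'size' is the DWORD stored immediately after the FourCC.
struct ShaderChunk {
    bool     found;
    size_t   tag_offset;
    uint32_t size;
};

// Hypothetical helper: scans a compiled D3D blob for the first 'SHDR' or
// 'SHEX' tag and returns its position plus the size DWORD next to it.
inline ShaderChunk FindShaderChunk(const unsigned char* blob, size_t blob_size) {
    ShaderChunk result = { false, 0, 0 };
    if (!blob || blob_size < 8) return result;
    for (size_t i = 0; i + 8 <= blob_size; ++i) {
        if (std::memcmp(blob + i, "SHDR", 4) != 0 &&
            std::memcmp(blob + i, "SHEX", 4) != 0) {
            continue;
        }
        // Assemble the little-endian DWORD byte by byte to avoid unaligned reads.
        result.found      = true;
        result.tag_offset = i;
        result.size       = uint32_t(blob[i + 4]) |
                            (uint32_t(blob[i + 5]) << 8) |
                            (uint32_t(blob[i + 6]) << 16) |
                            (uint32_t(blob[i + 7]) << 24);
        return result;
    }
    return result;
}
```

Whether the pointer handed to the driver should be the tag itself or the data that follows the tag and size is something to confirm against the shim logs rather than take from this sketch.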
Next, let’s try and figure out what the output looks like. This was, surprisingly, the easiest part. Looking at that address in the memory dump, the first thing I see are the letters ‘E’, ‘L’, ‘F’. ELF stands for Executable and Linkable Format. It’s a widely used format for executables and for feeding object code to linkers. Now that I know that they’re returning an ELF, I just need to grab some code from the internet and start examining it. Their ELFs are well formed, and contain a section called ‘.text’, which contains the exact same bytes that CodeXL shows me in its disassembler.
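For readers who would rather not grab code from the internet, locating a named section takes only a handful of reads from the ELF header and section header table. The sketch below assumes a 32-bit little-endian ELF, which is a guess on my part given the 32-bit driver and is not something the post confirms. `Section` and `FindElf32Section` are hypothetical names; the header offsets are the standard Elf32_Ehdr and Elf32_Shdr ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Section { const unsigned char* data; size_t size; };

static inline uint16_t Read16(const unsigned char* p) {
    return uint16_t(p[0] | (p[1] << 8));
}
static inline uint32_t Read32(const unsigned char* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Hypothetical helper: returns the bytes of the section called 'name',
// or {nullptr, 0} if the image is not a well-formed ELF32 or has no such section.
inline Section FindElf32Section(const unsigned char* elf, size_t size, const char* name) {
    Section none = { nullptr, 0 };
    if (!elf || size < 52) return none;                        // Elf32_Ehdr is 52 bytes
    if (std::memcmp(elf, "\x7f" "ELF", 4) != 0) return none;   // magic
    if (elf[4] != 1 || elf[5] != 1) return none;               // 32-bit, little-endian
    uint32_t shoff     = Read32(elf + 32);                     // e_shoff
    uint16_t shentsize = Read16(elf + 46);                     // e_shentsize
    uint16_t shnum     = Read16(elf + 48);                     // e_shnum
    uint16_t shstrndx  = Read16(elf + 50);                     // e_shstrndx
    if (shentsize < 40 || shstrndx >= shnum) return none;
    if (shoff > size || uint64_t(shnum) * shentsize > size - shoff) return none;

    // The section name string table is itself a section.
    const unsigned char* strhdr = elf + shoff + size_t(shstrndx) * shentsize;
    uint32_t stroff  = Read32(strhdr + 16);                    // sh_offset
    uint32_t strsize = Read32(strhdr + 20);                    // sh_size
    if (stroff > size || strsize > size - stroff) return none;

    size_t name_len = std::strlen(name);
    for (uint16_t i = 0; i < shnum; ++i) {
        const unsigned char* sh = elf + shoff + size_t(i) * shentsize;
        uint32_t name_off = Read32(sh);                        // sh_name
        if (name_off >= strsize || name_len + 1 > strsize - name_off) continue;
        // Compare including the terminating NUL so ".text" does not match ".textual".
        if (std::memcmp(elf + stroff + name_off, name, name_len + 1) != 0) continue;
        uint32_t off = Read32(sh + 16), sz = Read32(sh + 20);
        if (off > size || sz > size - off) return none;
        Section s = { elf + off, sz };
        return s;
    }
    return none;
}
```

Calling it with the pointer and size from the edi_struct and the name ".text" would, if the 32-bit assumption holds, yield the ISA bytes that are ready for disassembly.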
Now that we know what the input and outputs look like, there’s one more thing we need to check on. There is still the possibility that the esi_struct is larger than we think it is. In order to make certain that we’ve got the signature correct, let’s surround our structs in a sea of chaos:
static HRESULT Test( esi_struct* esi, edi_struct* edi )
{
    const size_t PAD_SIZE = 2048;
    unsigned char* p = (unsigned char*) malloc( PAD_SIZE );
    for( size_t i=0; i<PAD_SIZE; i++ )
        p[i] = rand();

    size_t offs = rand() % (PAD_SIZE-(sizeof(edi_struct)+sizeof(esi_struct)));
    edi_struct* new_edi = (edi_struct*) (p+offs);
    esi_struct* new_esi = (esi_struct*) (new_edi+1);
    *new_edi = *edi;
    *new_esi = *esi;

    HRESULT h = g_pCompileShader(new_esi,new_edi);
    if( !SUCCEEDED(h) )
        Log(" FAIL\n");

    *edi = *new_edi;
    free(p);
    return h;
}

__declspec(dllexport) DWORD CALL AmdDxGsaCompileShader( esi_struct* esi, edi_struct* edi )
{
    edi_struct first_one = *edi;
    g_pCompileShader(esi,&first_one);
    for( int i=0; i<10000; i++ )
    {
        edi_struct current = *edi;
        Test(esi,&current);
        if( current.nDataSize != first_one.nDataSize ||
            memcmp(current.pHeader,first_one.pHeader,current.nDataSize) != 0 )
            Log("Results differ\n");
        g_pFreeShader( current.pHeader);
    }
    *edi = first_one;
    return S_OK;
}
That worked, so I’m pretty confident there are no surprises waiting for us.
Step 4: Profit!
We now have everything we need to get shader code directly out of the driver. We first compile our blob. Then we search it for SHDR or SHEX. Then we read the bytecode size from the DWORD next to that. Then we load the driver DLL, GetProcAddress for AmdDxGsaCompileShader, fill out our esi and edi structs, and do the call. We get back an ELF binary, locate its ‘.text’ section, and there’s our ISA code, all ready to be disassembled.
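The load-and-call part of that recipe might look roughly like the sketch below. It is a reconstruction from the findings in this post, not Pyramid's source and not a documented AMD interface: the struct and field names, `MakeCompileInput` and `CompileWithDriver` are all invented here, the `__cdecl` convention follows the shim typedefs above, and the layout only lines up with the 32-bit driver when the code is built as a 32-bit Windows program.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Input layout as inferred in the text. Field names are invented for this sketch.
struct CompileInput {
    uint32_t    asic_family;    // first dword, e.g. 110 or 120 in the logs
    uint32_t    asic_id;        // second dword, constant for a given chip
    const void* bytecode;       // SHDR/SHEX pointer, as CodeXL passes it
    uint32_t    bytecode_size;  // size DWORD read from the blob
    const void* unknown_ptr;    // can be null...
    uint32_t    unknown_zero;   // ...provided this stays zero
};

// Output layout: 12 bytes in a 32-bit build.
struct CompileOutput {
    uint32_t size_of_struct;    // the driver insists on 12
    void*    elf;               // set by the driver, released with the Free call
    uint32_t elf_size;          // set by the driver
};

// Hypothetical helper: fills the input struct the way the logged calls did.
inline CompileInput MakeCompileInput(uint32_t family, uint32_t id,
                                     const void* bytecode, uint32_t size) {
    CompileInput in = { family, id, bytecode, size, nullptr, 0 };
    return in;
}

#ifdef _WIN32
typedef uint32_t (__cdecl *CompileShaderFn)(CompileInput*, CompileOutput*);
typedef void     (__cdecl *FreeShaderFn)(void*);

// Loads the driver, compiles once, hands the ELF to 'consume', then frees it.
// The DLL stays loaded until after the Free call because the driver owns the buffer.
template <typename Consumer>
bool CompileWithDriver(const char* dll_path, CompileInput in, Consumer consume) {
    if (sizeof(void*) != 4) return false;  // layout only matches a 32-bit build
    HMODULE dll = LoadLibraryA(dll_path);
    if (!dll) return false;
    CompileShaderFn compile = reinterpret_cast<CompileShaderFn>(
        GetProcAddress(dll, "AmdDxGsaCompileShader"));
    FreeShaderFn release = reinterpret_cast<FreeShaderFn>(
        GetProcAddress(dll, "AmdDxGsaFreeCompiledShader"));
    bool ok = false;
    if (compile && release) {
        CompileOutput out = { 12, nullptr, 0 };
        uint32_t hr = compile(&in, &out);
        if (static_cast<int32_t>(hr) >= 0 && out.elf) {  // same test as SUCCEEDED()
            consume(out.elf, out.elf_size);
            release(out.elf);
            ok = true;
        }
    }
    FreeLibrary(dll);
    return ok;
}
#endif
```

Passing the ELF to a callback keeps ownership simple: the caller copies or disassembles the bytes inside the callback, and the driver's buffer is released before the DLL is unloaded.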
All that’s left to do now is write a GCN disassembler. AMD has been kind enough to publish extensive documentation about their instruction set, so it’s a simple matter. Well, simple in one sense. It’s a very large instruction set, and there are a few errata in the docs (for example, the VOP3 forms of VOPC opcodes are apparently VOP3A, not VOP3B as the docs claim). Still, considering how big a document this is, it was remarkably straightforward to do what I needed with it. The most difficult part of this process was copy-pasting all of the opcodes from the PDFs into a source file.
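To give a flavour of what the disassembler's inner loop involves, here is a minimal sketch that recognizes one instruction format, SOPP, and names a few of its opcodes. It is not Pyramid's disassembler. `DecodeSopp` is a hypothetical name, and the encoding (fixed pattern 1_0111_1111 in bits 31 to 23, opcode in bits 22 to 16, 16-bit immediate below that) and the opcode numbers are quoted from my reading of AMD's public ISA documents, so check them against the document for your target chip.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: decodes one 32-bit word if it uses the SOPP encoding.
// Returns an empty string for words that belong to some other format.
inline std::string DecodeSopp(uint32_t word) {
    if ((word >> 23) != 0x17Fu) return std::string();  // not SOPP
    unsigned op     = (word >> 16) & 0x7Fu;
    unsigned simm16 = word & 0xFFFFu;

    const char* name = nullptr;
    bool has_operand = true;
    switch (op) {
        case 0:  name = "s_nop";     break;
        case 1:  name = "s_endpgm";  has_operand = false; break;
        case 2:  name = "s_branch";  break;
        case 10: name = "s_barrier"; has_operand = false; break;
        case 12: name = "s_waitcnt"; break;
        default: break;  // the real table is much longer
    }

    char buf[64];
    if (!name)
        std::snprintf(buf, sizeof(buf), "sopp_op_%u 0x%04x", op, simm16);
    else if (has_operand)
        std::snprintf(buf, sizeof(buf), "%s 0x%04x", name, simm16);
    else
        std::snprintf(buf, sizeof(buf), "%s", name);
    return std::string(buf);
}
```

A full disassembler repeats this pattern for every encoding family, which is why the opcode tables end up being the bulk of the work.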
Since I don’t ever intend to write a corresponding assembler, I tried to make my disassembly a bit more elaborate, and include visual hints about what’s going on. You can grab the latest Pyramid and give it a try.
Nice article!
I’ve got a question about the shim DLL though.
My knowledge about this is limited, but don’t you need to export the rest of the atidxx32.dll exports as well? Or are Compile/Free the only functions that CodeXL imports?
It only needs the ones that CodeXL loads. CodeXL is dynamically loading the DLL and calling on these functions, so they’re enough.
Nice article! About the virtual call passing 1, it’s probably a scalar deleting destructor generated by MSVC. It looks something like:
virtual void * A::`scalar deleting destructor'(uint flags)
{
this->~A();
if (flags&1) A::operator delete(this);
return this;
}
This is some awesome work.