Fine! I’ll Go Build My Own Disassembler! With BlackJack, and…

The latest Pyramid now contains an option to compile DX shaders directly against the AMD driver, and comes equipped with its own disassembler. The disassembler is modular enough that it should be able to be ripped out and re-used by anybody who needs one.

I accomplished this by tinkering on my own time and with my own resources. I received no inside help. This is obviously not something that is officially sanctioned, endorsed or supported by AMD. In fact, I suspect that it will irritate a few people, and I apologize if you’re one of those. It is not my intention to do harm, only to better equip myself and my users.

The purpose of this post is to describe exactly how I pulled this off. I have three objectives:

  1. To entertain the reader with my tale of nerdery.
  2. To demonstrate that this feat could have been (and was) accomplished using only publicly available tools and information.
  3. To provide some more incentive for the Blue and Green Teams to follow the Red Team’s example and provide something similar.

Green Team! Blue Team! Seriously guys, I will do the same for you. It is in your best interests for devs to study your hardware as much as they already study GCN.  Two functions and a spec are all I need.

Step 0: The Motivation

A while back, I was working on some shaders at work, and was like, “I know, I’ll use Pyramid Shader Analyzer” to see the generated GCN assembly. That way I’ll be able to gauge how the shader runs on AMD cards, thus giving AMD an important tactical advantage over its competition (hint, hint).

So I pulled it down, and was disappointed to learn that it does not want to function on my Nvidia machine. As you may recall, Pyramid is a thin wrapper around AMD CodeXL. CodeXL does not like to run on a non-AMD system, but I thought I could trick it into running anyway by sticking a driver next to it.

As you’ve probably gathered, I love AMD, but IT gave me an Nvidia card, and Dell cases are extremely cramped and difficult to manipulate with my big, fat, programmer-fingers. I really don’t want to have to grab a screwdriver, get down on my knees, open up the case, swap the card, and reboot, just to do static analysis of a shader.

Nor do I want to be bothered to switch machines. I have a perfectly good CPU right here with the code I want to look at already on it. Up until now, CodeXL had behaved itself just fine if I just stuck the AMD driver next to it, but for some reason, the new version no longer plays ball. And even the old version, which runs fine on my Intel laptop, still refuses to run on my work machine.

I could, and did, spend some time trying to appease it, but my efforts were futile, so I gave up and went to work on something else.  But when I got home, the whole episode bugged me.

To be clear, I’m not blaming them for this. They designed their software a certain way and I’ve no right to expect it to work otherwise, but I still really want an offline compiler. Rather than trying to coerce CodeXL to do what its authors did not intend, I decided to search for another solution to my conundrum. It turns out that AMD’s driver contains the answer.

Dependency walker is a fantastic tool. Open up a DLL, and it will tell you the names of every function exported by that DLL. The AMD DX11 driver, atidxx32.dll, contains two very interesting and useful entrypoints:

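[Image: Dependency Walker listing the driver’s exports, including AmdDxGsaCompileShader and AmdDxGsaFreeCompiledShader]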

Since I was having trouble doing things the easy way, I decided to try the hard way. I suppose I could have just asked somebody how to do this, but I am a geek, and a stubborn male geek at that. I decided instead to try it the REALLY hard way, and I succeeded.

Step 1:  The Debugger

The first order of business is to figure out exactly what CodeXL is doing. We know it’s calling into these driver functions at some point, but we don’t know how, and we don’t know what else it might be doing. Are these compile and free functions independent, or is there some setup call I need to make in order to use them? A few hours with WinDBG and I was able to figure out that it really is as simple and elegant as it looks. CodeXL loads the driver, makes one call to ‘Compile’ and another call to ‘Free’, and nothing else.

Step 2:  The Disassembler

So, here’s the compile function:

atidxx32!AmdDxGsaCompileShader:
53581b30 55              push    ebp
53581b31 8bec            mov     ebp,esp
53581b33 56              push    esi
53581b34 8b7508          mov     esi,dword ptr [ebp+8]
53581b37 57              push    edi
53581b38 85f6            test    esi,esi              
53581b3a 746d            je      atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b3c 837e0800        cmp     dword ptr [esi+8],0  
53581b40 7467            je      atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b42 837e1000        cmp     dword ptr [esi+10h],0  
53581b46 7506            jne     atidxx32!AmdDxGsaCompileShader+0x1e (53581b4e)
53581b48 837e1400        cmp     dword ptr [esi+14h],0
53581b4c 755b            jne     atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b4e 8b7d0c          mov     edi,dword ptr [ebp+0Ch]
53581b51 85ff            test    edi,edi                 
53581b53 7454            je      atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b55 833f0c          cmp     dword ptr [edi],0Ch
53581b58 754f            jne     atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b5a 8b5604          mov     edx,dword ptr [esi+4]
53581b5d 8b0e            mov     ecx,dword ptr [esi]
53581b5f 53              push    ebx
53581b60 e87b000000      call    atidxx32!AmdDxGsaFreeCompiledShader+0x20 (53581be0)
53581b65 8bd8            mov     ebx,eax  
53581b67 85db            test    ebx,ebx  
53581b69 7434            je      atidxx32!AmdDxGsaCompileShader+0x6f (53581b9f)
53581b6b ff7614          push    dword ptr [esi+14h]
53581b6e 8bcb            mov     ecx,ebx   
53581b70 ff7610          push    dword ptr [esi+10h]
53581b73 e8b8000000      call    atidxx32!AmdDxGsaFreeCompiledShader+0x70 (53581c30)
53581b78 85c0            test    eax,eax  
53581b7a 7512            jne     atidxx32!AmdDxGsaCompileShader+0x5e (53581b8e)
53581b7c 8d4708          lea     eax,[edi+8]
53581b7f 50              push    eax
53581b80 8d4704          lea     eax,[edi+4]
53581b83 50              push    eax
53581b84 ff7608          push    dword ptr [esi+8]
53581b87 8bcb            mov     ecx,ebx
53581b89 e892010000      call    atidxx32!AmdDxGsaFreeCompiledShader+0x160 (53581d20)  
53581b8e 8bf0            mov     esi,eax
53581b90 8b03            mov     eax,dword ptr [ebx] 
53581b92 6a01            push    1
53581b94 8bcb            mov     ecx,ebx
53581b96 ff10            call    dword ptr [eax]   
53581b98 5b              pop     ebx
53581b99 5f              pop     edi
53581b9a 8bc6            mov     eax,esi
53581b9c 5e              pop     esi
53581b9d 5d              pop     ebp
53581b9e c3              ret
53581b9f 5b              pop     ebx
53581ba0 5f              pop     edi
53581ba1 b805400080      mov     eax,80004005h
53581ba6 5e              pop     esi
53581ba7 5d              pop     ebp
53581ba8 c3              ret
53581ba9 5f              pop     edi
53581baa b857000780      mov     eax,80070057h
53581baf 5e              pop     esi
53581bb0 5d              pop     ebp
53581bb1 c3              ret

Ok, that’s a bit daunting, but it’s pretty short and self-contained, and there is a TON of defensive programming going on. Driver people are cautious types. Lucky for me, defensive programming gives me a lot of information. Let’s go through the compile function bit by bit:

53581b30 55              push    ebp
53581b31 8bec            mov     ebp,esp
53581b33 56              push    esi
53581b34 8b7508          mov     esi,dword ptr [ebp+8]
53581b37 57              push    edi
53581b38 85f6            test    esi,esi      
53581b3a 746d            je      atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)

So, first thing we do is load a dword from ebp+8, and test it for zero. If it’s zero, we jump to 53581ba9, which does this:

53581ba9 5f              pop     edi
53581baa b857000780      mov     eax,80070057h
53581baf 5e              pop     esi
53581bb0 5d              pop     ebp
53581bb1 c3              ret

This is returning an error code, and the error code is one of the standard COM HRESULTs: E_INVALIDARG. This tells us that ebp+8, the first argument, is probably a pointer, and that they don’t want it to be null. Let’s move on:

53581b3c 837e0800        cmp     dword ptr [esi+8],0  
53581b40 7467            je      atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b42 837e1000        cmp     dword ptr [esi+10h],0  
53581b46 7506            jne     atidxx32!AmdDxGsaCompileShader+0x1e (53581b4e)
53581b48 837e1400        cmp     dword ptr [esi+14h],0
53581b4c 755b            jne     atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b4e 8b7d0c          mov     edi,dword ptr [ebp+0Ch]
53581b51 85ff            test    edi,edi                 
53581b53 7454            je      atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)
53581b55 833f0c          cmp     dword ptr [edi],0Ch
53581b58 754f            jne     atidxx32!AmdDxGsaCompileShader+0x79 (53581ba9)

Lots of information here. There are three more null checks at esi+8, esi+16, and esi+20. That second branch is jumping around the third one, so we’re probably doing short-circuit evaluation. The other two both send us down the E_INVALIDARG path.

If we survive, we read another pointer from the stack (our second argument) and check that for null. If that passes, we verify that the dword it points to has the value 12, and we bail out if it doesn’t.

Here is the logic we’ve pulled out so far:

   if( !esi )     return INVALID_ARG;
   if( ![esi+8] ) return INVALID_ARG;
   if( ![esi+16] && [esi+20] ) return INVALID_ARG;  
   if( !edi || *edi != 12 ) return INVALID_ARG;

That test for 12 is an important clue. It tells us that they’re using ye olde ‘first member of the struct is the struct size’ idiom, which Microsoft used to use all over the place in the Windows API. We now know that the pointer in edi is pointing to a 12-byte struct. Sweet! Let’s continue:

53581b5a 8b5604          mov     edx,dword ptr [esi+4]
53581b5d 8b0e            mov     ecx,dword ptr [esi]
53581b5f 53              push    ebx
53581b60 e87b000000      call    atidxx32!AmdDxGsaFreeCompiledShader+0x20 (53581be0)
53581b65 8bd8            mov     ebx,eax 
53581b67 85db            test    ebx,ebx  
53581b69 7434            je      atidxx32!AmdDxGsaCompileShader+0x6f (53581b9f)
53581b6b ff7614          push    dword ptr [esi+14h]
53581b6e 8bcb            mov     ecx,ebx   
53581b70 ff7610          push    dword ptr [esi+10h]
53581b73 e8b8000000      call    atidxx32!AmdDxGsaFreeCompiledShader+0x70 (53581c30)

Here’s where the interesting stuff begins. We’re reading two dwords from esi and then calling… something. It’s not actually AmdDxGsaFreeCompiledShader; that’s just the disassembler guessing based on the only symbol information it’s got. Whatever this function is, it returns a value which we then check for null, and on a null, we jump to our other error case, which is E_FAIL:

53581b9f 5b              pop     ebx
53581ba0 5f              pop     edi
53581ba1 b805400080      mov     eax,80004005h
53581ba6 5e              pop     esi
53581ba7 5d              pop     ebp
53581ba8 c3              ret

If we survive, we then push two more things and do another call. It’s customary for C++ compilers to put ‘this’ in ecx for a method call, so the next few instructions are apparently invoking a method on that thing we got back from the first call, which was probably a factory method of some sort. Moving on:

53581b78 85c0            test    eax,eax  
53581b7a 7512            jne     atidxx32!AmdDxGsaCompileShader+0x5e (53581b8e)
53581b7c 8d4708          lea     eax,[edi+8]
53581b7f 50              push    eax
53581b80 8d4704          lea     eax,[edi+4]
53581b83 50              push    eax
53581b84 ff7608          push    dword ptr [esi+8]
53581b87 8bcb            mov     ecx,ebx
53581b89 e892010000      call    atidxx32!AmdDxGsaFreeCompiledShader+0x160 (53581d20)

We’ve got another method call here, which receives the dword at esi+8 along with two pointers: edi+4 and edi+8. But we skip over it if the first one returned a non-zero value.

53581b8e 8bf0            mov     esi,eax
53581b90 8b03            mov     eax,dword ptr [ebx] 
53581b92 6a01            push    1
53581b94 8bcb            mov     ecx,ebx
53581b96 ff10            call    dword ptr [eax]

After that we have a virtual call on our ebx object with no apparent return value. Maybe a virtual destructor? But then why would the ‘1’ be there? Meh, it’s not important. Whatever it is, it happens right before we return. The return value is in esi, but it gets moved to eax because that’s where x86 integer return values go.

53581b98 5b              pop     ebx
53581b99 5f              pop     edi
53581b9a 8bc6            mov     eax,esi
53581b9c 5e              pop     esi
53581b9d 5d              pop     ebp
53581b9e c3              ret
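
As an aside, the two constants returned on the error paths can be sanity-checked by splitting them into the standard HRESULT bit fields. This is only a small sketch; the names `HresultParts` and `SplitHresult` are made up for illustration and do not come from any Windows header.

```cpp
#include <cassert>
#include <cstdint>

// HRESULT layout: bit 31 = severity (1 means failure),
// bits 16..26 = facility, bits 0..15 = code.
struct HresultParts {
    bool     failed;
    uint32_t facility;
    uint32_t code;
};

// Hypothetical helper: split a raw 32-bit HRESULT into its fields.
inline HresultParts SplitHresult(uint32_t hr) {
    HresultParts p;
    p.failed   = (hr >> 31) != 0;
    p.facility = (hr >> 16) & 0x7FFu;
    p.code     = hr & 0xFFFFu;
    return p;
}
```

Fed 80070057h, this yields facility 7 (FACILITY_WIN32) and code 57h (ERROR_INVALID_PARAMETER), which is how E_INVALIDARG is built. 80004005h has facility 0 and code 4005h, which is E_FAIL.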

Piecing all of that together, we can say that the code looks something like this:

DWORD AmdDxGsaCompileShader( esi, edi )
{
   if( !esi )     return INVALID_ARG;
   if( !*(esi+8) ) return INVALID_ARG;
   if( !*(esi+16) && *(esi+20) ) return INVALID_ARG;
   if( !edi || *edi != 12 ) return INVALID_ARG;
 
   object* pThing = ConstructThing( *(esi), *(esi+4) ); 
   if( !pThing )  
        return UNSPECIFIED_ERROR;
 
   return_code = pThing->Method( *(esi+16), *(esi+20) );
   if( return_code == 0  ) 
       return_code = pThing->OtherMethod( *(esi+8), &*(edi+4), &*(edi+8) );  // arguments are pushed right to left
 
   pThing->VirtualMethod( 1 ); 
 
   return return_code;
}

We’ve also learned a lot about the contents of esi and edi. Both are pointers to structures. The edi one is 12 bytes long, and contains ‘12’ as its first value.

struct edi_struct
{
   DWORD size_of_struct; // 12
   DWORD d0;
   DWORD d1;
};

The esi one is more complex:

struct esi_struct
{
  DWORD d0;   // passed to 'ConstructThing'
  DWORD d4;   // passed to 'ConstructThing'
  DWORD d8;   // a pointer passed to 'OtherMethod'
  DWORD d12;  // not used here...
  DWORD d16;  // passed to 'Method'  If this is null, d20 must be zero
  DWORD d20;  // passed to 'Method'  
};
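
To double-check the offsets, the two layouts and the argument checks can be written down as compilable C++. This is a sketch under assumptions: the type and field names (`EsiStruct32`, `EdiStruct32`, `ValidateArgs`) are placeholders, fixed-width integers stand in for the 32-bit driver’s DWORDs and pointers, and only the checks at the top of the function are mirrored, not the compile itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-width mirrors of the two structs as the 32-bit driver sees them.
// Field names are placeholders; only the offsets come from the disassembly.
struct EdiStruct32 {
    uint32_t size_of_struct;  // must be 12
    uint32_t d4;
    uint32_t d8;
};

struct EsiStruct32 {
    uint32_t d0, d4, d8, d12, d16, d20;
};

static_assert(sizeof(EdiStruct32) == 12, "the driver compares [edi] against 0Ch");
static_assert(offsetof(EsiStruct32, d8)  == 0x08, "checked at [esi+8]");
static_assert(offsetof(EsiStruct32, d16) == 0x10, "checked at [esi+10h]");
static_assert(offsetof(EsiStruct32, d20) == 0x14, "checked at [esi+14h]");

const uint32_t kInvalidArg = 0x80070057u;  // E_INVALIDARG
const uint32_t kOk         = 0u;           // S_OK

// Mirrors only the argument validation seen in the disassembly.
inline uint32_t ValidateArgs(const EsiStruct32* esi, const EdiStruct32* edi) {
    if (!esi)                              return kInvalidArg;
    if (esi->d8 == 0)                      return kInvalidArg;
    if (esi->d16 == 0 && esi->d20 != 0)    return kInvalidArg;
    if (!edi || edi->size_of_struct != 12) return kInvalidArg;
    return kOk;
}
```

Note the third check: d20 is only rejected when d16 is null, which matches the branch that jumps around the esi+14h comparison.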

And here is AmdDxGsaFreeCompiledShader:

atidxx32!AmdDxGsaFreeCompiledShader:
53581bc0 55              push    ebp
53581bc1 8bec            mov     ebp,esp
53581bc3 ff7508          push    dword ptr [ebp+8]
53581bc6 a19044dd53      mov     eax,dword ptr [atidxx32!AmdDxExtCreate+0x3a95e0 (53dd4490)]
53581bcb 6a00            push    0
53581bcd ff7004          push    dword ptr [eax+4]
53581bd0 ff156c60a453    call    dword ptr [atidxx32!AmdDxExtCreate+0x1b1bc (53a4606c)]
53581bd6 5d              pop     ebp
53581bd7 c3              ret

It’s not really necessary to pick this one apart. The important things are that it doesn’t appear to return anything, and only appears to take one argument. It should be easy to figure out what that argument is.

Step 3: A Shim DLL

At this point, we’ve learned about all we can by inspecting the code. We know enough about the function signatures to write a wrapper DLL that will let us see what’s being passed. All we have to do is make a DLL that exports these two functions, name it atidxx32.dll, and put it someplace where CodeXL will find it. We then rename the real atidxx32 to something like ‘real_driver.dll’, and have our shim DLL load it and forward the calls. Everything is done with pointers, so as long as we make sure to pass them through correctly, we can do whatever we like in the meantime. This lets us do things like logging and setting breakpoints. For example, we can do this:

#define CALL __cdecl  // must be defined before the typedefs below use it
typedef DWORD (CALL *COMPILE_SHADER)( esi_struct*, edi_struct* );
typedef void  (CALL *FREE_SHADER)   (void*);
static COMPILE_SHADER   g_pCompileShader = 0; // initialized by a DLLMain
static FREE_SHADER      g_pFreeShader = 0;

__declspec(dllexport) DWORD CALL AmdDxGsaCompileShader( esi_struct* esi, edi_struct* edi )
{
    HRESULT h;
    Log("There is a stack variable at: 0x%08x\n",&h);
    Log("Before:\n");
    esi->Print();
    edi->Print();
    h = g_pCompileShader(esi,edi);
 
    Log("After:\n");
    esi->Print();
    edi->Print();
    return h;
}

Then we run CodeXL and have it compile a bunch of different shaders with a bunch of different asics. We get a whole bunch of results like this:

There is a stack variable at: 0x0028dc44
Before:
esi_struct at: 0x0028ded8 { 0x0000006e 0x00000029 0x003e7570 0x00000040 0x0028dec0 0x00000000 }
edi_struct at (0x0028decc): { 0x0000000c 0x00000000  0x00000000 }
After:
esi_struct at: 0x0028ded8 { 0x0000006e 0x00000029 0x003e7570 0x00000040 0x0028dec0 0x00000000 }
edi_struct at (0x0028decc): { 0x0000000c 0x0092ca70  0x000001b8 }

Now we look for patterns in the data. First off, CodeXL is always putting its esi/edi structs right next to one another. The edi_struct is obviously being filled with a pointer and a size. That’s where our output is going. Furthermore, the pointer in the edi_struct is the one that’s eventually passed to FreeCompiledShader later on, so that part’s taken care of.

The first dword of the esi_struct appears to identify the asic family. It is always 110 for chips that the AMD disassembler thinks of as asic(SI), and always 120 for asic(CI). By looping over a large space of possible values, I was able to figure out that 125, 130, and 135 are also accepted by the driver, but I’m not exactly sure what these are. One thing is certain: 130 and 135 are not using the same instruction encoding as the others.

The second dword is also constant for a given asic, but changes for asics within the same asic family, and unlike the first one, the driver is perfectly happy to accept whatever nonsense I pass in (even zero IIRC). I haven’t gone looking for the meaning of this value yet; I figured it’s enough just to pass the values I saw.

The third and fourth are clearly a pointer and a size. That’s where our D3D bytecode is. The last two parameters are a pointer to somewhere on the stack, and a zero. That stack pointer only points 12 bytes ahead of the edi_struct. It might point to something interesting, but I’ve found that I can null it out and the driver is perfectly content, as long as that last dword remains zero. My guess is that CodeXL is just leaving it uninitialized (as well they should, since it doesn’t have to be).

The input still takes a little more work. We know where our bytecode pointer is, but it turns out that it doesn’t point to the same place we got back from the D3D compiler. The D3D bytecode format is documented in d3d11tokenizedprogramformat.hpp, which is part of the Windows Driver Kit, but what they don’t tell you is where in the blob to look for it. By adding a D3D blob hex dump to Pyramid, I was able to figure out that the blob is just a series of sections identified by magic 4CC codes. It turns out that I’m not the only person who has figured this out. See here, for example. The pointer that CodeXL passes to the driver points at the beginning of a section named ‘SHDR’ or ‘SHEX’, depending on the shader model. It’s very easy to see this if you just look at an ASCII version of the blob. The section headers stand out quite clearly:

[Image: hex dump of a D3D blob, with the section headers visible in the ASCII column]
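
The search described above can be sketched in a few lines. This is a naive linear scan, not a real container parser. It assumes that the DWORD right after the 4CC is the section size in bytes and that the bytecode starts immediately after that DWORD. `BytecodeRange`, `ReadU32` and `FindShaderChunk` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct BytecodeRange {
    const uint8_t* data;  // first byte after the 4CC and size DWORD, or nullptr
    uint32_t       size;  // section size as stored in the blob
};

// Read a little-endian DWORD without assuming alignment.
inline uint32_t ReadU32(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Scan the blob for the first 'SHDR' or 'SHEX' tag whose size fits in the blob.
inline BytecodeRange FindShaderChunk(const uint8_t* blob, size_t blobSize) {
    BytecodeRange r = { nullptr, 0 };
    for (size_t i = 0; i + 8 <= blobSize; i++) {
        if (std::memcmp(blob + i, "SHDR", 4) != 0 &&
            std::memcmp(blob + i, "SHEX", 4) != 0)
            continue;
        uint32_t size = ReadU32(blob + i + 4);
        if (size > blobSize - (i + 8))
            continue;  // size runs past the end of the blob, so keep looking
        r.data = blob + i + 8;
        r.size = size;
        return r;
    }
    return r;
}
```

A byte-wise scan like this could in principle match four bytes that merely happen to spell a tag inside another section, so it is best treated as a starting point.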

Next, let’s try and figure out what the output looks like. This was, surprisingly, the easiest part. Looking at that address in the memory dump, the first thing I see are the letters ‘E’, ‘L’, ‘F’. ELF stands for Executable and Linkable Format. It’s a widely used format for executables and for feeding object code to linkers. Now that I know that they’re returning an ELF, I just need to grab some code from the internet and start examining it. Their ELFs are well formed, and contain a section called ‘.text’, which holds the exact same bytes that CodeXL shows me in its disassembler.
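
For readers who would rather not grab code from the internet, locating a named section is short enough to sketch by hand. The version below assumes a little-endian ELF32 image, which is a guess about what the driver returns and not something stated above, and it rejects anything else. The header offsets are the standard ELF32 ones; `Section` and `FindElf32Section` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Section {
    const uint8_t* data;  // nullptr if the section was not found
    uint32_t       size;
};

inline uint16_t Rd16(const uint8_t* p) { return uint16_t(p[0] | (p[1] << 8)); }
inline uint32_t Rd32(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Look up a section by name in a little-endian ELF32 image (assumed format).
inline Section FindElf32Section(const uint8_t* elf, size_t elfSize, const char* name) {
    Section none = { nullptr, 0 };
    if (elfSize < 52 || std::memcmp(elf, "\x7f" "ELF", 4) != 0) return none;
    if (elf[4] != 1 || elf[5] != 1) return none;  // ELFCLASS32 + ELFDATA2LSB only

    uint32_t shoff     = Rd32(elf + 32);  // e_shoff
    uint16_t shentsize = Rd16(elf + 46);  // e_shentsize
    uint16_t shnum     = Rd16(elf + 48);  // e_shnum
    uint16_t shstrndx  = Rd16(elf + 50);  // e_shstrndx
    if (shentsize < 40 || shstrndx >= shnum) return none;
    if (shoff > elfSize || uint64_t(shnum) * shentsize > elfSize - shoff) return none;

    // The section name string table is itself a section.
    const uint8_t* strHdr = elf + shoff + size_t(shstrndx) * shentsize;
    uint32_t strOff = Rd32(strHdr + 16), strSize = Rd32(strHdr + 20);
    if (strOff > elfSize || strSize > elfSize - strOff) return none;

    size_t nameLen = std::strlen(name);
    for (uint16_t i = 0; i < shnum; i++) {
        const uint8_t* sh = elf + shoff + size_t(i) * shentsize;
        uint32_t nameOff = Rd32(sh + 0);   // sh_name
        uint32_t off     = Rd32(sh + 16);  // sh_offset
        uint32_t size    = Rd32(sh + 20);  // sh_size
        if (nameOff >= strSize || nameLen + 1 > strSize - nameOff) continue;
        if (std::memcmp(elf + strOff + nameOff, name, nameLen + 1) != 0) continue;
        if (off > elfSize || size > elfSize - off) return none;
        Section s = { elf + off, size };
        return s;
    }
    return none;
}
```

Calling `FindElf32Section(ptr, size, ".text")` on the buffer returned through the edi_struct would then give the range to hand to a disassembler.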

Now that we know what the input and outputs look like, there’s one more thing we need to check on. There is still the possibility that the esi_struct is larger than we think it is. In order to make certain that we’ve got the signature correct, let’s surround our structs in a sea of chaos:

static HRESULT Test(  esi_struct* esi, edi_struct* edi )
{
    const size_t PAD_SIZE   = 2048;
    unsigned char* p = (unsigned char*) malloc( PAD_SIZE );
    for( size_t i=0; i<PAD_SIZE; i++ )
        p[i] = rand();
 
    size_t offs = rand() % (PAD_SIZE-(sizeof(edi_struct)+sizeof(esi_struct)));
    edi_struct* new_edi = (edi_struct*) (p+offs);
    esi_struct* new_esi = (esi_struct*) (new_edi+1);
    *new_edi = *edi;
    *new_esi = *esi;
 
    HRESULT h = g_pCompileShader(new_esi,new_edi);
    if( !SUCCEEDED(h) )
        Log(" FAIL\n");
 
    *edi = *new_edi;
    free(p);
    return h;
}
 
__declspec(dllexport) DWORD CALL AmdDxGsaCompileShader( esi_struct* esi, edi_struct* edi )
{
    edi_struct first_one = *edi;
    g_pCompileShader(esi,&first_one);
 
    for( int i=0; i<10000; i++ )
    {
        edi_struct current = *edi;
        Test(esi,&current);
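        // pHeader and nDataSize are the second and third dwords of edi_struct (d0 and d1 above),
        // renamed now that we know they hold the output pointer and its size
        if( current.nDataSize != first_one.nDataSize || memcmp(current.pHeader,first_one.pHeader,current.nDataSize) != 0 )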
            Log("Results differ\n");
        g_pFreeShader( current.pHeader);
    }
 
    *edi = first_one;
    return S_OK;
}

That worked, so I’m pretty confident there are no surprises waiting for us.

Step 4: Profit!

We now have everything we need to get shader code directly out of the driver. We first compile our blob. Then we search it for SHDR or SHEX. Then we read the bytecode size from the DWORD next to that. Then we load the driver DLL, GetProcAddress for AmdDxGsaCompileShader, fill out our esi and edi structs, and do the call. We get back an ELF binary, locate its ‘.text’ section, and there’s our ISA code, all ready to be disassembled.
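
Put together, the setup for the call might look roughly like the sketch below. The field names are guesses at meaning based on the observations in Step 3, and `CompileInput`, `CompileOutput`, `MakeInput`, `MakeOutput`, `Driver` and `LoadDriver` are all hypothetical. The struct layout only matches what the 32-bit driver expects in a 32-bit build, where pointers are 4 bytes. The Windows-only part is guarded so the rest compiles anywhere.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_WIN32)
#include <windows.h>
#endif

struct CompileInput {         // the "esi" struct
    uint32_t    asicFamily;   // 110 (SI) and 120 (CI) were observed
    uint32_t    asicId;       // constant per asic; meaning unknown
    const void* bytecode;     // start of the SHDR/SHEX bytecode
    uint32_t    bytecodeSize;
    const void* unknown;      // may be null
    uint32_t    mustBeZero;   // must be zero when 'unknown' is null
};

struct CompileOutput {        // the "edi" struct
    uint32_t size;            // must be 12
    void*    elf;             // filled in by the driver; later passed to the free function
    uint32_t elfSize;         // filled in by the driver
};

// Only meaningful in a 32-bit build; skipped elsewhere.
static_assert(sizeof(void*) != 4 || sizeof(CompileOutput) == 12,
              "output struct must be 12 bytes for the 32-bit driver");

inline CompileInput MakeInput(uint32_t family, uint32_t id,
                              const void* code, uint32_t size) {
    CompileInput in = { family, id, code, size, nullptr, 0 };
    return in;
}

inline CompileOutput MakeOutput() {
    CompileOutput out = { 12, nullptr, 0 };
    return out;
}

#if defined(_WIN32)
typedef uint32_t (__cdecl *CompileFn)(CompileInput*, CompileOutput*);
typedef void     (__cdecl *FreeFn)(void*);

struct Driver {
    HMODULE   module;
    CompileFn compile;
    FreeFn    release;
};

// Loads the driver and resolves the two exports; returns false if anything is missing.
inline bool LoadDriver(Driver* d) {
    d->module = LoadLibraryA("atidxx32.dll");
    if (!d->module) return false;
    d->compile = reinterpret_cast<CompileFn>(
        GetProcAddress(d->module, "AmdDxGsaCompileShader"));
    d->release = reinterpret_cast<FreeFn>(
        GetProcAddress(d->module, "AmdDxGsaFreeCompiledShader"));
    return d->compile && d->release;
}
#endif
```

After `LoadDriver` succeeds, the call itself is `driver.compile(&in, &out)`, followed by `driver.release(out.elf)` once the ELF has been copied or disassembled. The DLL has to stay loaded until then.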

All that’s left to do now is write a GCN disassembler. AMD has been kind enough to publish extensive documentation about their instruction set, so it’s a simple matter. Well, simple in one sense. It’s a very large instruction set, and there are a few errata in the docs (for example, the VOP3 forms of VOPC opcodes are apparently VOP3A, not VOP3B as the docs claim). Still, considering how big a document this is, it was remarkably straightforward to do what I needed with it.  The most difficult part of this process was copy-pasting all of the opcodes from the PDFs into a source file.
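
To give a flavor of what the first step of such a disassembler involves, here is a small sketch that sorts a 32-bit instruction word into a few of the encoding families by its leading bits, and pulls the fields out of a SOPP word. The bit patterns follow my reading of the public Southern Islands ISA document and should be checked against it. Only a handful of families are covered, and `Encoding`, `Classify` and `DecodeSopp` are illustrative names, not Pyramid’s.

```cpp
#include <cassert>
#include <cstdint>

enum class Encoding { SOP2, SOPK, SOP1, SOPC, SOPP, VOP2, VOP1, VOPC, Other };

// Classify a 32-bit instruction word by its leading bits.
// Longer prefixes are tested first because SOP1/SOPC/SOPP sit inside
// the SOPK prefix space, which in turn sits inside SOP2's, and
// VOP1/VOPC sit inside VOP2's.
inline Encoding Classify(uint32_t w) {
    switch (w >> 23) {                      // 9-bit prefixes
        case 0x17D: return Encoding::SOP1;  // 1011 1110 1
        case 0x17E: return Encoding::SOPC;  // 1011 1111 0
        case 0x17F: return Encoding::SOPP;  // 1011 1111 1
        default: break;
    }
    if ((w >> 28) == 0xB)  return Encoding::SOPK;  // 1011
    if ((w >> 30) == 0x2)  return Encoding::SOP2;  // 10
    if ((w >> 25) == 0x3F) return Encoding::VOP1;  // 0111 111
    if ((w >> 25) == 0x3E) return Encoding::VOPC;  // 0111 110
    if ((w >> 31) == 0)    return Encoding::VOP2;  // 0
    return Encoding::Other;                        // VOP3, SMRD, DS, MUBUF, ...
}

struct Sopp {
    uint32_t op;      // bits 22..16
    uint32_t simm16;  // bits 15..0
};

inline Sopp DecodeSopp(uint32_t w) {
    Sopp s = { (w >> 16) & 0x7Fu, w & 0xFFFFu };
    return s;
}
```

For example, the word BF810000h classifies as SOPP with opcode 1, which is s_endpgm. A real disassembler then needs an opcode table per family, which is where all that copy-pasting from the PDFs comes in.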

Since I don’t ever intend to write a corresponding assembler, I tried to make my disassembly a bit more elaborate, and include visual hints about what’s going on. You can grab the latest Pyramid and give it a try.