Let’s Close the Buffer Zoo

DISCLAIMER: This post is not about comparing Metal to DirectX. Metal is not better than DX12. DX12 is not better than Metal. Their goals and design philosophies are different and that is ok. I like and dislike both Microsoft and Apple in equal measure.

In this post, I’m going to pontificate on the way that buffer resources are handled in DirectX. There are five different types of buffer objects in HLSL shaders, and each type requires that shader authoring and resource setup be done in slightly different ways. The five buffer types vary greatly in their functionality, performance, and ease of use. I’m going to walk through the current API design for buffers and suggest some revisions that I think will make our lives easier.

Full disclosure: We all know there’s a new HLSL compiler being worked on. I haven’t seen it, but if you have, I’d love to hear how it compares.

Before wading into background material, here’s a quick summary of my position:

  1. Metal’s buffer binding model is right. A single buffer object bound at the API. Address space specifiers in shaders.
  2. Metal’s pack/unpack intrinsics are right.
  3. Metal’s use of pointers may not be right. A different abstraction might be better.
  4. Metal needs typed buffers.
  5. Structured (swizzled) buffers should be directly exposed and controlled by the shader writer.
  6. Append/Consume buffers need to go away and be replaced with a proper GDS API (SM6 seems to be moving in this direction).
  7. The time has come to allow buffer resource VAs to be passed as data to shaders.

Now, let’s start by talking about all the species of buffers in today’s HLSL.

Constant Buffers

Constant buffers are probably the most familiar type. You declare a big block of globally visible data items, bind them to a constant buffer slot at the API, and then reference them directly in a shader. Constant buffers are read-only in a shader and have upper bounds on the amount of data that can be addressed. Constant buffers are the best choice when data is uniform across shader invocations, because hardware has special optimizations for this usage pattern (dedicated caches, “push constants”, and the like).

For example:

struct Something
{
    float2 x;
    float2 y;
};
 
cbuffer foo
{
   float3 Field0;
   float Field1;
   float4 Field2;
   Something things[2];
}
 
float4 code() : SV_Position
{
    return things[0].x.xxxx + things[1].y.xxxx + Field2.xyzw;
}

Constant buffers have a strict upper limit on the size of the data that they can expose, and are required to be bound to shaders with 256-byte alignment. The first restriction makes sense, IMO, but the second does not. It exists because one hardware vendor requires it and doesn’t want to change it. This is unfortunate, because per-object constant payloads are often far smaller than 256 bytes, and the amount of waste from padding can be considerable.
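
To put a number on the padding problem, here’s a minimal sketch (the contents are made up): this per-object block holds 16 bytes of real data, but every instance has to start on a 256-byte boundary.

// 16 bytes of actual data, but each bound instance must be placed at a
// 256-byte-aligned offset, so packing many of these into one upload
// buffer wastes 240 of every 256 bytes on padding.
cbuffer PerObject
{
    float4 Tint;
};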

Typed Buffers

Or “uniform texel buffers” as Vulkan calls them.

These are bags of texel elements that are addressed with an element index, and support format conversion at load. So for instance you could write one piece of code like:

Buffer<uint4> buffer : register(t0);
uint4 data = buffer.Load(0);

You can bind an R8G8B8A8_UINT buffer and use it to load sets of four byte-sized elements, each one zero-extended to 32 bits, or you can bind an R16G16B16A16_UINT buffer and load shorts, or an R32G32B32A32_UINT buffer and load full ints. Conversion is handled at load time, and the shader doesn’t have to know how many bits the source data had.

You can also use a float4 buffer and bind R8G8B8A8_UNORM, R16G16B16A16_FLOAT, R10G10B10A2_UNORM, and many others, and the shader doesn’t need to know. Being able to abstract the storage format from the shader is sometimes useful. A use case I’ve found in shipping code is fetching index data from integer typed buffers and varying the index width (8-bit, 16-bit, 32-bit) without the shader needing to know about it. This is convenient, and avoids overhead from shader permutations and/or branching on the index size.
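
Here’s a minimal sketch of that trick (the shader code is illustrative, though the DXGI formats are real): the shader never changes, while the application picks the element width by choosing the SRV format at bind time.

// The same shader serves 8/16/32-bit indices: the application selects
// DXGI_FORMAT_R8_UINT, R16_UINT, or R32_UINT when creating the SRV, and
// the shader only ever sees zero-extended 32-bit uints.
Buffer<uint> IndexData : register(t0);

uint FetchIndex( uint i )
{
    return IndexData[i]; // format conversion happens inside the load
}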

There’s a second useful feature of typed buffers that doesn’t get enough recognition. Back in the DX10 days, when these things were introduced, somebody decided that cbuffer and tbuffer were so similar that they added matching syntax for loading data from typed buffers. You can take the cbuffer code above, change cbuffer to tbuffer, and it will behave exactly the same, except that it will do its loads from an SRV buffer instead of a constant buffer. Unlike a constant buffer, there is no arbitrary upper limit on the size of the buffer, which means that, with a little cheating, you can use a tbuffer for pushing structured instance data like so:

struct SomeData
{
    float Something;
    float4 SomethingElse;
    float4x4 Bones[32];
};
 
tbuffer SomePerInstanceData : register(t0)
{
    // HLSL does not have zero sized arrays, but we can fake it with size 2.
    // Size 2 is necessary to force the compiler to generate a dynamic indexing.
    //  "Out of bounds" indexing works fine as long as the underlying buffer is big enough.
    SomeData g_Instances[2]; 
};
 
float4 SomeCode( float4 pos : position,
                 uint4 boneIndex : BoneIndex,
                 uint id : SV_InstanceID ) : SV_Position
{
     return mul( pos, g_Instances[id].Bones[boneIndex.x] ) +
            mul( pos, g_Instances[id].Bones[boneIndex.y] ) +
            g_Instances[id].SomethingElse;
}

Using this syntax in shaders is much simpler than trying to use a Buffer and loading everything manually. Unfortunately, this syntax forces all the loads down the format-conversion path, which on many chips is not optimal.

Byte Address Buffers

This is a “bag of bytes” which we load directly and re-interpret to whatever data types we need. Except that it’s not actually a bag of bytes, because the addresses are all required to be dword aligned and the data are loaded at dword granularity, which makes it a bag of dwords whose indices are all multiplied by 4 for no particular reason.

Unlike typed buffers, byte address buffers always load dwords directly from memory. This is faster for a lot of hardware, because it avoids the format conversion path and in some cases is backed by a completely different cache. Unfortunately, this buffer type is extremely painful to use because it requires manually loading every dword. Here is an example:

 
 
// Using TBuffers
tbuffer foo : register(t0)
{
   float4 other_thing;
   float4x3 mBones[43];
};
 
float4 TBuffer( float3 P : pos, uint bone_idx : idx) : SV_POSITION
{
    return mul( mBones[bone_idx], P ); 
}
 
// Using Byte address buffers with same memory layout
 
ByteAddressBuffer b : register(t0);
 
float4 ByteAddress( float3 P : pos, uint bone_idx : idx ) : SV_POSITION
{
     float4x3 mTransform = {
         asfloat( b.Load4( 16 + 48*bone_idx ) ),
         asfloat( b.Load4( 32 + 48*bone_idx ) ),
         asfloat( b.Load4( 48 + 48*bone_idx ) ),
     };

     return mul( mTransform, P );
}

This is like being forced to write C code without being allowed to use the ‘struct’ keyword, and having to use byte offsets for all array indices. It’s pretty rough, so I’d venture to guess that people don’t often do it. The problem with such code isn’t that it’s hard, it’s that it’s more code, and code is the enemy of maintenance. If you want to move fields around in a structure, you have to go hunt for all the code that depended on that implicit layout and touch it all, and it’s very easy to miss something and screw it up. The ‘tbuffer’ syntax could have been implemented for byte address buffers, but wasn’t, which is pretty unfortunate. Somebody could write a meta-language that compiles everything down to ByteAddressBuffer code, but as far as I know nobody’s done it yet.

If you are an IHV engineer and are puzzled about why apps prefer TBuffers over byte address buffers, this is the cause. Your performance problems are being caused by a usability problem in the shading language.

Structured Buffers

A structured buffer is a buffer resource that stores elements of a struct type; it cannot contain primitive float/vector types. The structure fields are loaded directly from the buffer without any format conversion. Functionally, a structured buffer is exactly the same as a byte address buffer, in that you can implement exactly the same code using either one. It’s just that the API for creating them is completely different, and requires you to specify an explicit structure stride instead of just allocating memory and interpreting it however you want. Structured buffers have a stride limit of 2048 bytes, which is good enough for storing small blocks of data (e.g. particles), but too small to make them a complete replacement for ‘tbuffers’ as discussed above (there is no structure size limit on a tbuffer, at least not that I know of).
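
To make that equivalence concrete, here’s a hedged sketch (the type and resource names are made up): both functions read the same 16-byte records, one through a structured buffer whose stride was declared at resource creation, and one through a byte address buffer where the stride lives in the shader.

struct Particle
{
    float3 position;
    float  age;       // 16 bytes total
};

StructuredBuffer<Particle> particles : register(t0);
ByteAddressBuffer          raw       : register(t1);

float AgeStructured( uint i )
{
    return particles[i].age;  // stride was baked into the resource
}

float AgeRaw( uint i )
{
    // Same data, same layout; the 16-byte stride is now the shader's problem.
    return asfloat( raw.Load( 16*i + 12 ) );
}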

Apparently, the motivation for structured buffers was that some hardware has special fast paths for “swizzling” structured data. Consider the following:

struct foo
{
	float4 bar;
	float4 baz;
	float4 glut;
	float4 merde;
};
 
StructuredBuffer<foo> tfoo;
RWBuffer<float> output;
 
[numthreads(64,1,1)]
void main( uint3 id: SV_DispatchThreadID ) 
{
     output[id.x] = tfoo[id.x].bar.x + 42;          
}

This is a pretty terrible access pattern because every SIMD lane is guaranteed to pull from a different cache line. It would be better if the data were laid out in a swizzled SoA form, as shown below:

// Data layout:
 
// struct is "swizzled" with a stride of 64
//  Groups of 64 structures have their data elements  transposed:
//
// x0 x1 x2 ... x63
// y0 y1 y2 ... y63
//...
//... repeat for the other 14 elements in the struct...
// ....
//  Next group of 64 structs starts here
//x64 x65 ... x127
//y64 y65 ... y127
 
// Corresponding code:
 
Buffer<float> data;
RWBuffer<float> output;
 
[numthreads(64,1,1)]
void main( uint3 id: SV_DispatchThreadID ) 
{
     uint msb = id.x / 64;
     uint lsb   = id.x % 64;
     uint offs  = 16*64*msb + lsb;
     output[id.x] = data[offs] + 42;          
}

With this layout, cache coherence between SIMD lanes is greatly improved. There exists hardware which knows how to take a linear structure address (index and byte offset) and swizzle the memory layout so that the struct members are automatically stored in this form. Both Intel and AMD hardware support something along these lines. For more information, search the GCN3 ISA docs for “index stride” or the Intel PRMs for the STRBUFF surface type. Nvidia probably does too, but they don’t like telling people how their hardware works.

My problem with all of this is that memory layout is something that really ought to be under the full control of the app. Swizzling is not always the optimal layout. In the event that shaders always read all of the structure at once, an AoS layout can be pretty good, because everything that’s loaded is eventually used, and the compilers can vectorize the load instructions. For example:

struct foo
{
	float4 bar;
	float4 baz;
	float4 glut;
	float4 merde;
};
 
StructuredBuffer<foo> tfoo;
RWBuffer<float4> output;
 
[numthreads(64,1,1)]
void main( uint3 id: SV_DispatchThreadID ) 
{
     output[id.x] = tfoo[id.x].bar + tfoo[id.x].baz;
}

Compiles to this on GCN:

            S_LSHL_B32     s0,   s12,     6
             V_ADD_I32     v0,    s0,    v0   
TBUFFER_LOAD_FORMAT_XYZW  v1[4],    s4[4] [v0*STRIDE] 32_32_32_32_FLOAT 
TBUFFER_LOAD_FORMAT_XYZW  v5[4],    s4[4] [16+v0*STRIDE] 32_32_32_32_FLOAT 
             S_WAITCNT     vmcnt(0) 
             V_ADD_F32     v1,    v1,    v5   
             V_ADD_F32     v2,    v2,    v6   
             V_ADD_F32     v3,    v3,    v7   
             V_ADD_F32     v4,    v4,    v8   
BUFFER_STORE_FORMAT_XYZW  v1[4],    s8[4] [v0*STRIDE]  
              S_ENDPGM

If you want to see what Intel’s compiler does, go pester your devrel contacts.
Locality for linear buffers is not as good when the indices are coherent, but linear can be better if the thread->struct mapping is not coherent, or if the elements loaded are “dynamically uniform” in the shader (as when looping over lights). In that case, swizzling may make things worse, because the cache footprint will be larger. Buffer swizzling is really just a special addressing mode that can optionally be used on buffers, and it’s something that ought to be placed under application control. The optimal buffer layout is a complex function of warp/wave size and the properties of the passes that produce and consume the data. There’s no way that a driver can simply guess it for us.

Append/Consume Buffers

Yet another kind of buffer, but this one has a magic “hidden counter” under the covers that was pushed into the API because AMD has a global data share (GDS). Instead of directly exposing the GDS in the API, and forcing the other vendors to emulate it with ordinary memory and atomics, we constructed a narrow high-level feature around a particular use case (append buffers for OIT and particles).

Then we added the converse (consume buffers, for ????), because, well, I guess it just seemed like a good idea. To see how silly consume buffers are, consider how you’re supposed to dispatch threads to service their results. You do a bunch of appends, copy the counter into an IndirectArgs buffer, and then dispatch that many… threads…? No, it has to be a thread group count, which means you need to kick ONE compute wave to round ONE DWORD up to a multiple of the group size (see the sketch below), unless you want to use a single-element thread group, which you’ll soon regret, because your processing will run terribly on some hardware, whose drivers will just stick each group into a single-thread warp.
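
Here’s what that rounding wave looks like (a hedged sketch; the resource names are made up, and the counter is assumed to have been copied into the first dword of AppendCount by the app beforehand):

// One whole dispatch whose only job is to round one dword up to a
// multiple of the thread group size (64 here).
ByteAddressBuffer   AppendCount  : register(t0); // counter value, copied here by the app
RWByteAddressBuffer DispatchArgs : register(u0); // IndirectArgs: (x,y,z) group counts

[numthreads(1,1,1)]
void RoundUp()
{
    uint n = AppendCount.Load(0);
    DispatchArgs.Store3( 0, uint3( (n + 63) / 64, 1, 1 ) );
}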

The append/consume pair should be replaced by an API that simply exposes the GDS and the full range of associated operations, and requires hardware without a GDS to emulate it with memory, which is what they’re already doing anyway to make the global counter stuff work. If we had the ability to do things to the GDS other than simple increment/decrement, we’d have a lot more flexibility. Those spurious extra waves I complained about might still need to be there, but at least it would be clear why they were there, and we could even optimize them by having the rounding shader pull directly from GDS instead of requiring a round trip through the cache.
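
If the GDS were exposed directly, the same wave could skip the staging copy entirely. A sketch (the GDSLoad intrinsic is entirely hypothetical; nothing like it exists in HLSL today):

RWByteAddressBuffer DispatchArgs : register(u0);

[numthreads(1,1,1)]
void RoundUpFromGDS()
{
    // Hypothetical: read the counter straight out of GDS, skipping the
    // copy into a staging buffer and the round trip through the cache.
    uint n = GDSLoad( 0 );
    DispatchArgs.Store3( 0, uint3( (n + 63) / 64, 1, 1 ) );
}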

The Metal Model

If you’re not familiar with the Metal shading language, I’d encourage you to have a look, because it’s very different, and closer in some ways to what I think we want. Here’s an HLSL compute shader that calculates transformed triangle normals:

cbuffer Globals
{
     float3x3 mXForm;
     uint BaseIndex;
     uint BaseVertex;
};
 
Buffer<float3> VertexBuffer;
Buffer<uint> IndexBuffer;
RWBuffer<float3> Out;
 
[numthreads(64,1,1)]
void main( uint tid : SV_DispatchThreadID )
{
	uint i0   = IndexBuffer[ BaseIndex + 3*tid ];
	uint i1   = IndexBuffer[ BaseIndex + 3*tid+1 ];
	uint i2   = IndexBuffer[ BaseIndex + 3*tid+2 ];
	float3 v0 = VertexBuffer[BaseVertex+i0];
	float3 v1 = VertexBuffer[BaseVertex+i1];
	float3 v2 = VertexBuffer[BaseVertex+i2];
	Out[tid] = normalize( mul( mXForm, cross( v1-v0, v2-v0 ) ));
}

Here is the corresponding Metal code:

// DISCLAIMER:  This has not been test-compiled
struct Globals
{
   float3x3 mXForm;
   uint BaseVertex;
   uint BaseIndex;
};
 
kernel void
calc_normals( const device float3* VertexBuffer [[ buffer(0) ]],
              const device uint* IndexBuffer [[ buffer(1) ]],
              const constant Globals* globals [[ buffer(2) ]],
              device float3* Out [[ buffer(3) ]],
              uint tid [[ thread_position_in_grid ]] )
{
    uint BaseVertex = globals->BaseVertex;
    uint BaseIndex  = globals->BaseIndex;
    uint i0   = IndexBuffer[ BaseIndex + 3*tid ];
    uint i1   = IndexBuffer[ BaseIndex + 3*tid+1 ];
    uint i2   = IndexBuffer[ BaseIndex + 3*tid+2 ];
    float3 v0 = VertexBuffer[BaseVertex + i0];
    float3 v1 = VertexBuffer[BaseVertex + i1];
    float3 v2 = VertexBuffer[BaseVertex + i2];
    Out[tid]  = normalize( globals->mXForm * cross( v1-v0, v2-v0 ) );
}

The key differences are these:

  1. There are no “globals”. Everything is passed to the entry point function as a pointer and referenced from there. This flies in the face of established shader programming idioms, and makes source-to-source translation difficult, but not impossible. In order to deal with it in a translator, you basically need to walk the call graph and inline everything during translation. You need to do that anyway in order to generate resource binding information. I understand why they did this, but I don’t think it’s beneficial, just different.

  2. Constant buffers and device buffers are treated exactly the same way, using pointers, and the pointers are exactly like the pointers we’re used to in C, except that they’re qualified with a particular address space. The choice of address space is entirely up to the app, based on the expected frequency of update and access. Things that are always uniform across a thread group should be put into the ‘constant’ address space. Things that will vary should be in the ‘device’ address space. Constant and device buffers are otherwise exactly the same, and are bound the same way at the API. There is one wrinkle: on OSX, the two buffer types have different alignment requirements for binding, because the obstinate vendor’s hardware needs them 256-byte aligned, and Apple didn’t put their foot down and force them to emulate the more sensible iOS requirement.

  3. The pointers are strongly typed. In the HLSL code, the IndexBuffer and VertexBuffer could have arbitrary element formats bound at the API, as long as their elements can be coerced into uint or float. This means that the same shader can be used for 16- and 32-bit indices. In the Metal code, the buffer pointers are strongly typed. A uint is a uint, period, and if you want to point to 16-bit elements you have to write ‘ushort’. If you want to use the same shader for both index widths, you need to somehow communicate that to the shader and branch into a different path where the pointer is cast to ‘ushort’ instead (or else dispatch a separate version at the API). This lack of support for opaque typed buffers makes full shader code portability between HLSL and Metal impossible.

One useful thing Metal has, which DX currently doesn’t, is the ability to use sub-dword data types in data structures. There is no way in HLSL to create a 16-bit field in a constant buffer and manipulate it directly. You have to pack it into a uint and write code to manually pack or unpack it (and no, using min16int does not do the trick). In Metal, you can just declare a ushort variable and it works the way you expect.
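
For reference, here’s the packing dance HLSL forces today (a minimal sketch; the field and function names are made up):

cbuffer Params
{
    // Two fp16 values that the app packed into one dword by hand.
    // Declaring them as min16float wouldn't help; the storage is still 32 bits.
    uint packed_halves;
};

float2 UnpackParams()
{
    return float2( f16tof32( packed_halves ),
                   f16tof32( packed_halves >> 16 ) );
}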

Re-Evolving Buffers

I think we should re-design buffers in HLSL from scratch. To do that, let’s deconstruct this and identify the functionality that software needs, and the functionality that hardware provides and wants to use. On the software side, we need to be able to do the following:

  1. Load and store arbitrary data structures, including mixtures of bytes, shorts, dwords, and floats of all sizes.
  2. Convert data from packed form (unorm, snorm, uint8) to unpacked form (f32, uint32) for computation.
  3. Convert from unpacked form back into packed form for storage.
  4. Do atomic operations at appropriate granularity (local vs global).
  5. Load “typed” data from an array of packed elements whose format can vary at runtime (e.g. 16 vs 32 bit indices).
  6. Load data from “strided” or “swizzled” buffers.

This is all that software really needs. Apart from optimization, software doesn’t need to know or care how the data gets in or out of the shader. That part is up to the implementation, and we should assume for the purposes of design that implementations won’t totally suck.

On the hardware side, we have the following. Not all hardware possesses all of these things, but we should keep them all in mind when deciding how we want this to work. We should assume we’re targeting an “every-machine” that contains the union of all hardware mechanisms, and design something that gives that uber-machine maximum freedom. The features of the “every-machine” are:

  1. Dedicated data paths designed for uniform constants, including “constant caches”, “push constants”, and various other mechanisms.
  2. Opaque typed load instructions, where the element format is specified as part of a resource descriptor that is unknown at shader compile time. This is necessary for implementing DX10-style tbuffers.
  3. Transparent typed load instructions, which load data at a particular address and perform free unpacking from a known packed type as part of the load operation.
  4. Untyped load instructions, which can load dwords, shorts, or bytes directly and sign/zero extend them.
  5. Unpacking instructions, which can take a dword and unpack it (e.g. extracting two float16s from a single uint).
  6. Structured load instructions, which can automatically apply a dword-aligned stride to an index.
  7. Lots of other weird addressing modes (swizzled addresses, hashed addresses, other stuff), which are currently wrapped up in a resource descriptor.

For exploiting “constant” fast paths (1), I think that the memory spaces in Metal/Cuda/OpenCL basically get it right. We need the programmer to make the constant/device distinction in order to give the necessary hints about access pattern and update frequency. However, once we’ve marked things as ‘constant’ or ‘device’, the compiler can take over and figure out how best to make the data available. It can take statically indexed fields and map them into “push constants” or “constant caches” or whatever the hardware’s got, and it can do something different for dynamically indexed fields. It may or may not use the programmer-provided hints, and it can also compensate for the programmer to a degree. If you mark something as ‘device’ and only read it with a literal, dynamically uniform index, a compiler can figure that out and send that data through the ‘constant’ path instead.
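
Sketching what that division of labor might look like in shader code (this borrows the DeviceMemory/ConstantMemory objects proposed later in this post, so none of it is real HLSL):

ConstantMemory cm;  // marked uniform: eligible for push constants / constant caches
DeviceMemory   dm;  // marked varying: ordinary memory loads

float Shade( uint i )
{
    float scale = asfloat( cm.LoadDword( 0 ) );     // uniform data, fast path
    float datum = asfloat( dm.LoadDword( 4 * i ) ); // per-thread data, device path

    // Declared 'device' but read at a literal offset: as argued above, a
    // compiler is free to route this through the constant path anyway.
    float bias = asfloat( dm.LoadDword( 4 ) );

    return scale * datum + bias;
}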

All of the details in 3-6 are just instruction selection problems for the compiler people to solve. Consider the following HLSL code that loads four half-precision values and unpacks them to float:

float4 LoadFP16x4( uint idx, ByteAddressBuffer buff )
{
    uint2 dwords = buff.Load2( 8*idx );
    return float4( f16tof32( dwords.x ),
                   f16tof32( dwords.x >> 16 ),
                   f16tof32( dwords.y ),
                   f16tof32( dwords.y >> 16 ) );
}

In theory, a compiler can turn this into a 64-bit load followed by some up-conversion instructions, or a typed load that reads 64 bits from memory and up-converts implicitly, or a pair of 32-bit loads and half2->float2 conversion instructions, or whatever else might be optimal. This is an instruction selection problem, and compiler people are pretty good at solving these. We might do well to make their pattern-matching problem easier by adding some more specific intrinsics to the language, as Metal does. For example:

float4 LoadFP16x4( uint idx, ByteAddressBuffer buff )
{
    // maps to either a typed load instruction or a dword load + conversion
    uint2 dwords = buff.Load2( 8*idx );
    return unpack_half4_to_float4( dwords );
}

The general idea of “load plus unpack” is a pretty natural evolution from where we are to where we want to get. It also creates a clear distinction between storage formats, of which there are a rich variety, and compute formats, of which there are generally fewer, and it avoids complicating the type system with things like ‘unorm’ and ‘snorm’.
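
And the mirrored store side might look like this (pack_float4_to_half4 is hypothetical, invented to match the unpack intrinsic above; today you’d write f32tof16 pairs and OR them together by hand):

void StoreFP16x4( uint idx, RWByteAddressBuffer buff, float4 v )
{
    // maps to either a typed store instruction or conversion + a dword store
    buff.Store2( 8*idx, pack_float4_to_half4( v ) );
}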

All of the weird addressing modes in (7) are, IMO, best addressed by some language extensions (see below), but I think that the shader, and ONLY the shader, should be the one dictating memory layouts. Implementations may put it in the descriptor, but the API should not. If we add shader control over swizzling, GCN can implement it directly by loading a partial descriptor and patching it using SALU instructions. Less flexible hardware can and should be forced to emulate swizzling by doing all the address arithmetic, and its makers will then either scramble to add more flexible hardware, or else discover that the extra math really isn’t that bad.

The final consideration is what to do about the “opaque typed buffer” capability (2). It’s worth arguing about whether this feature is even useful, because we could always force shader writers to generate permutations. If we’re going to keep it around, the best approach is probably to keep it in the API as another class of texture resource, which seems to be the direction that Vulkan went in.

API Side Changes

At the API, I think we should keep “Typed Buffers” exactly the way they are, but eliminate Structured Buffers, Raw Buffers, Append/Consume Buffers, and Constant Buffers. These should all be replaced with memory objects that are bound to the root of the pipeline and used as the shaders dictate. Memory objects should be bound by passing a GPU VA directly to the API using the moral equivalent of a root constant. The API can be something like: SetRootPointer( slot, gpu_va ). This is not that far from the root descriptors that we currently have, but I’m in favor of making two “cultural” changes to DX:

First, we should stop acting like constant buffers are special. A constant buffer is nothing but a block of memory that is accessed using special hardware paths. As long as the shader can mark what is considered “constant”, and as long as the API provides the address, then the driver will have all of the information it needs. It does no good to separate the two at the API level, because drivers may decide to send immediate indexed “device” data down the constant path, or random access “constant” data down the device path, rendering any API-level distinction moot.

Second, we should try and stop thinking about “bindless buffers”. As soon as we get the ability to pass GPU VAs indirectly (as data), then we won’t need bindless buffers anymore. We can pass a few pointers at the root level and then put whatever data structures we want into the corresponding memory, including massive arrays of other addresses, if that’s what we want. “Bindless buffers” is something that applications can implement for themselves once the memory model is relaxed. In addition, “Bindless CB” is a nonsense phrase to me, because all these special mechanisms we’re talking about just don’t work bindlessly by nature. The only way to make constants “bindless” is to use loads from memory, and at that point, our CB is just a B.
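
Here’s roughly how an app could roll its own bindless table under this proposal (a sketch using the DeviceMemory and LoadPointer constructs introduced in the next section; all names are mine):

DeviceMemory table;  // root-bound; holds a flat array of 64-bit GPU VAs

float FetchFromObject( uint object_id, uint offset )
{
    // Chase the pointer stored in the table, then load from the target buffer.
    DeviceMemory obj = table.LoadPointer( 8 * object_id );
    return asfloat( obj.LoadDword( offset ) );
}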

Shader Side Changes

At the shader side, I propose separating the memory for buffer objects from the type information that interprets it. Let’s introduce new object types called ‘DeviceMemory’ and ‘ConstantMemory’, which behave like byte-address buffers in the corresponding address spaces. DeviceMemory and ConstantMemory are bound to shaders using descriptors. We can support loading primitive types directly from memory objects using something like the existing ByteAddressBuffer syntax:

DeviceMemory mem;
uint d = mem.LoadDword(offset);
uint2 d2 = mem.LoadDwordx2(offset);
uint sign_extended_byte = mem.LoadByte(offset);

We can and should also support passing DeviceMemory as data:

DeviceMemory mem;
DeviceMemory ptr = mem.LoadPointer(offset);

Next we’ll add a ‘MemoryView’. Memory objects and MemoryViews are variables which are required to be initialized, are not assignable, and are passed by reference to function calls. A MemoryView behaves like a typed pointer in C, and supports indexing and dereferencing, but IMO does NOT need to support pointer arithmetic.

At this point, it’s worth asking whether or not we should just add good old-fashioned pointers to the language. We’re already most of the way there, so why not? I’m actually kind of indifferent, but I prefer a different syntax for a few reasons:

  1. I’m already used to accessing memory differently in shaders, and I don’t see much value add from pointers besides familiarity.

  2. A more limited syntax might give stronger guarantees to the compiler.

  3. Views allow complicated addressing modes like swizzled buffers to be directly expressed, while still adding some abstraction over the icky details. It’s not clear how to get there with pointers without adding a lot of additional baggage to the type system.

Here’s a big pile of example code to illustrate what I’m thinking of. Syntax is provisional and somewhat arbitrary:

struct SomeStruct
{
   float thing;
   float otherThing;
   uint Unorm4;
   uint Snorm4;
   ushort Snorm2;
   ushort UNorm2;
   SomeOtherStruct SomeStructThing;    // some other struct, defined elsewhere
   DeviceMemory pointer_inside_struct; // This is a 64-bit memory address
   uint vertex_stride;                 // feeds the StridedView example below
   int SomeArray[42];
};
 
DeviceMemory db;    // A "device" buffer, referenced via a root descriptor
ConstantMemory cb;  // A "constant" buffer, referenced via a root descriptor
 
// Global buffer views with compile-time offsets are allowed, because, why not.
// Lots of shader code relies on globally visible data.
MemoryView<SomeStruct> GlobalView = cb.MemoryView<SomeStruct>(0);
 
// We can write code that operates on values in memory without knowing how it's accessed
float Function( MemoryView<SomeStruct> view )
{
    return view->thing + view[1].otherThing +
           view->SomeArray[13] +
           dot( unpack_unorm4( view[0].Unorm4 ), unpack_snorm4( view[1].Snorm4 ) ) +
           dot( unpack_snorm2( view[0].Snorm2 ), unpack_unorm2( view[1].UNorm2 ) );
}
 
void main( uint offset : offset )
{
    // Primitive load/store can be done directly on the buffers themselves, if that's easier.
    // This works exactly like the current 'ByteAddressBuffer'.
    uint x = cb.LoadDword(offset);
    uint y = db.LoadByte(offset);

    // We can also create a "View" and reference data at an arbitrary location
    MemoryView<SomeStruct> structs_constant = cb.MemoryView<SomeStruct>( offset );

    // We can also create the same view from a device buffer
    MemoryView<SomeStruct> structs_device = db.MemoryView<SomeStruct>( offset );

    // Create a view that swizzles struct members in rows of 8 dwords.
    // HW that doesn't support swizzled addressing via descriptors is required to emulate it.
    MemoryView<SomeStruct> swizzled = db.SwizzledView<SomeStruct>( offset, 8 );

    // Create a view that indexes at fixed intervals (regardless of type).
    // This would be useful if 'SomeStruct' were the root of an inheritance hierarchy.
    // Could be implemented as a descriptor whose stride is patched at runtime by the shader.
    MemoryView<SomeStruct> strided = db.StridedView<SomeStruct>( offset, structs_constant->vertex_stride );
    strided[1].thing = 0; // one stride from the start
    strided[2].thing = 0; // two strides from the start

    MemoryView<float4> aliased = strided.Reinterpret<float4>( with_an_offset_if_we_want );

    // Aliasing back gives us a skewed version of the original view. This is probably an evil thing to do.
    MemoryView<SomeStruct> re_aliased = aliased.Reinterpret<SomeStruct>(0);

    // These function calls do the same thing, but fetch different data and in different ways
    // based on the memory views that are passed in. We assume full function inlining here.
    //
    // If we require immutable views and full function inlining, then the compiler can infer the
    // addressing properties and memory space for any view by walking up the chain.
    //
    float r0 = Function( structs_constant );
    float r1 = Function( structs_device );
    float r2 = Function( GlobalView );
    float r3 = Function( strided );
    float r4 = Function( re_aliased );
    float r5 = Function( swizzled );
}
