Vertex Cache Measurement

Some time ago I was inspired by Stephen Hill’s experiment which visualized the quad occupancy of pixel shaders.

Now that DX11 has given us UAVs in all the other shading stages as well, I decided to try the equivalent for the vertex cache. By “vertex cache”, I mean the post-transform vertex re-use cache: the thing that lets us re-use vertex shading results across duplicated vertices in a mesh. If you’re not familiar with the post-transform cache, see ryg’s blog, which is my one-stop shop for introductions to things I don’t feel like explaining in detail. You might also want to read Hugues Hoppe’s seminal paper on the subject. Of all of Hoppe’s extensive research contributions, this one has probably had the most widespread practical impact.

Using a UAV in a VS, we can use SV_VertexID to do an atomic increment into a buffer containing one counter for each vertex. An atomic increment is necessary here because we don’t actually know what the vertex distribution algorithm is, and a given vert could theoretically be processed in more than one VS thread simultaneously. For that matter, the HW could simply be duplicating all the verts; we won’t know until we’ve looked at the results. This approach gives us a buffer telling us the exact number of times each vert was processed during the draw. From this, we can directly calculate the ACMR (average cache miss ratio) of the mesh: every VS invocation is, by definition, a cache miss, so ACMR is just the total number of invocations divided by the number of triangles drawn. Assuming a deterministic assignment of vertices to hardware threads, we can also figure out the hardware’s caching policy by comparing measured data against a simulation, and searching for a caching algorithm whose per-vertex miss counts match the ones we measure.
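Once the counter buffer has been read back to the CPU, the ACMR computation is trivial. A sketch in Python, purely for illustration (the function name is mine, not from the repo):

```python
def acmr(counts, num_triangles):
    """Average cache miss ratio: VS invocations per triangle.

    counts[v] is the per-vertex counter read back from the UAV,
    i.e. the number of times vertex v was shaded during the draw.
    """
    return sum(counts) / num_triangles

# For an indexed triangle list, a perfect cache shades each unique vertex
# exactly once (counts are all 1); the worst case shades all three corners
# of every triangle, giving an ACMR of 3.0.
```

For a typical well-connected mesh with roughly twice as many triangles as vertices, the best achievable ACMR approaches 0.5; cache-oblivious index orderings push it toward 3.0.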

Most research on vertex cache optimization has sort of waved its hands and assumed a FIFO cache that can hold some arbitrary number of processed verts. I sometimes wonder if the numbers were chosen arbitrarily based on what some hardware architect someplace happened to tell somebody, back in the year 2000 or so, because there is a frustrating lack of documentation about the real caching algorithms used on real chips nowadays.

Using guess-and-check, I was able to figure out that my Intel Haswell GPU iterates all of the indices in order and uses a 128-entry FIFO for vertex re-use. The cache size seems to be fixed at 128 entries regardless of how many vertex elements our shader exports. I’ve tried attribute counts ranging from 3 to 32 and they all result in the same hit counts. Now that I’ve figured this out, maybe they’ll go ahead and document it.
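The simulation half of the guess-and-check is simple. A sketch of the FIFO re-use cache model I was matching against, in illustrative Python (the names and representation here are mine, not from the repo):

```python
from collections import deque

def simulate_fifo(indices, cache_size):
    """Simulate a post-transform FIFO re-use cache over an index buffer.

    Returns per-vertex miss counts: miss_counts[v] is the number of times
    vertex v would be shaded, which is what we compare against the counts
    measured on the GPU.
    """
    cache = deque()          # oldest entry at the left
    miss_counts = {}
    for v in indices:
        if v in cache:
            continue         # hit: shaded result re-used
        miss_counts[v] = miss_counts.get(v, 0) + 1
        cache.append(v)
        if len(cache) > cache_size:
            cache.popleft()  # FIFO eviction: oldest vert falls out
    return miss_counts
```

Running this over the draw’s index buffer with candidate cache sizes and comparing the resulting miss counts against the measured counters is how the 128-entry figure falls out: only one size matches on every vertex.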

Note that this does not mean Intel is wasting a ton of space on the chip. They already allocate space in their URB to store processed vertices which are in flight, waiting to be consumed by the rasterizer, so the re-use cache just stores references to the corresponding vertex URB entries (VUEs). Their open-source documentation says as much. This implies that there’s hardware someplace in the chip that reference-counts all of the available VUEs and tracks how many in-flight triangles depend on each one, so that they don’t get recycled too soon. Either that, or there’s a very limited number of triangles in flight, such that all in-flight verts always remain in the cache, but that doesn’t seem very likely, because it wouldn’t allow for much vertex shading parallelism.
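To make the reference-counting speculation concrete, here is a toy sketch of how such a scheme could work. Everything here is hypothetical (the class, the names, the bookkeeping); it is just the logic the paragraph above is guessing at, not anything from Intel’s docs:

```python
class VuePool:
    """Toy model of reference-counted VUE slots.

    A slot may only be recycled for a newly shaded vertex once no
    in-flight triangle still references it.
    """
    def __init__(self, num_slots):
        self.refcounts = [0] * num_slots

    def triangle_issued(self, slots):
        # A triangle entering the rasterizer pins its three VUEs.
        for s in slots:
            self.refcounts[s] += 1

    def triangle_retired(self, slots):
        # Once the triangle is fully consumed, its VUEs are unpinned.
        for s in slots:
            self.refcounts[s] -= 1

    def can_recycle(self, slot):
        return self.refcounts[slot] == 0
```

Under this model the re-use cache can stay a fixed 128 entries while the number of triangles in flight varies independently, which is consistent with the measurements above.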

All code is here. If anyone gets results from other chips, I’ll post them.