OpenGL, Followup

Lots of traffic.  Aras has compiled a nice collection of recent posts.  Michael has shared some war stories about shader compilation, which I take as a welcome sign that I’m not completely nuts….

I want to thank Timothy Lottes for his very cordial direct reply, and follow up on some points he raised.  Italicized text is from his post.

Driver quality

“This same process could work equally well for GL.” It could indeed, it just takes investment. As I said, a solvable problem.

Agreed, certainly a solvable problem, but the chicken-and-egg problem has to be solved first.

Shader compilation

(a.) DX byte code lacks support for features in {AMD,Intel,NVIDIA} GPUs.
(b.) GLSL supports those features through extensions.
(c.) DX has no built-in extension support.
(d.) DX byte code is a “vector” ISA, and modern GPUs are “scalar”.
(e.) DX compile step removes important information for the JIT optimization step.
(f.) GLSL maintains all information useful for JIT optimization.
(g.) DX does not require parsing.
(h.) GLSL offline compiled to minimal GLSL does not require expensive parsing.

I agree with the DX/GL breakdown.  To be clear, I’m not advocating D3D byte code as the solution; D3D’s IR has many problems.  It’s important that we separate the D3D bytecode’s drawbacks from bytecodes in general.  GL has an opportunity here to apply all the lessons learned from the DX experience, and I think it would be a huge boon to the industry if we could get this done (maybe even get GL and DX using the same bytecode eventually).
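
In the meantime, the closest thing core GL has to an offline representation is ARB_get_program_binary.  Here is a rough sketch of caching a linked program with it, assuming a GL 4.1+ context with a function loader already set up, and with error handling trimmed:

    // Sketch: cache a linked program's driver-specific binary so later runs can
    // skip the GLSL compile. Set glProgramParameteri(prog,
    // GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE) before linking so the save
    // path works reliably.
    #include <vector>

    void SaveProgramBinary(GLuint prog, std::vector<char>& blob, GLenum& format)
    {
        GLint len = 0;
        glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &len);
        blob.resize(len);
        glGetProgramBinary(prog, len, nullptr, &format, blob.data());
    }

    GLuint LoadProgramBinary(const std::vector<char>& blob, GLenum format)
    {
        GLuint prog = glCreateProgram();
        glProgramBinary(prog, format, blob.data(), (GLsizei)blob.size());
        GLint ok = 0;
        glGetProgramiv(prog, GL_LINK_STATUS, &ok);
        if (!ok) { glDeleteProgram(prog); return 0; }  // stale binary: fall back to source
        return prog;
    }

The binary is not portable across drivers or GPU generations, which is exactly why it isn’t a substitute for a standard IR, but it does illustrate the load path a real one would need.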

I’m planning to write about bindless separately, so I’ll defer the whole sampler/texture discussion for now.

Threading

(f.) Vertex attribute fetch. Switch to explicit manual attribute fetch in the vertex shader. No more binding fixed function vertex buffers. Again can use another thread to write bindless handles to vertex data into a constant buffer.

I don’t think we can get away from VAOs. There is a sizable hardware base out there that still has special-purpose hardware for “vertex buffers”. I wish they didn’t, but their architects insist it’s more efficient. We can apply the ‘giant buffer’ idea here too, but really it would be more like several giant buffers for different vertex formats and/or assets, and there’s always the possibility that they might end up being interleaved in the draw stream.
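
For reference, the manual-fetch approach Timothy is describing looks roughly like this with a plain SSBO rather than bindless pointers.  This is a sketch only; the Vertex layout is made up, and in core profile you still need an empty VAO bound:

    // Vertex shader pulls its own attributes from an SSBO keyed on gl_VertexID;
    // no glVertexAttribPointer, no fixed-function vertex fetch (GL 4.3+).
    static const char* kManualFetchVS = R"GLSL(
        #version 430
        struct Vertex { vec4 position; vec4 normal; vec2 uv; vec2 pad; };
        layout(std430, binding = 0) readonly buffer VertexData { Vertex verts[]; };
        uniform mat4 u_viewProj;
        out vec2 v_uv;
        void main()
        {
            Vertex v    = verts[gl_VertexID];   // replaces the attribute inputs
            v_uv        = v.uv;
            gl_Position = u_viewProj * v.position;
        }
    )GLSL";

    // Draw side: bind a buffer, not a vertex format.
    glBindVertexArray(emptyVao);   // core profile still insists on *a* VAO being bound
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, vertexSsbo);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);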

(i.) The majority of what is left is the following sequence {change shader(s), bind constant buffer(s), draw} when a material changes, or {bind constant buffer(s), draw} when drawing a new mesh for the same material. The binding of constant buffers uses the same buffer each time with a different offset, a case which is trivial for a driver to optimize for. … I highly doubt these 50K draws/frame at 33ms cases involve 50K shader changes, because the GPU would be bound on context changes. Instead it is a lot of meshes, so that version of multi-draw would cover this case well.

Constant changes are definitely the most important problem. There is at least one constant rebase per draw. I’d love to get a fast path for this that didn’t require tailoring the shaders to it. Shader changes are going to be less frequent but still important.  The GPU rate is probably higher than the API rate right now, and the hardware won’t be improved until the software gets out of the way.
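
For concreteness, the “same buffer, different offset” pattern in question is one glBindBufferRange per draw into a big UBO, something like the sketch below.  DrawConstants and drawList are made up; the alignment query is the real constraint:

    // One large constant buffer, rebased per draw by offset only.
    GLint align = 256;
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
    const GLsizeiptr stride =
        (sizeof(DrawConstants) + align - 1) & ~(GLsizeiptr)(align - 1);

    for (size_t i = 0; i < drawList.size(); ++i)
    {
        // Same buffer object every time; only the offset changes.
        glBindBufferRange(GL_UNIFORM_BUFFER, 0, constantUbo,
                          i * stride, sizeof(DrawConstants));
        glDrawElements(GL_TRIANGLES, drawList[i].indexCount, GL_UNSIGNED_INT,
                       reinterpret_cast<const void*>(drawList[i].firstIndexByte));
    }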

“So with the exception of (i.), GL already supports “threading”, and note there is no suggestion of using the existing multi-draw or uber-shaders.”

Kind of, but not really, since (i.) is where most of the actual cost comes from (apart from moving all the constant data, which I agree is really not an API problem anymore).

“Another issue to think about with “threading” is the possibility of a latency difference between single threaded issue, and parallel generation plus kick. In the single thread issue case, the driver in theory can kick often simply by writing the next memory address which the GPU front-end is set to block on if it runs out of work. Parallel generation likely involves batching commands then syncing CPU threads, then kicking off chunks of commands. This could in theory have higher latency if the single thread approach was able to saturate the GPU frontend on its own (think in terms of granularity before a kick).”

This is good food for thought.  I don’t see why parallel generation couldn’t kick earlier, maybe with some HW changes. Multiple threads can kick independent draws freely, using barriers to deal with draw-order constraints. Or we could appoint one thread as the ‘kicker’ and have it coordinate the submits from the other threads (and build CBs itself when it’s idle).
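
To make the ‘kicker’ idea a bit more concrete, the shape of it is below, with the caveat that none of this maps onto current GL; CommandChunk and SubmitToGpu are placeholders for an API we don’t have yet, and draw-order constraints are elided:

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct CommandChunk { std::vector<unsigned char> bytes; };   // placeholder

    static std::mutex               g_mutex;
    static std::condition_variable  g_cv;
    static std::deque<CommandChunk> g_ready;      // chunks ready to kick
    static bool                     g_frameDone = false;

    void SubmitToGpu(const CommandChunk&) { /* write into whatever the GPU front-end polls */ }

    // Worker threads call this as soon as a chunk is built, instead of waiting for end of frame.
    void PublishChunk(CommandChunk chunk)
    {
        { std::lock_guard<std::mutex> lock(g_mutex); g_ready.push_back(std::move(chunk)); }
        g_cv.notify_one();
    }

    // One thread owns submission and keeps the front-end fed while the others keep building.
    void KickerThread()
    {
        for (;;)
        {
            std::unique_lock<std::mutex> lock(g_mutex);
            g_cv.wait(lock, [] { return !g_ready.empty() || g_frameDone; });
            if (g_ready.empty() && g_frameDone) break;
            CommandChunk chunk = std::move(g_ready.front());
            g_ready.pop_front();
            lock.unlock();
            SubmitToGpu(chunk);
        }
    }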

I have vaguely similar concerns about multi-draw, because it means going wide to build these multi-draw lists, syncing, and then sending them to GL serially to build the CB. Compared to ye olde “state sort and draw”, all of the state changes will still be there with multi-draw. We have a similar serial cost to turn our multi-draw lists into a command buffer, but with the added cost of building those lists and bucketing our batches into multi-draws.  This sounds a lot like the way DX11 Deferred Contexts ended up behaving, and while this will have lower constant factors, it didn’t scale before (see here, top bar graph).
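
For anyone who hasn’t used it, the multi-draw path under discussion is basically glMultiDrawElementsIndirect over a packed command buffer per state bucket; building and bucketing those commands is exactly the extra CPU work I’m worried about.  A sketch (the bucket fields are made up, the command layout is fixed by the GL 4.3 spec):

    struct DrawElementsIndirectCommand
    {
        GLuint count;
        GLuint instanceCount;
        GLuint firstIndex;
        GLuint baseVertex;
        GLuint baseInstance;
    };

    // cmds was filled with one entry per mesh sharing this bucket's program/state.
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, bucket.indirectBuffer);
    glBufferSubData(GL_DRAW_INDIRECT_BUFFER, 0,
                    (GLsizeiptr)(cmds.size() * sizeof(cmds[0])), cmds.data());

    glUseProgram(bucket.program);      // one state setup for the whole bucket
    glBindVertexArray(bucket.vao);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr,                    // offset 0 into the indirect buffer
                                (GLsizei)cmds.size(), 0);   // tightly packed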

“If there is possibility to complete a given amount of work on one thread in the same wall clock time it takes another API to complete in many threads, the best answer might be to stick to one thread and do the work efficient approach.”

Possibly, but then either we or the driver wants to buffer things and overlap the submit with the next frame’s simulation.  That doesn’t work out if the submission time is higher than the time it takes to simulate the next frame on the remaining cores. Above that line, we can only improve performance by going wide on submission. At the moment we don’t have that option, which means that core-rich CPUs or lightweight simulations end up bottlenecked on submission.
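
To put made-up numbers on that: if serial submission costs 10 ms and the next frame’s simulation takes 20 ms on the remaining cores, overlapping hides the submit entirely; if the simulation only takes 6 ms on those cores, the frame is gated by the 10 ms submit, and the only way to claw that back is to spread the submit itself across cores.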