The opinions expressed in this post are my own personal views and are not endorsed, shared, or sanctioned by anybody in particular (especially my employer).
Rich Geldreich has a lot to say about this subject, and I agree with pretty much everything on his list. The present state of OpenGL is incredibly frustrating, and it has caused me to be much more blunt and rhetorical than I might normally be. There are those who think that OpenGL, and not D3D, ought to be the primary API target for PC game development. They believe that OpenGL and D3D are basically the same, and that OpenGL gaming would take off if only we developers were more open-minded. These people are mistaken. OpenGL is a bad investment for anyone with ambitious graphical goals.
Despite being available nearly everywhere, OpenGL is rarely chosen on the one platform that gives us a choice. There are three principal reasons:
OpenGL is highly fragmented across platforms. “Write once, run anywhere” is a myth. Mobile GL, Linux GL, Windows GL, and Mac GL are all different from one another, and offer varying levels of feature support. While the current GL spec is at feature parity with DX11 (even slightly ahead), the lowest-common-denominator implementation is not, and that is the thing that I, as a developer, care about. An advanced spec is of no value if large fractions of the market do not implement it. As of this writing, the lowest-common-denominator feature set for desktop platforms is a restricted subset of GL4. Mobile (ES3) is even further behind, sitting where DirectX was six years ago.
OpenGL driver quality is highly variable, and lags abysmally behind DirectX. This is not hard to understand. DX games are the primary driver for GPU sales, so it is natural that the vendors direct their attention there. It is also, certainly, a solvable problem, but given the dominance of Windows for gaming, the IHVs have little incentive to solve it at present.
These first two reasons are both non-technical, and thus, ultimately, fixable: they could be solved by throwing more resources at the problem. They are merely the result of a lack of interest in OpenGL on the part of ISVs, IHVs, and, consequently, gaming customers. That lack of interest is why problems 1 and 2 remain unsolved, but it is not the real problem; it is a symptom.
The real problem is that OpenGL, as designed, is inferior to its competitors in several very important ways, which I will spend the rest of this post laying out.
My intention here is not to offend or insult, (though I will be terse and sarcastic) nor is it to somehow harm OpenGL. I do not care which API wins, but if OpenGL is to win, it MUST correct its numerous flaws, and in my opinion, this should involve a dramatic redesign at nearly every level. OpenGL has gotten a lot of things right, but its most serious problems are not economic or political, they are technical.
GLSL Is Broken
The GL model, placing the compiler in the driver, is WRONG. It was a worthwhile experiment, one that seemed viable at the time, but history has proven it wrong.
The right model is a more open version of the DirectX model, a standard reference compiler which compiles to a high-level representation, which is then recompiled for a given device.
Driver compilation incurs unnecessary runtime costs. By the time I ship a product, I have already compiled my shaders dozens of times a day during development. I know that my shaders are well-formed and correct. All I need the driver to do is translate them into efficient executable code as quickly as possible. There is no value added by having the driver do semantic analysis. This removes value. The driver should not parse my shader, it should not validate my shader, it should not search for undeclared identifiers or missing semicolons. I have already done that ad nauseam, and if the driver is going to do it again when my game loads, it is going to increase my load times for no particular reason. Yes, I know that drivers can and do cache shaders, but not all of them do, and even if they do, it is still better that they not have to miss the cache thousands of times on first run. The first run of my game is when the user experience is most important, and when the load times are most obvious. Caching, therefore, does not truly help me.
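To make the division of labor concrete, here is a toy Python sketch (every name in it is hypothetical, and the "shaders" are trivial arithmetic expressions) of the model I am describing: a reference compiler does all parsing and validation once, offline, and the "driver" at load time only translates a pre-validated IR.

```python
import json

# Offline "reference compiler": parse + validate once, on the developer's machine.
# Malformed shaders fail here, at build time -- never on the user's machine.
def compile_to_ir(source):
    tokens = source.split()
    if len(tokens) != 3 or tokens[1] not in ("+", "*"):
        raise SyntaxError("malformed shader: " + source)
    return json.dumps({"op": tokens[1], "args": [tokens[0], tokens[2]]})

# "Driver" back end: no parsing, no semantic analysis -- just code generation.
def driver_codegen(ir_blob, env):
    ir = json.loads(ir_blob)
    a, b = (env[x] for x in ir["args"])
    return a + b if ir["op"] == "+" else a * b

ir = compile_to_ir("x * y")                   # build time: dozens of times a day
print(driver_codegen(ir, {"x": 3, "y": 7}))   # load time: cheap, no validation
```

The point of the split is that everything in `compile_to_ir` runs on my workstation, while everything the end user's machine runs is the thin `driver_codegen` half.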
Driver compilation damages the platform by introducing divergence in shader syntax across implementations. It hurts driver quality by sucking up valuable engineering resources on irrelevant tasks. It would be better to have the IHV engineers spending their time improving code generation and compile times than worrying about syntax compliance and error detection.
Someone will mention optimization, and say something like “the driver is in a much better position to optimize, and will do it better.”
There exists a third-party tool which takes GLSL, performs well understood compiler transforms, and spits out other GLSL. This tool exists because there also exist GLSL compilers which are not doing their jobs. Yes, some implementors do a good job, and there’s no reason they couldn’t optimize a standardized high-level IR, SSA graph, or AST just as effectively, and at a considerably lower development cost. If we are forced to rely on individual implementors to fully optimize our shaders, then applications have no protection against poor implementations.
Someone will ask: “then why don’t you just optimize your code, graphics engineer?” Ironically, all of the arguments against an implementation-agnostic compiler apply equally well to implementation-agnostic programmers. My reply is simply: “The compiler is in a better position to optimize, and it will do it better.” Now, if only that were true…
Someone will assert that an IR interferes with the compiler’s optimization ability by removing information from the program. This may have been true of D3D bytecode, but it need not be. The compiler and IR can be designed in such a way as to eliminate information loss. A simple serialization of an AST would accomplish this goal, though there are probably better choices (e.g. SPIR, LunarGlass).
There are also certain optimizations which need to happen, and are fairly time consuming, but which drivers DO NOT need to implement. Dead code elimination and constant folding are the same no matter who does them, and are always profitable. If implementors can improve their compilers by having others do tedious work for them, then they should do so. My workstation is much better at this sort of thing than an end user’s phone.
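As an illustration (a toy sketch over a made-up expression tree, not any real compiler), constant folding and this flavor of dead-code elimination are purely mechanical transforms whose output is identical no matter who runs them, which is exactly why they belong offline:

```python
# Toy expression IR: ("const", n), ("var", name), or (op, lhs, rhs).
def fold(node):
    """Constant-fold a toy expression tree. The result is the same no matter
    who performs the pass, so there is no reason to run it in the end user's
    driver rather than once on the developer's workstation."""
    if node[0] in ("const", "var"):
        return node
    op, l, r = node[0], fold(node[1]), fold(node[2])
    if l[0] == "const" and r[0] == "const":
        return ("const", l[1] + r[1] if op == "+" else l[1] * r[1])
    # x * 0 is dead no matter what x is: a simple dead-code elimination.
    if op == "*" and ("const", 0) in (l, r):
        return ("const", 0)
    return (op, l, r)

expr = ("+", ("*", ("const", 2), ("const", 3)),
             ("*", ("var", "x"), ("const", 0)))
print(fold(expr))  # the whole tree folds to ("const", 6)
```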
Someone will mention GLSL extensions. Irrelevant. Extensions are orthogonal to the compilation model. The IR can be defined in such a way as to make it open and extensible (for instance, by adding new opcodes or data types). Extensions can easily be exposed by adding the relevant syntax to a standard reference frontend. If that doesn’t work, implementors can fork the reference compiler and define an extension to specify shaders using a proprietary IR in addition to the standard one. The paranoid ones can close-source their fork if they really want. Note that nothing in the above would prevent a particular application from embedding a compiler and doing runtime compilation to IR if it so wished. The compiler(s) can, and should, be designed with this use case in mind (another example in which DX has shown us the right way of doing things).
Threading is Broken
The single-threaded nature of current APIs is one of the principal reasons why PC games cannot scale well across multiple cores. We need the ability to freely parallelize our draw submission. We make thousands and thousands of draw calls. We have UI, we have trees and shrubs, we have buildings, we have terrain, we have various particle effects, we have lots of objects with lots of variety. We have multiple passes (cascaded shadows, reflections, Z prepass). We need, yes NEED, an API that allows submission to be scheduled across cores. D3D11 attempted to solve this problem with mixed results. OpenGL has not even bothered to try.
Somebody will mention multiple GL contexts. This person does not understand what I am saying. By design, we cannot use multiple contexts to simultaneously submit rendering commands destined for the same render target, and that is what I really want to do. Yes, I need to order them somehow, but that is my problem, not GL’s. In many cases, the draw order is largely irrelevant, and there is much more efficiency to be gained by threading over batches (of which there are thousands) rather than passes (of which there are perhaps dozens).
OpenGL is not designed for this kind of rendering architecture, and it needs to be.
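The architecture I am describing can be sketched in a few lines of toy Python (thread pool standing in for worker cores, tuples standing in for encoded commands; all names hypothetical): recording is the expensive part and runs in parallel per batch, while final submission stays ordered and cheap, because ordering is the application's job.

```python
from concurrent.futures import ThreadPoolExecutor

def record_commands(batch_id):
    """Record a command list for one batch. In a real engine this is the
    expensive part (state sorting, constant updates, draw encoding), which
    is why we want it spread across cores."""
    return [("bind_material", batch_id), ("draw", batch_id)]

batches = range(8)  # thousands, in a real scene
with ThreadPoolExecutor(max_workers=4) as pool:
    # Recording runs in parallel; map() preserves batch order, so the
    # application (not the API) decides the final submission order.
    command_lists = list(pool.map(record_commands, batches))

# Single, cheap, ordered submission to the device queue.
frame = [cmd for cl in command_lists for cmd in cl]
print(len(frame))  # 16 commands, recorded across 4 threads
```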
OpenGL also makes it extremely difficult to do asynchronous resource creation. Threaded resource creation is straightforward in D3D. The relevant calls on the device interface are free-threaded. In OpenGL, this is only possible through an elaborate multi-context dance. OpenGL needs a standardized, consistent way to perform asynchronous resource management.
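With free-threaded creation calls, the loading architecture reduces to something like this hypothetical sketch (a queue of finished resources standing in for real device objects), with no context juggling anywhere:

```python
import queue
import threading
import time

finished = queue.Queue()

def load_texture(name):
    """Loader thread: decode and create the resource without ever touching
    the render thread. Free-threaded creation calls make this trivial."""
    time.sleep(0.01)  # stand-in for file decode + upload work
    finished.put({"name": name, "resident": True})

names = ["grass", "rock", "sky"]
workers = [threading.Thread(target=load_texture, args=(n,)) for n in names]
for w in workers:
    w.start()

# The render thread keeps drawing; it simply picks up resources as they land.
loaded = [finished.get() for _ in names]
for w in workers:
    w.join()
print(sorted(t["name"] for t in loaded))
```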
Texture And Sampler State Are Orthogonal
Nearly every DX shader I ever write looks something like this:
```hlsl
sampler   sDefault;
Texture2D tColorMap;
Texture2D tNormalMap;
Texture2D tSpecMap;
Texture2D tEnvironmentMap;
// ...
tColorMap.Sample( sDefault, uv );
tNormalMap.Sample( sDefault, uv );
tSpecMap.Sample( sDefault, uv );
tEnvironmentMap.Sample( sDefault, R );
```
In an entire application I often see less than 16 unique sampler states. It is possible to bind the same small set of sampler states to the pipeline and leave them alone for all eternity. This allows for a significant reduction in state change cost. It also makes it much cheaper to sample the same texture using multiple sampler states in the same pass (the texture can be bound once).
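A back-of-the-envelope toy sketch (hypothetical numbers, simply counting API-level state changes) shows where the savings come from when sampler state is decoupled:

```python
# Count state changes for a pass sampling 100 textures with one shared sampler.
textures = ["tex%d" % i for i in range(100)]

# Coupled model (classic GL): sampler state rides along with every texture,
# so changing the texture implies re-establishing its sampler state too.
coupled_changes = 0
for tex in textures:
    coupled_changes += 1  # bind texture
    coupled_changes += 1  # (re)apply its baked-in sampler state

# Decoupled model (D3D10+ style): bind the one shared sampler, then leave
# it alone for all eternity.
decoupled_changes = 1
for tex in textures:
    decoupled_changes += 1  # bind texture only

print(coupled_changes, decoupled_changes)  # 200 vs 101
```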
I have been told that some people’s hardware is slightly more efficient this way, but these people do not seem to have any trouble implementing DirectX. If they are that concerned about this, then they can and should change their hardware. Let’s take a look at the GPU ISAs for which we actually have documentation:
The AMD GCN ISA is publicly available here. If we examine the ISA and think about how the API would map onto it, it is easy to see that the OpenGL model requires more loads whenever the number of sampler states is less than the number of textures. In my experience this is basically all the time.
The relevant Intel docs for Haswell are publicly available here and here. It is much more difficult to navigate these docs (sorry guys), but eventually you will see that it’s basically a wash. The GL model would seemingly require more URB entries to be prefetched, which probably incurs some sort of cost, but it’s hard to tell how severe.
Nvidia does not publish their actual ISA (sadly), but given that they support both modes in PTX, it seems that it’s not that big a deal to them either.
UPDATE: Correction. Turns out the ISA is there, it’s just hard to find.
It has been 7 years since DX10 introduced this good idea, and GLSL still stubbornly requires the sampler state to be coupled to the texture state for no discernible reason. This adds API overhead, by forcing us to re-apply sampler state whenever we change the texture unit assignments. It is likely less efficient for a variety of contemporary GPUs, and it makes it very difficult to port contemporary HLSL to/from GLSL.
Too Many Ways to Do The Same Thing
In GL 4.4, there are two sanctioned ways to set up shaders. One is to use a program object. The other is to use a program pipeline object and attach shader stages piecemeal.
There are at least two sanctioned ways to configure the vertex stream. We can use glVertexAttribPointer and the ARRAY_BUFFER binding, or glVertexAttribFormat and glBindVertexBuffer.
There are two sanctioned ways to set up samplers. One is to use a sampler object. The other is to use the implicit sampler state that comes with every single texture object (and is set using glTexParameterXXX).
There are two sanctioned ways to create a texture. The right way (glTexStorageXXX) and the clunky old-school way (glTexImageXXX for each mip).
This redundancy is bad, because the more ways there are to specify state:
- The more confusing it is.
- The more room there is for drivers to get them wrong.
- The less efficient we are at deciding what the heck the state should be.
Let’s work through #3 in more detail. Consider the case of texture creation.
Say we do:
```c
glGenTextures( 1, &n );
glBindTexture( GL_TEXTURE_2D, n );
for( i = 0; i < mip_count; i++ )
    glTexImage2D( GL_TEXTURE_2D, i, ... );
// draw with the texture
```
When we draw with the texture, OpenGL specifies that we do a ‘completeness’ check, to make sure that we get a black texture if we screwed up.
UPDATE: Correction. Incompleteness is undefined behavior in 4.4 unless robust buffer access is enabled at context creation. Not sure what the implications are of an implementation supporting both.
Now suppose we did this:
```c
glGenTextures( 1, &n );
glBindTexture( GL_TEXTURE_2D, n );
glTexStorage2D( GL_TEXTURE_2D, mip_count, ... );
for( i = 0; i < mip_count; i++ )
    glTexSubImage2D( GL_TEXTURE_2D, i, ... );
// draw with the texture
```
We did the right thing, and our texture cannot possibly be incomplete, but OpenGL does not know at draw time whether the name we gave it was allocated with glTexStorage2D. As a result, it will always execute the moral equivalent of if(is_complete) for every bind (it might be if(immutable), but it’s still a redundant branch).
The path we shouldn’t use impedes the performance of the one we should.
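A toy sketch of the situation (hypothetical classes, not driver code) makes the tax visible: because any name might refer to a legacy, possibly-incomplete texture, the bind path has to branch even for textures that were created correctly.

```python
class Texture:
    def __init__(self, immutable, mips_uploaded, mips_expected):
        self.immutable = immutable
        self.complete = mips_uploaded == mips_expected

def bind(tex):
    """Per-bind work the driver cannot avoid while both creation paths
    coexist: any name *might* be a legacy, possibly-incomplete texture,
    so every bind pays for the branch."""
    if not tex.immutable and not tex.complete:  # the redundant branch
        return "black_texture"                  # incompleteness fallback
    return "sample_normally"

good = Texture(immutable=True,  mips_uploaded=10, mips_expected=10)
bad  = Texture(immutable=False, mips_uploaded=3,  mips_expected=10)
print(bind(good), bind(bad))
```

If the legacy path did not exist, `bind` could drop the branch entirely, since an immutable-storage texture is complete by construction.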
This brings us to our next point….
OpenGL’s Error Handling is Wrong
The OpenGL spec requires that nearly every API call must validate itself and set some state so that ‘glGetError’ will return appropriately. Implementations must do a good deal of tedious validation work in order to ensure conformance. Apart from bribing driver engineers, there is no way to get rid of this overhead. Every single OpenGL call is going to perform one or more conditionals in order to validate its input.
Yes, I know we have branch prediction, and yes, they predict well, but I’m executing hundreds of thousands of them. The branches still burn ICache space and consume execution resources. The BTB is only so large, and I’ve got enough branches in the renderer and driver already without every single API call adding a few of its own just in case I happen to screw up. By the time my game ships, I will not be screwing up, but OpenGL will still be limiting my performance by design.
And then there’s texture completeness. Need I say more about texture completeness? We can design that monstrosity away just by stripping glTexImage2D from the API. A thorough pruning will make the API smaller, more robust, and more efficient. It should be completely refactored to remove as many potential error conditions as possible. Those which remain should result in undefined behavior and should be detectable by an optional validation layer.
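The validation-layer design I am suggesting can be sketched in a few hypothetical lines (toy device objects, not any real API): the shipping path does zero checking, and the checks exist only in an optional wrapper that developers enable while debugging.

```python
class FastDevice:
    """Release-mode API: no validation; misuse is undefined behavior."""
    def create_texture(self, width, height):
        return {"w": width, "h": height}

class ValidatingDevice(FastDevice):
    """Optional debug layer: the same calls, wrapped with the checks that
    today's GL drivers are forced to run for every user, every call."""
    def create_texture(self, width, height):
        assert width > 0 and height > 0, "invalid texture dimensions"
        return super().create_texture(width, height)

def make_device(debug=False):
    return ValidatingDevice() if debug else FastDevice()

dev = make_device(debug=True)    # during development: checks on
tex = dev.create_texture(256, 256)
print(tex)
# make_device(debug=False) ships with zero validation overhead.
```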
There Are Too Many Small Inefficiencies
There are quite a few small inefficiencies in OpenGL which are going to render its single-threaded performance inferior to that of up-and-coming APIs. I’ve touched on some of them here, but I’m running long, so I intend to devote a followup post to this subject.
The short version is this: The API is littered with small inefficiencies and flaws. These flaws are due to a design philosophy which incorrectly emphasizes compatibility, tradition, and ease of use over implementation efficiency. These things might be tolerable if we had the ability to scale across cores, but we do not, and even if we did, we would still struggle to achieve a batch throughput anywhere near what DX12/Mantle will give us.
Somebody will point to this and suggest that we use instancing, or multi-draw + uber-shader, or texture arrays, or some combination thereof. All of these things assume a very specific software architecture, one in which the principal design point is to avoid using the API. Too many shader switches? Use an uber-shader. Too many texture swaps? Use bindless or arrays. Still too slow? Sort everything and use instancing/multi-draw. This is all that OpenGL can offer at present, and the presenters do a good job laying it out. My contention, however, is that this is insufficient. It is folly to call the API efficient if the only way to be efficient is to avoid using it. Much of the software state change cost can and should be eliminated by re-designing the API. Mantle has proved this principle. DX12 will soon set it in stone.
They Can Fix It
Compatibility and ease of use are both worthwhile goals if one is writing a high level graphics toolkit, but as I have written elsewhere, that is not what OpenGL really is. A graphics API is not for doing graphics, it is for abstracting GPUs. Graphics is done at a higher level. OpenGL, on many platforms, is the only means of accessing the GPU, and as such, it must get better at providing this essential service. OpenGL must stop trying to fill the high-level and low-level niches simultaneously, because in its present form it is not very good at either one.
OpenGL has a lot of good qualities. Its program object abstraction is a better model than the separate shader objects from D3D. Its extension mechanism makes it the platform of choice for prototyping GPU features. It has occasionally exposed useful features which D3D lacks. Despite its advantages, it has played second fiddle to D3D for over a decade. The reason is that Microsoft has consistently and proactively improved on D3D, and is even now in the process of redesigning it from scratch, yet again. If Khronos and the OpenGL platform holders wish to become serious competitors in the high-end gaming space, they must be willing to do likewise.
OpenGL must be augmented by a new industry standard which is lean, clean, modern, and performance-oriented. Luckily, we don’t have to look very far.