All views expressed herein are my own personal opinions and are not shared or sanctioned by anybody in particular (especially my employer).
I was going to kick off my blog by writing about low level APIs, and then Dan Baker beat me to it. Go and read his post first.
It is a very interesting time in the real-time graphics industry. For a very long time, it has been accepted truth that the “batch” (defined as any number of state changes plus a draw) is inherently expensive, on the PC at least. A large body of real-time rendering tradecraft has been built up around the need to conserve drawcalls. Let’s think through some examples:
The sole purpose of instancing is to make it cheaper to draw a large number of copies of the exact same thing. Instancing is a logical thing to do if the draw stream has enough coherence in it, but it is not cheap to create such coherence where it does not exist. If alpha blending is involved, it may well be impossible. What’s more, instancing helps us push lots of objects, but it does not enable us to have much variety. The more distinct objects we add to our scene, the less relevant instancing becomes. There is a tradeoff here between CPU performance and flexibility.
Artists are forced/encouraged to cram things onto as few texture pages as possible. Why do we do this? One of the first things that comes to mind is that texture swaps are expensive, and one big page incurs fewer swaps than a bunch of smaller ones. There is a tradeoff here between CPU performance and artist productivity. As Dan rightly pointed out during the Mantle announcement, API overhead is so high that it is affecting the content creation process, and this is a symptom that something is wrong. It’s one thing if artists do this to make themselves more productive when painting textures, but it’s quite another thing if they’re doing it for performance. Artists should not be thinking about performance, for the same reason engineers should not do look development. It isn’t their job, and consequently they tend to suck at it.
Texture arrays certainly have other uses, but avoiding state swaps is one of their bread-and-butter applications. Using texture arrays to save state changes implies constraints on the assets: all textures in an array must have consistent dimensions. This is sometimes easy to arrange, and sometimes not. There is a tradeoff here between CPU performance and flexibility.
One of the stereotypical use cases for geometry shader cloning is single-pass render to cubemap, or single-pass render to a set of cascaded shadow maps. The reason for doing so is to avoid API overhead. The drawback is that every primitive which touches any render target must be processed for every render target. Each of the view-projection transforms must still be done, and every primitive is either sent to the rasterizer, or perhaps subjected to a primitive-level culling test in the shader. In contrast, a DX9-style system can use cheap object-level culling tests to reject thousands of invisible primitives at a time. There is a tradeoff here between CPU and GPU performance.
One reason we use uber-shaders is to avoid shader permutation hell, but another is to avoid swapping shaders, because shader swaps are expensive. Uber-shaders are a great way to make one’s shaders easier to maintain, but there are two important drawbacks. The first is that, contrary to popular belief, static (coherent) flow control is not completely free. The second is that on many GPUs, code that is never executed will still incur a cost, in the form of register pressure. The shader compiler must reserve enough registers to cover the worst case, and this means that an uber-shader is more vulnerable to memory latency than a specialized one. There is a trade-off here between API overhead, shader maintenance, and CPU/GPU performance.
There are two reasons we do state sorting:
- State changes harm GPU performance, by introducing pipeline bubbles
- State changes harm CPU performance, by adding API and driver work
Think about that second point for a second. We are doing CPU work in order to avoid doing CPU work. What if we could stop? We might still need to sort to some extent because of point #1, but perhaps we could be less aggressive about it.
Working Around the Problem
You should have noticed a theme. Every one of these things is, in some respect, a response to the problem of limited batch throughput. If we do enough of these tricks, then we can squeeze enough performance out of contemporary APIs, but they all incur some other cost in GPU performance and/or developer productivity. The conventional wisdom has been that this is just the way it is, and that a limited draw call budget needs to be a primary design consideration for a graphics engine.
The reason I am such a fan of AMD’s Mantle is that it has demonstrated, conclusively, that it does not have to be this way. Mantle has turned conventional wisdom on its head. It has shown us that if API overhead is a problem for applications, then it is the API, and not the application, which can be redesigned. It is impossible for me to exaggerate how important Mantle is. It is a paradigm shift. It is a disruptive technical development which will re-write the conventional wisdom in our field and alter the dynamics of the market. To plagiarize myself: “It will re-write portions of the real-time rendering book. It will change the design of future APIs and engines and greatly enhance their capabilities.” It has already altered the course of real-time graphics history, and I don’t think it’s done yet.
What does “Low Level” mean?
Mantle has ushered in the age of the Low Level API. This will be an important inflection point in graphics history. The changes that will occur are dramatic, almost as dramatic as the shift from fixed function to programmable shading.
In my mind, here is what it takes for an API to be considered “low level”:
- Must offer control over memory placement, with minimal restrictions. Applications must be able to define coarse-grained memory spaces and manually position resources in them, subject to reasonable constraints.
- Must present an asynchronous command buffer model for state changes and draw submission
- Must allow fully parallel command buffer construction
- Must place synchronization and hazard tracking under application control
- Must do only as much work as is necessary to abstract the device.
Mantle is the only shipping graphics API which meets all of these criteria. D3D12 should come very close. OpenGL is way behind.
The reason that “batches” have historically been so expensive is because of the way that APIs have historically been designed. Ease of use, and the hiding of awkward details, has always been a key concern. The implementations have been made very complicated, in part so that applications can be easier to write. The lower parts of the stack, the kernel, runtime, and driver, have been doing a great deal of tedious work in order to present application developers with a relatively simple interface.
Consider what happens when you want to do something as simple as bind a texture to a shader. The system must do memory management (is all the data in video memory? Did I page it out? Do I need to put it back?), synchronization (are there pending draws to this texture? Are they done yet?), and validation (is this the right kind of texture for the shader? Is this texture currently bound for output?).
Somebody, somewhere, needs to do all of these things, and because the runtime/driver do not know what the draw stream looks like, they have to be conservative. All of the above need to happen all of the time, and all of these small redundancies add up quickly.
Low level APIs mean that you get to do all of this yourself. For our example of binding a texture, you will have to issue a barrier to make sure that the texels have made it to the GPU. If you’ve rendered to it, you need to tell the hardware to synchronize before using it. If you want to change it, you’ll need to make sure the GPU’s not using it, and if it is, you’ll need to decide what to do about it. And if you screw up, you’ll get whatever insane results you get, up to and including a hardware hang. The driver and API are hanging up their spurs and handing the reins to you. You are in control. You have all the power, and all the responsibility.
The good news is that low level APIs are going to empower game developers. We know things that the driver doesn’t. We know how often each resource is used. We know which resources are used together. We know exactly when our rendering passes will start and when they will end. We know when we need to synchronize and when we don’t. We have a much more complete picture than the driver does, and by handing the reins to us, the driver’s job is going to become a lot easier. This simplicity will result in significantly faster drivers. It will also result in far fewer driver bugs.
The bad news is that low level APIs are going to make graphics programming harder. Things that used to be driver bugs will become application bugs. Graphics engines will be even trickier to wrap one’s head around. If you think that OpenGL and D3D are too complicated as they are, then you will have a very difficult time keeping up. It will be considerably more difficult to write a quick, simple application with a low level API than with a high level one. It will be easier than ever to bring down the GPU.
I’m not entirely insane
You might think I’m crazy for advocating something that will make my life more difficult. I get a lot of skepticism on this point, but some of the skepticism is rooted in a misunderstanding about what an API is for. In the world of rendering engines, the job of the graphics API is not to simplify application development. The job of the graphics API is to provide access to the GPU, and to present a uniform feature set across current and future devices. The API does not need to make GPU programming easy, it just needs to make it POSSIBLE, and to stay the heck out of the way as it does so.
Many people that I encounter are under the mistaken impression that game developers use the graphics API to write graphics applications. Game developers who successfully target multiple platforms do not use the graphics API to write graphics applications. That is to say, we do not use the graphics Application Programming Interface to write graphics applications. If we use it at all, then we use it to write our own graphics API, which we then use to write graphics applications.
In every multi-platform codebase that I have ever worked in, the API is hidden behind an abstraction layer of some sort. We do to OpenGL what OpenGL does to the hardware. The API-level rendering code is a small fraction of the codebase, it is written once, and it rarely changes once it becomes stable. The interesting code, the part that implements the actual games and graphics algorithms, is always written against the internal abstraction so that it can be portable. It is only this internal abstraction which needs to be easy to use. In many cases, the abstraction is much easier to use than the underlying API, because an individual engine is able to make simplifying assumptions. When you look at things in this light, the benefits of a low level API become obvious.
Adding a new graphics API to a properly designed engine is a known quantity. The cost is mostly paid up front. It is not trivial, but it is not terribly expensive either. Once this cost has been paid, the maintenance cost is far lower, and the resulting code can be reused across a number of products. This re-use makes it cost-effective to use a low level API to ensure the best possible performance. If the cost can be amortized over multiple products, then it even makes sense to drop down to vendor-specific APIs, provided that such things are available and well supported. We do not need to speculate about whether ISVs would target the vendor-specific interfaces. They are already doing so.
But what about the beginner? Or the hobbyist? Or the indie? What about people who aren’t building full engines? The answer is that low level APIs simply do not work for this audience, nor should they. I expect we’ll see something like what Timothy Lottes calls for. We are going to see the graphics stack split into two levels. On one level, things will look like they always did, but on another level, there will be a race to the metal. The best performing paths are going to be low level. They will have to be, in order to compete with Mantle. Middleware, engines, OSes, and the open-source community will then come along and layer any number of slow, user-friendly abstractions on top of them. High level graphics will still exist, but maximum performance will only be possible at the low level. It is a very good time to be a professional engine builder.
On Vendor Specificity
Mantle gets a lot of flak for being “single-vendor”. It does not have to stay this way, and AMD is clearly willing to standardize it. There is NOTHING in its design that would prevent this. The hurdles involved are political, not technical. While I would obviously prefer one standardized API, I would be willing to use more than one, if it meant that my customers could achieve a significantly better experience. Vendor specific APIs have been done before and are worth at least a thought experiment.
Suppose that each IHV developed and maintained ONE interface for its products. Suppose that IHVs did nothing except interfacing with the kernel and maintaining their own cross-OS API for user mode code. Suppose that OSVs, ISVs, and the open-source world took responsibility for layering a standardized API on top of the vendor-specific ones, by defining standard top-level stacks with standard intermediate commands. This kind of architecture is not unreasonable. It is used here, here, and here. The thing that’s missing is a stable, accessible interface to the bottom pieces.
Suppose that the high level stacks were redesigned to be “piled on” instead of “plugged into”. Suppose that OpenGL were, in fact, a library instead of a core OS component. This arrangement would benefit everybody. IHVs win by having to maintain a thinner stack. They get to implement one user mode API per platform, and one or more thin translation layers which are completely reusable across new devices. They get to expose whatever extensions they like at their level without asking anybody for permission, and the extensions would be more easily ported across platforms. OSVs win by making their platforms more appealing to gaming customers. ISVs win by being able to write better games.
I’m sure there are a lot of things here that are not as easy as I make them out to be, and I’m sure it will take a long time to get to where I’m imagining, but in the absence of a single, standard, low-level API, it’s not a bad plan B. It defies the conventional wisdom, but the conventional wisdom, as we have seen, is subject to change.