Staring Into The Matrix

This is another educational post.  The target audience is anybody who has sat through an undergraduate graphics course, but still finds transform matrices to be unintuitive and mysterious. Or, anybody who feels like reading my lame movie puns.

To many aspiring graphics people, transform matrices are little more than mysterious, arcane sorcery. There is a vague notion that matrices can be used to convert between coordinate systems, but the effect is sort of a black box. In this post, I will attempt to tease out the intuition.

Unfortunately, nobody can be “told” what the matrix is. They have to see it for themselves.

You’re Here Because You Know Something. What You Know You Can’t Explain…

Let’s begin by going back to the very basics. Consider the familiar Cartesian coordinate system. The coordinates of a point in space tell us how far we have to travel along each of the coordinate axes in order to get from the origin to that point. Simple and intuitive. You already know this part.

We can write a rather redundant vector equation to describe this:

$P = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \end{bmatrix}+\begin{bmatrix}1 \\ 0 \\ 0 \\ \end{bmatrix} x+\begin{bmatrix}0 \\ 1 \\ 0 \\ \end{bmatrix} y+\begin{bmatrix}0 \\ 0 \\ 1 \\ \end{bmatrix} z$

Now, let’s generalize it. A coordinate system (or “space”) can be described by an origin and a set of basis vectors. And a given point in that “space” is defined in terms of its relationship to these. Now our vector equation looks like this:

$P(x,y,z) = \mathbf{O} + \mathbf{X}x + \mathbf{Y}y + \mathbf{Z}z$

In our earlier example, we used this coordinate system:

$\mathbf{O} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \quad \mathbf{X} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad \mathbf{Y} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \quad \mathbf{Z} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$

We need a name for this coordinate system, so let’s call it “space.” This is the coordinate system that we use when we aren’t thinking about which coordinate system we use. It is the “default”, the most fundamental of coordinate systems. However, space is more complex than it appears. There are an infinite number of other coordinate systems embedded in “space”, each of which is described by a different assignment of values to O,X,Y,Z.

If we want to, we can take any point in “space” and express it in terms of some “higher” coordinate system by mapping O, X, Y, and Z into the higher system, and then re-evaluating P. Now we have two “spaces”, so we need to distinguish them by calling the first one “object space”, and the second one “world space”. The function P is now a transformation from object space to world space.
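To make that concrete, here is a minimal Python sketch of the function P. The name `local_to_world` and the example frame are my own inventions for illustration; all it does is evaluate O + Xx + Yy + Zz one component at a time.

```python
def local_to_world(origin, x_axis, y_axis, z_axis, p):
    """Evaluate P(x, y, z) = O + X*x + Y*y + Z*z, component by component."""
    x, y, z = p
    return tuple(
        o + xa * x + ya * y + za * z
        for o, xa, ya, za in zip(origin, x_axis, y_axis, z_axis)
    )

# Hypothetical object frame, expressed in world space:
# origin moved to (10, 0, 0), X and Y axes turned 90 degrees about Z.
O = (10.0, 0.0, 0.0)
X = (0.0, 1.0, 0.0)
Y = (-1.0, 0.0, 0.0)
Z = (0.0, 0.0, 1.0)

# Object-space point (2, 3, 4) expressed in world space.
world = local_to_world(O, X, Y, Z, (2.0, 3.0, 4.0))
print(world)  # (7.0, 2.0, 4.0)
```

Plugging in the “default” frame (zero origin, unit axes) hands the point back unchanged, which is the redundant equation we started with.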

Note that changing coordinate systems and transforming an object are conceptually different, but end up being the same operation. Let’s look at simple 2D rotation as an example. Suppose I want to spin a point $\theta$ degrees around the origin. We want to derive the corresponding transformation function:

$P(x,y) = \mathbf{O} + \mathbf{X}x + \mathbf{Y}y$

By using the following diagram, and remembering our trigonometry, we see that P satisfies the following:

$P(1,0) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \\ P(0,1) = \begin{bmatrix} \cos(\theta+90^\circ) \\ \sin(\theta+90^\circ) \end{bmatrix} = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}$

So:

$\mathbf{O} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\\mathbf{X} = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \\\mathbf{Y} = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} \\P(x,y) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} x + \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} y$

So, it turns out that if we first find our rotated coordinate axes, we can use the rotated axes to compute the rotated location of any point. The new coordinates are computed by moving x units along the new X axis, and y units along the new Y axis. The original point (x,y) is in object space and the transformation yields a new point in world space which is its rotated location.
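The same recipe in code, as a small sketch: find the rotated axes first, then walk x units along one and y units along the other. The function name `rotate_2d` is just something I made up for this example.

```python
import math

def rotate_2d(theta_degrees, x, y):
    """Rotate (x, y) about the origin by moving along the rotated axes."""
    t = math.radians(theta_degrees)
    x_axis = (math.cos(t), math.sin(t))     # where (1, 0) ends up
    y_axis = (-math.sin(t), math.cos(t))    # where (0, 1) ends up
    # O is the zero vector here, so it drops out of the sum.
    return (x_axis[0] * x + y_axis[0] * y,
            x_axis[1] * x + y_axis[1] * y)

# A quarter turn carries the point (1, 0) to (0, 1), up to rounding error.
rx, ry = rotate_2d(90.0, 1.0, 0.0)
```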

…Do You Know What I Am Talking About? …

Now, suppose we want to go the other way. Say we have a point P in world space, and a local coordinate system, as shown.

How do we go from our world coordinates (black lines) to the corresponding local coordinates (red lines)? This is the problem we end up having when we want to go from world space into camera space. Again, we can describe this as changing coordinate systems or as moving the world to align to the camera. In my opinion, the former is the more intuitive description. Imagining the world moving around us feels kind of arrogant.

The first thing we need to do is subtract off the origin to put our point relative to the local origin. I’m gonna change notation a little and write it as a vector-valued function this time:

$f(\mathbf{P}) = \mathbf{P} - \mathbf{O}$

Now, oftentimes when the textbooks teach viewing, they give an explanation that involves a whole bunch of rotations around various axes in order to get ourselves pointed the right way. Screw that. That’s complicated and error prone. Instead, let’s use dot products. They’re easy.

We know that the Z axis of our camera space is going to point down the viewing direction (or its opposite, if you’re a right-handed heretic). This gives us one of the basis vectors for the camera space, and we can use cross products and an ‘Up’ vector to extract the other two.

$\mathbf{X} = \mathbf{Z} \times \mathbf{Up} \\ \mathbf{Y} = \mathbf{X} \times \mathbf{Z}$
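Here is a rough sketch of that construction in Python. The helper names (`cross`, `normalize`, `camera_basis`) are hypothetical, and the normalization steps are my own addition: the formulas above only produce unit vectors if Z is unit length and Up is perpendicular to it, which is rarely true of a generic ‘Up’ vector. The operand order follows the formulas as written; swapping the operands of a cross product negates the result, so other conventions will give you a mirrored X.

```python
import math

def cross(a, b):
    """Standard component formula for the 3D cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale v to unit length (fails on a zero vector)."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def camera_basis(view_dir, up):
    """Build camera axes from a viewing direction and an 'Up' hint.

    Assumes up is not parallel to view_dir, otherwise the cross
    product is zero and normalize divides by zero.
    """
    z = normalize(view_dir)
    x = normalize(cross(z, up))   # X = Z x Up, normalized (my assumption)
    y = cross(x, z)               # Y = X x Z, unit length since x and z are unit and perpendicular
    return x, y, z

# Hypothetical camera looking forward and slightly downward.
X, Y, Z = camera_basis((0.0, -3.0, 4.0), (0.0, 1.0, 0.0))
```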

Then what?? If I ever interview you for a graphics job, I will ask you to tell me several useful facts about the dot product. Here’s one: The dot product of two vectors A and B gives the length of the projection of A onto B, as shown:

If B is a unit vector then the dot product gives the “B-coordinate” of A. If it is not a unit vector, the result is scaled by the length of B. To see why, you can just imagine what would happen if B were one of the coordinate axes. The dot product degenerates to the value of the corresponding coordinate of A. For arbitrary B we can just rotate to align B to an axis and then remember that lengths are unchanged under rotation.
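A quick numeric check of that claim, as a sketch (the vectors are arbitrary values I picked):

```python
def dot(a, b):
    """Sum of component-wise products."""
    return sum(ai * bi for ai, bi in zip(a, b))

A = (3.0, 4.0, 5.0)

# B is the Y axis: the dot product degenerates to A's y coordinate.
y_coord = dot(A, (0.0, 1.0, 0.0))

# B is twice as long: the result is scaled by the length of B.
scaled = dot(A, (0.0, 2.0, 0.0))

# B is an arbitrary unit vector: we get A's coordinate along B.
along_b = dot(A, (0.6, 0.8, 0.0))

print(y_coord, scaled)  # 4.0 8.0
```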

So, to finish converting our vector into camera space, we just dot it with each of our camera’s basis vectors:

$f(\mathbf{P}) = \begin{bmatrix} (\mathbf{P}-\mathbf{O}) \cdot \mathbf{X} \\ (\mathbf{P}-\mathbf{O}) \cdot \mathbf{Y} \\ (\mathbf{P}-\mathbf{O}) \cdot \mathbf{Z} \\\end{bmatrix}$
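In code, the whole world-to-local conversion is a subtraction and three dot products. This is a minimal sketch with a made-up name (`world_to_local`), and it assumes the basis vectors are unit length and mutually perpendicular, as they are for a camera.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def world_to_local(origin, axes, p):
    """Subtract off the local origin, then dot with each local basis vector."""
    rel = tuple(pi - oi for pi, oi in zip(p, origin))
    return tuple(dot(rel, axis) for axis in axes)

# Hypothetical local frame, expressed in world coordinates.
O = (10.0, 0.0, 0.0)
X = (0.0, 1.0, 0.0)
Y = (-1.0, 0.0, 0.0)
Z = (0.0, 0.0, 1.0)

local = world_to_local(O, (X, Y, Z), (7.0, 2.0, 4.0))
print(local)  # (2.0, 3.0, 4.0)
```

Going the other way, O + 2X + 3Y + 4Z lands back on (7, 2, 4), so the two directions really are inverses of each other for this frame.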

Do You Want To Know… WHAT .. IT … IS?

I’ve just gone to a great deal of trouble describing transformations in vector terms, but in graphics, we’re accustomed to using matrices for these things. Consider our 2D rotation example. We have:

$P(x,y) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} x + \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} y$

This can be written in matrix form as:

$P(x,y) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$

For the more general 2D transform, we use the usual homogeneous coordinate trickery to add a translation. If we have:

$P(x,y) = \mathbf{X}x + \mathbf{Y}y + \mathbf{O}$

We can write:

$P(x,y) = \begin{bmatrix} X_x & Y_x & O_x \\ X_y & Y_y & O_y\end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$

We can re-write this matrix multiplication in vector form as:

$P(x,y) = \begin{bmatrix} X_x \\ X_y \end{bmatrix}x +\begin{bmatrix} Y_x \\ Y_y \end{bmatrix}y +\begin{bmatrix} O_x \\ O_y \end{bmatrix}$

We now see that multiplication by any transformation matrix will transform out of some coordinate system. We can also see that if we have a model to world transform, we can easily determine the position of the object’s “local” coordinate frame just by examining the structure of the transformation matrix. So, if you ever find yourself wanting to multiply by [1,0,0] and [0,1,0], remember, there is no spoon.
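A small sketch of the “no spoon” point, using a 2×3 matrix stored as a list of rows (the layout and the helper names are assumptions of this example, not anything standard): the columns of the matrix are the local X axis, the local Y axis, and the local origin, so we can read them off instead of multiplying.

```python
def transform_2d(m, x, y):
    """Apply a 2x3 matrix to the homogeneous point (x, y, 1)."""
    return tuple(row[0] * x + row[1] * y + row[2] for row in m)

def column(m, j):
    """Read column j straight out of the matrix."""
    return tuple(row[j] for row in m)

# Hypothetical model-to-world transform: quarter turn, then move to (5, 2).
M = [[0.0, -1.0, 5.0],
     [1.0,  0.0, 2.0]]

x_axis = column(M, 0)   # (0.0, 1.0)
y_axis = column(M, 1)   # (-1.0, 0.0)
origin = column(M, 2)   # (5.0, 2.0)

# Transforming the local origin and the tip of the local X axis gives
# the same information, with more arithmetic.
p0 = transform_2d(M, 0.0, 0.0)   # equals origin
p1 = transform_2d(M, 1.0, 0.0)   # equals origin + x_axis
```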

It Is The World That Has Been Pulled Over Your Eyes…

Now, what is matrix multiplication, really? It turns out that matrix multiplication, in the end, is just a bunch of dot products:

$\begin{bmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{bmatrix}\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}=\begin{bmatrix} \mathbf{R}_0 \\ \mathbf{R}_1 \\ \mathbf{R}_2 \\ \mathbf{R}_3 \end{bmatrix}\mathbf{V}=\begin{bmatrix} ax + by + cz + dw \\ ex + fy + gz + hw \\ ix + jy + kz + lw \\ mx + ny + oz + pw \end{bmatrix}=\begin{bmatrix} \mathbf{R}_0 \cdot \mathbf{V} \\ \mathbf{R}_1 \cdot \mathbf{V} \\ \mathbf{R}_2 \cdot \mathbf{V} \\ \mathbf{R}_3 \cdot \mathbf{V}\end{bmatrix}$
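Written as code, the row view is almost embarrassingly short. This sketch stores the matrix as a list of rows, which is a choice I made for the example:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mat_vec(m, v):
    """Matrix-vector product: one dot product per row of the matrix."""
    return [dot(row, v) for row in m]

M = [[1.0,  2.0,  3.0,  4.0],
     [5.0,  6.0,  7.0,  8.0],
     [9.0,  10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
V = [1.0, 0.0, -1.0, 2.0]

result = mat_vec(M, V)
print(result)  # [6.0, 14.0, 22.0, 30.0]
```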

This interesting fact tells us that every matrix also transforms into some other coordinate system, albeit in a much less obvious way. Recall our example of a viewing transform. We had:

$f(\mathbf{P}) = \begin{bmatrix} (\mathbf{P}-\mathbf{O}) \cdot \mathbf{X} \\ (\mathbf{P}-\mathbf{O}) \cdot \mathbf{Y} \\ (\mathbf{P}-\mathbf{O}) \cdot \mathbf{Z} \\\end{bmatrix}$

Another fun fact about the dot product is that it’s distributive. So:

$f(\mathbf{P}) = \begin{bmatrix} \mathbf{P}\cdot\mathbf{X} - \mathbf{O} \cdot \mathbf{X} \\ \mathbf{P}\cdot\mathbf{Y} - \mathbf{O} \cdot \mathbf{Y} \\ \mathbf{P}\cdot\mathbf{Z} - \mathbf{O} \cdot \mathbf{Z} \\\end{bmatrix}$

And our world-to-local transform can be written as:

$f(\mathbf{P}) =\begin{bmatrix} X_x & X_y & X_z &-\mathbf{X}\cdot\mathbf{O} \\ Y_x & Y_y & Y_z &-\mathbf{Y}\cdot\mathbf{O} \\ Z_x & Z_y & Z_z &-\mathbf{Z}\cdot\mathbf{O} \\ 0 & 0 & 0 & 1\end{bmatrix}\begin{bmatrix}P_x \\ P_y \\ P_z \\ 1 \end{bmatrix}$

If I want to set up my viewing transform, it is a lot simpler to build this matrix directly than to try and concatenate a whole bunch of rotation matrices together. If I want to just transform a single point, it’s easier and cheaper to just do the vector math directly. However, this is only possible if we can see beyond the usual building blocks of translation, rotation, and scale.
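As a sketch of what “build this matrix directly” might look like: the function below fills in the rows from the camera’s origin and basis vectors, with no rotation matrices anywhere. The name `world_to_local_matrix` and the sample frame are hypothetical, and the basis is assumed to be orthonormal.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def world_to_local_matrix(origin, x_axis, y_axis, z_axis):
    """Rows are the local axes; the last column holds -axis . origin."""
    return [
        [*x_axis, -dot(x_axis, origin)],
        [*y_axis, -dot(y_axis, origin)],
        [*z_axis, -dot(z_axis, origin)],
        [0.0, 0.0, 0.0, 1.0],
    ]

def mat_vec(m, v):
    return [dot(row, v) for row in m]

# Hypothetical camera frame in world coordinates.
O = (10.0, 0.0, 0.0)
X = (0.0, 1.0, 0.0)
Y = (-1.0, 0.0, 0.0)
Z = (0.0, 0.0, 1.0)

view = world_to_local_matrix(O, X, Y, Z)
local = mat_vec(view, [7.0, 2.0, 4.0, 1.0])
print(local)  # [2.0, 3.0, 4.0, 1.0]
```

The first three components match what we get from subtracting O and dotting with X, Y, and Z by hand.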

… To Blind You From The Truth

Matrices are a convenient, one-size-fits-all solution, but transformation matrices are ultimately an important implementation detail, not a fundamental concept. I think it’s wise to think of them as such. Matrix algebra is a powerful tool, but it tends to obscure the geometric interpretations.

There are two main reasons we use matrices in practice:

• They are able to encapsulate all of the transforms that we care about in a unified way.
• They allow us to concatenate and invert arbitrary transforms with relative ease.

Matrices are not the only way to implement a geometric transform, and if you happen to know something about the transforms that you’re using, you should almost certainly not use a full matrix.

A 4×4 matrix-vector multiplication requires 28 flops (16 multiplies and 12 adds). That’s only 16 operations if we have mads, and it’s easily vectorized, but it’s often more work than we actually need. For instance, if a transform is known to always be affine (that is, it has (0,0,0,1) in the bottom row), we can skip the w coordinate and save 25% of the work. The most common modeling and viewing transformations fall into this category. Orthographic projections do too.
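Here is a sketch of the affine shortcut next to the general path. Both function names are made up for this example. Note that the shortcut below also assumes the input point has w = 1, which lets it add the translation column directly instead of multiplying by w; that goes a little beyond just skipping the bottom row.

```python
def transform_full(m, v):
    """General path: all four rows, all four columns."""
    return tuple(sum(row[i] * v[i] for i in range(4)) for row in m)

def transform_affine_point(m, p):
    """Affine shortcut: ignore the (0, 0, 0, 1) row and assume w == 1."""
    x, y, z = p
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3]
                 for row in m[:3])

# Hypothetical affine transform (bottom row is 0, 0, 0, 1).
M = [[0.0,  1.0, 0.0, 0.0],
     [-1.0, 0.0, 0.0, 10.0],
     [0.0,  0.0, 1.0, 0.0],
     [0.0,  0.0, 0.0, 1.0]]

full = transform_full(M, (7.0, 2.0, 4.0, 1.0))
fast = transform_affine_point(M, (7.0, 2.0, 4.0))
```

In Python the difference is academic, of course. The point is the shape of the arithmetic, which carries over to shader or SIMD code.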

There are other small efficiencies to be had. If we know that we are rotating around a given coordinate axis, we can skip the evaluation for the axis that’s being rotated around. An even more egregious example is the use of a full matrix for simple operations like translation or scaling. Using a 4×4 matrix to perform a 3D scale is doing about 9x as much arithmetic as is actually required. There was a time when all the graphics hardware was heavily geared towards matrix operations and this didn’t matter too much. Nowadays though, with all the GPUs having scalar or VLIW ISAs, there is probably some real performance to be gained using tricks like these.
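To see where the roughly 9x figure comes from, compare a direct 3D scale (3 multiplies) against the same scale pushed through a full 4×4 multiply (16 multiplies and 12 adds). The helper names below are hypothetical, and this is only a sketch to show that both routes agree:

```python
def scale_point(s, p):
    """Direct 3D scale: three multiplies, nothing else."""
    return tuple(si * pi for si, pi in zip(s, p))

def scale_matrix(s):
    """The same scale packed into a full 4x4 matrix."""
    return [[s[0], 0.0,  0.0,  0.0],
            [0.0,  s[1], 0.0,  0.0],
            [0.0,  0.0,  s[2], 0.0],
            [0.0,  0.0,  0.0,  1.0]]

def mat_vec(m, v):
    return tuple(sum(row[i] * v[i] for i in range(4)) for row in m)

s = (2.0, 3.0, 4.0)
p = (1.0, 5.0, -2.0)

direct = scale_point(s, p)
via_matrix = mat_vec(scale_matrix(s), (*p, 1.0))
print(direct)  # (2.0, 15.0, -8.0)
```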