Ray tracing is like a gateway drug for graphics engineers. It starts with some harmless fun in college, but if you’re not careful it will suck you in and consume your entire life.
I was clean for quite some time, but last Siggraph there was this raytracing paper by Barringer and Akenine-Moeller, and between it and my shiny new Haswell laptop with its 256-bit ALUs, I just couldn’t help it. I needed it. I’m writing it all down so I can finally start doing something constructive with my side projects again.
There has been a lot of research into accelerating ray tracing over the past few years. Quite a few very different approaches have been developed for mapping the raytracing problem onto CPUs as efficiently as possible. There are too many variations on the technique for me to give a complete taxonomy here, but I’ll review the most relevant paradigms:
Single Ray Traversal
The first way is the one we all learned in Graphics Kindergarten. Trace individual rays, one at a time, through the acceleration structure. Contrary to popular belief, the naive single-ray technique, properly implemented on modern CPUs, is NOT limited by the cost of computing the intersection tests. It is primarily constrained by memory access. Ray tracing makes a lot of irregular memory accesses to the acceleration structure as the ray propagates down through the hierarchy, and the memory system just isn’t able to keep up. State of the art single-ray traversal adapts to this by altering the acceleration structure.
The so-called “QBVH” is the current favorite, and it works by using a 4-ary tree instead of the more traditional binary tree. Each node holds the bounding boxes of its four children, which are intersection tested 4 at a time using SIMD instructions. The speedup from the QBVH has less to do with its use of SIMD and more to do with its memory access characteristics. The QBVH ends up using less memory than a corresponding binary BVH, which brings with it a corresponding drop in memory bandwidth.
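For illustration, here is a minimal sketch of what such a node might look like (the field names are mine, not from any particular QBVH implementation):

struct QBVHNode
{
    // slabs of the 4 children, stored component-wise so that one 4-wide
    // SIMD load per component feeds a 4-at-a-time ray/box test
    float BMinX[4];  float BMaxX[4];
    float BMinY[4];  float BMaxY[4];
    float BMinZ[4];  float BMaxZ[4];
    int32 Children[4];   // child node index, or an encoded triangle reference for leaves
};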
Unfortunately, as Wald et al. found, increasing the ‘arity’ of the tree starts to hit diminishing returns as the SIMD width increases. It’s a nice boost for a 4-wide SIMD, but any wider than that and utilization suffers.
Packet Traversal
Packet traversal works by grouping multiple rays into SIMD-sized groups, and walking all N rays down the tree simultaneously. In its most obvious form, N is chosen to be the SIMD width (4 for SSE, 8 for AVX, and so on), and the ray/node intersection tests are re-written to operate on N rays at a time in parallel. Packet traversal can be further improved by using larger packet sizes and iterating over SIMD-sized groups at each node. Examples include the works of Wald et al. and Lauterbach et al.
The advantage of packet traversal is that it amortizes the costly memory accesses over a large number of rays, which swings the raytracing algorithm from being memory-bound back towards being compute-bound. The disadvantage of packet traversal is that it requires a great deal of coherence between the rays in a packet in order to be effective. Traversal continues down the tree until EVERY ray in the packet has missed a given node, and if the rays in a packet are going in very different directions, this can cause packet traversal to degenerate into a more expensive version of single-ray traversal.
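To make the control flow concrete, here is a minimal sketch of packet traversal; the helper functions (IntersectPacketAABB, IntersectPacketTriangles, IsLeaf) are hypothetical stand-ins, not code from my repo:

static void TracePacket( const BVHNode* pRoot, RayPacket& pack )
{
    const BVHNode* stack[64];
    size_t nStack = 0;
    const BVHNode* pNode = pRoot;
    while( pNode )
    {
        // one SIMD box test covers every ray in the packet
        uint32 hitMask = IntersectPacketAABB( pack, pNode );
        if( hitMask )   // keep descending as long as ANY ray hits
        {
            if( !pNode->IsLeaf() )
            {
                stack[nStack++] = pNode->GetRightChild();
                pNode = pNode->GetLeftChild();
                continue;
            }
            IntersectPacketTriangles( pack, pNode );  // leaf: test the triangles
        }
        pNode = nStack ? stack[--nStack] : 0;   // pop the next node, or finish
    }
}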
Streaming Traversal
Broadly speaking, Streaming Traversal is a variant of packet traversal, in which a very large packet is used, and rays which miss a bounding volume are gradually removed from consideration as the stream is walked down the tree. Overbeck’s partition traversal is one example, and the Dynamic Ray Stream paper is another. John Tsakok’s MBVH Ray Stream Tracing is yet another variation on this theme. And, as usual, Ingo Wald et al. wrote about this one too 🙂
Streaming traversal tries to avoid the de-coherence problems of conventional packet traversal by attempting to throw so many rays down at once that any natural coherence in the ray set can be teased out and exploited.
My Traversal
One thing that bugged me about Barringer et al. is the way they used SIMD. In their sample code, at least, they ended up spreading the x/y/z axes across the AVX lanes, but in a way that slightly underutilized them. I wanted to see if I could get a more traditional BVH2 traversal, with SIMD across rays, to give comparable performance.
For my raytracer I ended up combining an Overbeck-esque partition scheme with adaptive ray packet re-ordering. Since it’s a side project and I can do whatever I please, I decided to give myself free rein to use the full instruction set of my Haswell laptop. This means that AVX2, FMA, and all that other coolness is on the table.
Given a large collection of rays, we begin by partitioning the rays by octant, where ‘octant(r)’ is the 3-bit integer formed by the sign bits of the ray directions. Splitting rays by octant allows some of the comparisons to be eliminated from the ray-box intersection tests. After partitioning the rays by octant, we assemble them into packets of 8 rays each. Our ray packet structure looks like this:
struct __declspec(align(64)) RayPacket
{
    __m256 Ox;
    __m256 DInvx;
    __m256 Oy;
    __m256 DInvy;
    __m256 Oz;
    __m256 DInvz;
    __m256 TMax;
    uint32 RayOffsets[8];
};
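For reference, the octant classification mentioned above is just the three direction sign bits packed into an integer; a scalar sketch (my own helper, not code from the repo) would be:

static uint32 RayOctant( float dx, float dy, float dz )
{
    // one bit per axis, set if the ray points in the negative direction
    // (compares rather than raw sign bits, for clarity; -0.0 is treated as positive)
    uint32 oct = 0;
    if( dx < 0 ) oct |= 1;
    if( dy < 0 ) oct |= 2;
    if( dz < 0 ) oct |= 4;
    return oct;
}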
We traverse the packets for each octant through the tree independently. At each node of the tree, we intersect all of the rays with the node’s bounding box, 8 at a time, and produce an 8-bit hit mask for each 8-ray group. I found that I get better results by 2x unrolling this intersection loop and doing two packets at once. During the intersection loop, we generate bitmasks, one per group, indicating which rays in the group struck the AABB. We also maintain a running pop-count of the number of rays which have not missed an earlier node.
Here is code:
static size_t GroupTest2X( RayPacket** pPackets, float* pAABB, uint8* pMasks, size_t nGroups )
{
    // NOTE: pAABB contains node bounding box pre-swizzled for ray octant
    //  caller fetches node AABB and rearranges it for us.
    //  The overhead of this rearranging is amortized over the whole ray stream
    size_t g;
    size_t nTwos = (nGroups&~1);
    size_t nHitPopulation=0;
    for( g=0; g<nTwos; g += 2 )
    {
        RayPacket& pack0 = *pPackets[g];
        RayPacket& pack1 = *pPackets[g+1];

        __m256 Bmin = _mm256_broadcast_ss( pAABB+0 );
        __m256 Bmax = _mm256_broadcast_ss( pAABB+3 );
        __m256 O0 = _mm256_load_ps( (float*)&pack0.Ox );
        __m256 O1 = _mm256_load_ps( (float*)&pack1.Ox );
        __m256 D0 = _mm256_load_ps( (float*)&pack0.DInvx );
        __m256 D1 = _mm256_load_ps( (float*)&pack1.DInvx );
        __m256 r0 = _mm256_mul_ps( O0, D0 );
        __m256 r1 = _mm256_mul_ps( O1, D1 );
        __m256 tmin0 = _mm256_fmsub_ps( Bmin, D0, r0 );
        __m256 tmax0 = _mm256_fmsub_ps( Bmax, D0, r0 );
        __m256 tmin1 = _mm256_fmsub_ps( Bmin, D1, r1 );
        __m256 tmax1 = _mm256_fmsub_ps( Bmax, D1, r1 );

        Bmin = _mm256_broadcast_ss( pAABB+1 );
        Bmax = _mm256_broadcast_ss( pAABB+4 );
        O0 = _mm256_load_ps( (float*)&pack0.Oy );
        O1 = _mm256_load_ps( (float*)&pack1.Oy );
        D0 = _mm256_load_ps( (float*)&pack0.DInvy );
        D1 = _mm256_load_ps( (float*)&pack1.DInvy );
        r0 = _mm256_mul_ps( O0, D0 );
        r1 = _mm256_mul_ps( O1, D1 );
        __m256 t0 = _mm256_fmsub_ps( Bmin, D0, r0 );
        __m256 t1 = _mm256_fmsub_ps( Bmax, D0, r0 );
        __m256 t2 = _mm256_fmsub_ps( Bmin, D1, r1 );
        __m256 t3 = _mm256_fmsub_ps( Bmax, D1, r1 );
        tmin0 = _mm256_max_ps( tmin0, t0 );
        tmax0 = _mm256_min_ps( tmax0, t1 );
        tmin1 = _mm256_max_ps( tmin1, t2 );
        tmax1 = _mm256_min_ps( tmax1, t3 );

        Bmin = _mm256_broadcast_ss( pAABB+2 );
        Bmax = _mm256_broadcast_ss( pAABB+5 );
        O0 = _mm256_load_ps( (float*)&pack0.Oz );
        O1 = _mm256_load_ps( (float*)&pack1.Oz );
        D0 = _mm256_load_ps( (float*)&pack0.DInvz );
        D1 = _mm256_load_ps( (float*)&pack1.DInvz );
        r0 = _mm256_mul_ps( O0, D0 );
        r1 = _mm256_mul_ps( O1, D1 );
        t0 = _mm256_fmsub_ps( Bmin, D0, r0 );
        t1 = _mm256_fmsub_ps( Bmax, D0, r0 );
        t2 = _mm256_fmsub_ps( Bmin, D1, r1 );
        t3 = _mm256_fmsub_ps( Bmax, D1, r1 );
        tmin0 = _mm256_max_ps( tmin0, t0 );
        tmax0 = _mm256_min_ps( tmax0, t1 );
        tmin1 = _mm256_max_ps( tmin1, t2 );
        tmax1 = _mm256_min_ps( tmax1, t3 );

        // andnot -> uses sign-bit trick to skip tmax>0 check
        //  movemask only looks at the upper bits, so we can use the sign bits
        //  and skip an extra compare
        __m256 l0 = _mm256_min_ps( tmax0, pack0.TMax );
        __m256 l1 = _mm256_min_ps( tmax1, pack1.TMax );
        __m256 hit0 = _mm256_andnot_ps( tmax0, _mm256_cmp_ps(tmin0,l0,_CMP_LE_OQ) );
        __m256 hit1 = _mm256_andnot_ps( tmax1, _mm256_cmp_ps(tmin1,l1,_CMP_LE_OQ) );

        size_t mask0 = (size_t)_mm256_movemask_ps(hit0);
        size_t mask1 = (size_t)_mm256_movemask_ps(hit1);
        size_t nMergedMask = (mask1<<8)|mask0;
        *((uint16*)(pMasks+g)) = (uint16) nMergedMask;
        nHitPopulation += _mm_popcnt_u64(nMergedMask);
    }

    // Single-packet cleanup case omitted for brevity
    return nHitPopulation;
}
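To spell out the refactoring that makes the fused form possible (this just restates what the loop above is doing):

// classic slab test:      t = (B - O) * (1/D)
// distributed for FMA:    t = B*(1/D) - O*(1/D)
// so with r = O*(1/D) computed once per packet per axis:
__m256 r    = _mm256_mul_ps( O, DInv );
__m256 tmin = _mm256_fmsub_ps( Bmin, DInv, r );
__m256 tmax = _mm256_fmsub_ps( Bmax, DInv, r );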
I use fmsub instructions for the box test, as Barringer did. This turns out better than a mul/sub pair. I believe the reason is that, on my Haswell, both muls and madds can dual-issue with each other, but adds and subs cannot. If we go with a sequence of subs and muls, then the muls can never dual-issue with one another, since they’re gated by subs, which only have a throughput of one per clock.
After all the intersection tests, we maintain an array of pointers to the active packets, just as Overbeck did, and gradually pare it down by partitioning out groups in which all rays have missed a given node. If any packets hit the node, we partition out the missed ones and then push the group count and child nodes onto our traversal stack.
We use the “stored split axis” trick for ordered traversal here. During BVH construction, when a node is subdivided, we store the subdivision axis in the node. At trace time, we pick the next child to traverse by looking at the sign of the ray directions on this axis. Since we have pre-sorted our rays by octant, we can make the decision once for all the rays we’re currently traversing.
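A sketch of what that decision looks like, assuming the left child covers the lower half of the split axis (GetSplitAxis is a hypothetical accessor):

uint32 axis = pNode->GetSplitAxis();                // 0 = x, 1 = y, 2 = z
bool negative = (octant >> axis) & 1;               // sign of the ray directions on that axis
BVHNode* pNear = negative ? pNode->GetRightChild() : pNode->GetLeftChild();
BVHNode* pFar  = negative ? pNode->GetLeftChild()  : pNode->GetRightChild();
// visit pNear first, push pFar for later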
Eventually, we find ourselves left with only one active packet. At that point, we switch to a specialized single-packet traversal which uses the “two child” strategy, which Barringer also favors: we intersect one packet against the left and right child nodes at the same time. Since I’m using packets and not individual rays, I opted for a “voting” scheme for ordered traversal, choosing the next child based on whichever decision is most favorable to the majority of rays in the packet. I first picked up this idea from this RT07 paper. Below is code for the single-packet traversal loop. Each node stores a pointer to either its first triangle or its two child nodes, so that we can prefetch them as we go down. Even with the prefetch, cache misses on node data are still my biggest bottleneck from what I can tell.
The packet traversal is a template, so that _mm256_blend_ps can be used to swizzle the AABB components. CPU architects, if you’re reading this, please stop making vblendvps so slow. Duplicating all this code 8 times in a template just to shave a few instructions is not very good for the instruction cache. A blend variant that could take its mask from an integer register, instead of just an immediate, would be even more useful.
// fetch AABB. each of the min/max vectors has the same data
//  replicated into low and high lanes
//  bbmin(xyz), bbmax(xyz)
BVHNode* pNLeft  = pN->GetLeftChild();
BVHNode* pNRight = pN->GetRightChild();
// pLeft/pRight point at each child's bounding box floats
// (assuming the node layout starts with bbmin(xyz), bbmax(xyz))
const float* pLeft  = (const float*) pNLeft;
const float* pRight = (const float*) pNRight;
__m256 min0 = _mm256_broadcast_ps( (const __m128*)(pLeft) );
__m256 min1 = _mm256_broadcast_ps( (const __m128*)(pRight) );
__m256 max0 = _mm256_broadcast_ps( (const __m128*)(pLeft+3) );
__m256 max1 = _mm256_broadcast_ps( (const __m128*)(pRight+3) );

// swap planes based on octant
__m256 bbmin0 = _mm256_blend_ps(min0, max0, (OCTANT|OCTANT<<4) );
__m256 bbmax0 = _mm256_blend_ps(max0, min0, (OCTANT|OCTANT<<4) );
__m256 bbmin1 = _mm256_blend_ps(min1, max1, (OCTANT|OCTANT<<4) );
__m256 bbmax1 = _mm256_blend_ps(max1, min1, (OCTANT|OCTANT<<4) );

// prefetch left/right children since we'll probably need them
_mm_prefetch( (char*) pNLeft->GetPrefetch(), _MM_HINT_T0 );
_mm_prefetch( (char*) pNRight->GetPrefetch(), _MM_HINT_T0 );

// axis tests.  OD[0..2] hold the per-axis O*DInv products for this packet (computed before this snippet)
__m256 D = _mm256_load_ps( (float*)&pack.DInvx);
__m256 O = _mm256_load_ps( (float*) (OD+0) );
__m256 Bmin0 = _mm256_permute_ps(bbmin0, 0x00);
__m256 Bmax0 = _mm256_permute_ps(bbmax0, 0x00);
__m256 Bmin1 = _mm256_permute_ps(bbmin1, 0x00);
__m256 Bmax1 = _mm256_permute_ps(bbmax1, 0x00);
__m256 tmin0 = _mm256_fmsub_ps( Bmin0, D, O );
__m256 tmax0 = _mm256_fmsub_ps( Bmax0, D, O );
__m256 tmin1 = _mm256_fmsub_ps( Bmin1, D, O );
__m256 tmax1 = _mm256_fmsub_ps( Bmax1, D, O );

D = _mm256_load_ps( (float*)&pack.DInvy);
O = _mm256_load_ps( (float*) (OD+1) );
Bmin0 = _mm256_permute_ps(bbmin0, 0x55); // 0101
Bmax0 = _mm256_permute_ps(bbmax0, 0x55);
Bmin1 = _mm256_permute_ps(bbmin1, 0x55);
Bmax1 = _mm256_permute_ps(bbmax1, 0x55);
__m256 t0 = _mm256_fmsub_ps( Bmin0, D, O );
__m256 t1 = _mm256_fmsub_ps( Bmax0, D, O );
__m256 t2 = _mm256_fmsub_ps( Bmin1, D, O );
__m256 t3 = _mm256_fmsub_ps( Bmax1, D, O );
tmin0 = _mm256_max_ps( tmin0, t0 );
tmax0 = _mm256_min_ps( tmax0, t1 );
tmin1 = _mm256_max_ps( tmin1, t2 );
tmax1 = _mm256_min_ps( tmax1, t3 );

D = _mm256_load_ps( (float*)&pack.DInvz);
O = _mm256_load_ps( (float*) (OD+2) );
Bmin0 = _mm256_permute_ps(bbmin0, 0xAA); // 1010
Bmax0 = _mm256_permute_ps(bbmax0, 0xAA);
Bmin1 = _mm256_permute_ps(bbmin1, 0xAA);
Bmax1 = _mm256_permute_ps(bbmax1, 0xAA);
t0 = _mm256_fmsub_ps( Bmin0, D, O );
t1 = _mm256_fmsub_ps( Bmax0, D, O );
t2 = _mm256_fmsub_ps( Bmin1, D, O );
t3 = _mm256_fmsub_ps( Bmax1, D, O );
tmin0 = _mm256_max_ps( tmin0, t0 );
tmax0 = _mm256_min_ps( tmax0, t1 );
tmin1 = _mm256_max_ps( tmin1, t2 );
tmax1 = _mm256_min_ps( tmax1, t3 );

__m256 limL = _mm256_min_ps( tmax0, pack.TMax );
__m256 limR = _mm256_min_ps( tmax1, pack.TMax );

// using sign-bit trick for tmax >= 0
t0 = _mm256_cmp_ps( tmin0, limL, _CMP_LE_OQ );
t1 = _mm256_cmp_ps( tmin1, limR, _CMP_LE_OQ );
__m256 hitL = _mm256_andnot_ps( tmax0, t0 );
__m256 hitR = _mm256_andnot_ps( tmax1, t1 );

size_t maskhitL = _mm256_movemask_ps(hitL);
size_t maskhitR = _mm256_movemask_ps(hitR);
size_t maskhitB = maskhitL & maskhitR;
if( maskhitB )
{
    // both children hit:  vote on which one to visit first
    __m256 LFirst = _mm256_cmp_ps( tmin0, tmin1, _CMP_LT_OQ );
    size_t lf = _mm256_movemask_ps(LFirst) & maskhitB;
    size_t rf = ~lf & maskhitB;
    if( _mm_popcnt_u64( lf ) > _mm_popcnt_u64( rf ) )
    {
        pStack[0] = pNRight;
        pStack[1] = pNLeft;
        pStack += 2;
    }
    else
    {
        pStack[0] = pNLeft;
        pStack[1] = pNRight;
        pStack += 2;
    }
}
else
{
    if( maskhitL )
        *(pStack++) = pNLeft;
    if( maskhitR )
        *(pStack++) = pNRight;
}
Intersection Testing
This part’s boring. It’s just an AVX version of Kensler and Shirley: 8 rays to one tri, with some preprocessing. Nothing to see here. You can go look at my code if you’re curious. I did not spend a lot of time on triangle intersection and the code could probably be improved with some massaging.
Ray Reordering
The number of active groups, combined with the population count we did during the intersection test, gives us a measure of the “utilization” of the ray stream. If we find that we have M active groups but fewer than 4M active rays, then more than half of the SIMD lanes in our intersection tests are not doing any useful work. When this happens, we can improve efficiency by splitting up the packets and reassembling them into coherent ones, as crudely illustrated below:
Reordering the rays every so often makes our traversal more efficient, because we can replace lots of underutilized packet/box tests with a smaller number of fully utilized ones.
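The trigger itself is cheap; a sketch of the check (the variable and helper names here are mine) is:

// with 8-ray packets, fewer than 4 live rays per active group means
// more than half of the SIMD lanes are idle on average
if( nHitPopulation < 4*nActiveGroups )
    ReorderRayStream( pPackets, pMasks, nActiveGroups );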
If we’re not careful, this reordering can easily turn into the most expensive part of the algorithm. Rather than attempting to somehow scatter the rays into new locations, I found it better to just discard the old packets and overwrite them with ray data read from the original input stream. Note that in order for this to work, it is necessary to first ensure that any intersection results have been stored off since the packets are about to be invalidated. I deal with this problem by writing intersection results into their final locations during the ray/triangle intersection test, rather than attempting to store them in packetized form.
The reordering happens in two phases. Phase one produces an array of reordered ray locations, and phase two fetches the corresponding rays and overwrites the packets.
In phase one, I produce a flat array containing the reordered ray indices, such that the hit rays are all at the front. Each element of this array will contain the byte offset of the ray which belongs in that location in the reordered ray stream. This phase is done by examining each ray’s bit in the array of hit masks and putting it at either the head or the tail of the ray stream. While this is easily done with bit-twiddling loops, I found it slightly faster to do this in a more obtuse way.
SIMD Prefix Sum
UPDATE: This section is now obsolete. See here.
For a hit ray, its location in the reordered ray set is equal to the number of hit rays which preceded it in the original set. To figure out the location of a given ray, we have to add up the number of like rays which come before it. This operation is known as a prefix sum, and it shows up a lot in parallel programming. Similarly, for a miss ray, its location is num_rays-1 minus the number of missed rays which precede it in the ray set.
The following table provides an example:
Ray hit masks:       1  0  1  1  0  1  1  0
Hit ray locations:   0  -  1  2  -  3  4  -
Miss ray locations:  -  7  -  -  6  -  -  5
Once we have the hit ray locations, the miss locations can be computed cheaply from the prefix sum. Since all rays that are not hits are misses, we can just subtract the hit prefix sum from the number of prior rays (that is, from the ray’s index), then subtract the result from ‘num_rays-1’:
Ray hit masks:        1  0  1  1  0  1  1  0
prefix sum:           0  1  1  2  3  3  4  5
ray idx:              0  1  2  3  4  5  6  7
idx - prefixsum:      0  0  1  1  1  2  2  2
7-(idx-prefixsum):    7  7  6  6  6  5  5  5
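A scalar reference for the same computation (hitBase and missBase here correspond to nHitLoc and nMissLoc-1 in the SIMD code below) looks like this:

size_t hitPrefix = 0;
for( size_t r = 0; r < nRays; r++ )
{
    size_t hit = (hitMask >> r) & 1;
    size_t missPrefix = r - hitPrefix;           // number of misses preceding ray r
    size_t dst = hit ? (hitBase + hitPrefix)
                     : (missBase - missPrefix);  // misses fill in from the back
    pIDs[dst] = pRayOffsets[r];
    hitPrefix += hit;
}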
Going over the array of packets, we first use horizontal SIMD instructions to compute each ray’s location in the reordered array. We use this to construct an array of pointers to rays, in the order in which they will go into the reordered packets. Here is code:
static __m256i __forceinline BROADCASTINT( size_t x )
{
    return _mm256_broadcastd_epi32(_mm_cvtsi32_si128((int)x));
}

// for( packets i ) ...
uint32* __restrict pPacketRayIDs = pPackets[i]->RayOffsets;

__m256i ONES  = BROADCASTINT(1);
__m256i INDEX = _mm256_load_si256((__m256i*)SHIFTS);
__m256i masks = BROADCASTINT(pMasks[i]);
masks = _mm256_srlv_epi32(masks, INDEX );       // mask >> lane_idx
__m256i hits = _mm256_and_si256(masks,ONES);    // value is 1 for each lane if it hit

// We want to sum things as follows:  where: hxy is sum(hx...hy)
//
//   <--- Columns are SIMD lanes ---->
//   --------------------------------
//   0   0    0    0    0    0    0    0
//   0   h0   0    h2   0    h4   0    h6
//   0   0    h01  h01  0    0    h45  h45
//   0   0    0    0    h03  h03  h03  h03
//
__m256i Doubles = _mm256_hadd_epi32(hits,hits);       // h0+h1, h2+h3, h0+h1, h2+h3,  h4+h5, h6+h7, h4+h5, h6+h7
__m256i Quads   = _mm256_hadd_epi32(Doubles,Doubles); // h0-h3, h0-h3, h0-h3, h0-h3,  h4-h7, h4-h7, h4-h7, h4-h7
Quads = _mm256_inserti128_si256(_mm256_setzero_si256(),_mm256_castsi256_si128(Quads),0x1); // 0,0,0,0, h0-h3 .....

__m256i m0 = _mm256_shuffle_epi32( hits, SHUFFLE(0,0,2,2) );    // h0, h0, h2, h2,  h4, h4, h6, h6
__m256i m1 = _mm256_shuffle_epi32( Doubles, SHUFFLE(0,0,0,0) ); // h0+h1, ....,  h4+h5, ....
m0 = _mm256_blend_epi32( _mm256_setzero_si256(), m0, 0xAA );    // 10101010 -> 0 h0 0 h2 0 h4 0 h6
m1 = _mm256_blend_epi32( _mm256_setzero_si256(), m1, 0xCC );    // 11001100 -> 0 0 h01 h01 0 0 h45 h45

__m256i hitPrefix  = _mm256_add_epi32(Quads,_mm256_add_epi32(m0,m1)); // prefix sum on hit rays
__m256i missPrefix = _mm256_sub_epi32(INDEX,hitPrefix);               // prefix sum on missed rays
__m256i missMask   = _mm256_sub_epi32(hits,ONES);                     // 0xffffffff if lane is a miss, 0 otherwise

__m256i hitAddr  = BROADCASTINT(nHitLoc);
__m256i missAddr = BROADCASTINT(nMissLoc-1);
hitAddr  = _mm256_add_epi32(hitAddr,hitPrefix);
missAddr = _mm256_sub_epi32(missAddr,missPrefix);

__m256i addr = _mm256_blendv_epi8(hitAddr,missAddr,missMask);
__m128i hi = _mm256_extracti128_si256(addr,1);
__m128i lo = _mm256_extracti128_si256(addr,0);

// Pulling the ray data in ahead of time helps a little bit
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[0],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[1],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[2],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[3],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[4],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[5],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[6],_MM_HINT_T0 );
_mm_prefetch( (char*)(pRays)+pPacketRayIDs[7],_MM_HINT_T0 );

pIDs[ _mm_extract_epi32(lo,0) ] = pPacketRayIDs[0];
pIDs[ _mm_extract_epi32(lo,1) ] = pPacketRayIDs[1];
pIDs[ _mm_extract_epi32(lo,2) ] = pPacketRayIDs[2];
pIDs[ _mm_extract_epi32(lo,3) ] = pPacketRayIDs[3];
pIDs[ _mm_extract_epi32(hi,0) ] = pPacketRayIDs[4];
pIDs[ _mm_extract_epi32(hi,1) ] = pPacketRayIDs[5];
pIDs[ _mm_extract_epi32(hi,2) ] = pPacketRayIDs[6];
pIDs[ _mm_extract_epi32(hi,3) ] = pPacketRayIDs[7];

size_t nHitPop = _mm_popcnt_u64(pMasks[i]);
nHitLoc  += nHitPop;
nMissLoc -= (8-nHitPop);
Re-Reading Rays
At this point, I have a reordered set of pointers to arbitrary rays, and I wish to fetch data from these pointers and assemble them into packets. You might think that with Haswell and AVX2, I would be best served using a series of gather instructions for this. Nope. Not even close. The gather instruction in Haswell is micro-coded, and is thus appallingly slow, and should be avoided whenever there is any coherence whatsoever in the data. Broadwell’s not much better, from what I gather. Perhaps Intel will one day get around to making the gather instruction useful, but until it’s fast enough to be worthwhile, it’s just going to languish in obscurity.
In my case, since the ray data are all adjacent in memory, I’m better off doing something like an 8×8 matrix transpose. A series of half-width loads and shuffles is considerably faster than gathers, despite the latter being simpler and smaller. I also found, to my surprise, that it is better to re-compute the reciprocal ray directions during the reordering phase than to try to store and re-load them. I suppose this is because it reduces cache pressure.
Here is code:
static __m256 __forceinline RCPNR( __m256 f )
{
    // approximate reciprocal followed by one Newton-Raphson refinement step:
    //   rcp*(2 - f*rcp)  ==  2*rcp - f*rcp^2
    __m256 rcp    = _mm256_rcp_ps(f);
    __m256 rcp_sq = _mm256_mul_ps(rcp,rcp);
    __m256 rcp_x2 = _mm256_add_ps(rcp,rcp);
    return _mm256_fnmadd_ps( rcp_sq, f, rcp_x2 );
}

// for( packets i )
const uint32* __restrict pOffsets = pRayOffsets + 8*i;

// Load lower halves into L0-L3, and upper halves into L4-L7
//  Lower half contains origin (x,y,z) and TMax
//  Upper half contains directions(x,y,z), and byte offset from start of ray stream
//
//  Using 128-bit loads and inserts is preferable to 256-bit loads and cross-permutes
//  The inserts can be fused with the loads, and Haswell can issue them on more ports that way
#define LOADPS(x) _mm_load_ps((float*)(x))
__m256 l0 = _mm256_set_m128( LOADPS(pRays + pOffsets[1]),    LOADPS(pRays+pOffsets[0]));
__m256 l1 = _mm256_set_m128( LOADPS(pRays + pOffsets[3]),    LOADPS(pRays+pOffsets[2]));
__m256 l2 = _mm256_set_m128( LOADPS(pRays + pOffsets[5]),    LOADPS(pRays+pOffsets[4]));
__m256 l3 = _mm256_set_m128( LOADPS(pRays + pOffsets[7]),    LOADPS(pRays+pOffsets[6]));
__m256 l4 = _mm256_set_m128( LOADPS(pRays + pOffsets[1]+16), LOADPS(pRays+pOffsets[0]+16));
__m256 l5 = _mm256_set_m128( LOADPS(pRays + pOffsets[3]+16), LOADPS(pRays+pOffsets[2]+16));
__m256 l6 = _mm256_set_m128( LOADPS(pRays + pOffsets[5]+16), LOADPS(pRays+pOffsets[4]+16));
__m256 l7 = _mm256_set_m128( LOADPS(pRays + pOffsets[7]+16), LOADPS(pRays+pOffsets[6]+16));
#undef LOADPS

__m256 t4 = _mm256_unpacklo_ps(l4,l5); // 44 55 44 55
__m256 t5 = _mm256_unpacklo_ps(l6,l7); // 44 55 44 55
t4 = RCPNR(t4); // both 4 and 5 get rcp'd eventually, so we can start them earlier
t5 = RCPNR(t5); //  to give the other ports something to do during this monstrous blob of unpacks

__m256 t0 = _mm256_unpacklo_ps(l0,l1); // 00 11 00 11
__m256 t1 = _mm256_unpacklo_ps(l2,l3); // 00 11 00 11
__m256 t2 = _mm256_unpackhi_ps(l0,l1); // 22 33 22 33
__m256 t3 = _mm256_unpackhi_ps(l2,l3); // 22 33 22 33
__m256 t6 = _mm256_unpackhi_ps(l4,l5); // 66 77 66 77
__m256 t7 = _mm256_unpackhi_ps(l6,l7); // 66 77 66 77

__m256 Ox   = _mm256_unpacklo_ps(t0,t1); // 00 00 00 00
__m256 Oy   = _mm256_unpackhi_ps(t0,t1); // 11 11 11 11
__m256 Oz   = _mm256_unpacklo_ps(t2,t3); // 22 22 22 22
__m256 TMax = _mm256_unpackhi_ps(t2,t3); // 33 33 33 33
__m256 Dx   = _mm256_unpacklo_ps(t4,t5);
__m256 Dy   = _mm256_unpackhi_ps(t4,t5);
__m256 Dz   = _mm256_unpacklo_ps(t6,t7);
__m256 RID  = _mm256_unpackhi_ps(t6,t7);

RayPacket* __restrict pPacket = pPackets[i];
_mm256_store_ps( (float*)&pPacket->Ox,    Ox );
_mm256_store_ps( (float*)&pPacket->DInvx, (Dx) );
_mm256_store_ps( (float*)&pPacket->Oy,    Oy );
_mm256_store_ps( (float*)&pPacket->DInvy, (Dy) );
_mm256_store_ps( (float*)&pPacket->Oz,    Oz );
_mm256_store_ps( (float*)&pPacket->DInvz, RCPNR(Dz) );
_mm256_store_ps( (float*)&pPacket->TMax,  TMax );
_mm256_store_ps( (float*)pPacket->RayOffsets, RID );
I’d be curious to know if OpenCL or ISPC are able to take an SPMD-style implementation of this and produce anything like the above code. I’m a bit skeptical. More likely it’s just going to do the gathers and be done with it, which would be fine, if gathers were really the best way.
Results
My code can be found here. Included in my github repo is a copy of Barringer’s code from here.
We’ll look at the time required to scatter 20M photons in the BART Kitchen scene, which I’ve converted to PLY and included in the repo. All surfaces are set to boring grey diffuse. Photons are traced through up to 12 bounces, and Russian roulette is used to terminate each photon path with a termination probability of 0.3. This results in roughly 23M rays shot. Photons are emitted in batches of 4096. During the first bounce, a full load of 4096 rays is traced, with subsequent bounces gradually removing photons as they miss the scene or are absorbed. The average number of rays traced in each ray stream across all bounces is 867.
Tests were run on a Haswell i3-4010U, using one core, with code compiled with MSVC Express 2012. Results are averaged over 3 runs. I report two sets of numbers: the total time, and the amount of time spent actually raytracing, which is only around 60% of the total.
I’m not beating Barringer et al. yet, but I’m getting really close. Some cursory tests show that if I pull the bounce limit in so that my streams are larger (~2330 rays or so), then I start to pull ahead, albeit slowly. Given a more complicated renderer that can always generate large streams across ray generations, I suspect that I’ll continue to out-pace them. I also suspect that this scheme is going to end up being the best way to utilize 16-wide SIMD, but more testing is needed, and I need to go into raytracing rehab for a while.