Raytracing Addendum

This is an addendum to my previous post; read that first for context.

Fabian “Rygorous” Giesen suggested some great improvements to the ray reordering: first by pointing out that horizontal adds (hadds) are the wrong way to do a prefix sum, and then by pointing out that the prefix sum can be replaced entirely by a table of precomputed shuffles.

Here is the new, prefix-sum-free reordering code. It works by taking each packet’s hit mask and splitting it into two 4-bit halves. For each half, we load the ray IDs, shuffle them into contiguous hit/miss sets, and then use unaligned stores to glue them onto the reordered array. This is fast because the shuffle pattern for each 4-bit hit mask can be precomputed.

This ends up doing multiple overlapping writes to the same locations, but everything comes out right in the end as long as we’re careful to do them in the right order. It also requires unaligned __m128i stores, but on newer CPUs these are no longer as expensive as they used to be.

    // One entry per 4-bit hit mask: hit lanes come first (ascending order),
    // followed by miss lanes (descending order). Entries m and m|8 are
    // identical because lane 3 lands immediately after the lower hit lanes
    // whether it hit or missed.
    static const __m128i SHUFFLE_TABLE[16] = {
         _mm_setr_epi8(12,13,14,15, 8, 9,10,11, 4, 5, 6, 7, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3,12,13,14,15, 8, 9,10,11, 4, 5, 6, 7),
         _mm_setr_epi8( 4, 5, 6, 7,12,13,14,15, 8, 9,10,11, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,12,13,14,15, 8, 9,10,11),
         _mm_setr_epi8( 8, 9,10,11,12,13,14,15, 4, 5, 6, 7, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3, 8, 9,10,11,12,13,14,15, 4, 5, 6, 7),
         _mm_setr_epi8( 4, 5, 6, 7, 8, 9,10,11,12,13,14,15, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15),
 
         _mm_setr_epi8(12,13,14,15, 8, 9,10,11, 4, 5, 6, 7, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3,12,13,14,15, 8, 9,10,11, 4, 5, 6, 7),
         _mm_setr_epi8( 4, 5, 6, 7,12,13,14,15, 8, 9,10,11, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,12,13,14,15, 8, 9,10,11),
         _mm_setr_epi8( 8, 9,10,11,12,13,14,15, 4, 5, 6, 7, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3, 8, 9,10,11,12,13,14,15, 4, 5, 6, 7),
         _mm_setr_epi8( 4, 5, 6, 7, 8, 9,10,11,12,13,14,15, 0, 1, 2, 3),
         _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15),
    };
 
    static void __fastcall ReorderRays( StackFrame& frame, size_t nGroups )
    {        
        RayPacket** pPackets = frame.pActivePackets;
 
        uint32 pReorderIDs[MAX_TRACER_SIZE];
 
        size_t nHits   = 0;
        size_t nFirstMiss = 8*nGroups;
 
        const char* pRays = (const char*) frame.pRays;
        for( size_t i=0; i<nGroups; i++ )
        {
            uint32* __restrict pPacketRayIDs = pPackets[i]->RayOffsets;
 
            uint64 hits    = frame.pMasks[i];
            uint64 hit_lo  = (hits & 0x0f);
            uint64 hit_hi  = (hits & 0xf0)>>4;
            uint64 pop_lo  = _mm_popcnt_u64(hit_lo);
            uint64 pop_hi  = _mm_popcnt_u64(hit_hi);
 
             // load lo/hi ID pairs
            __m128i id_lo = _mm_load_si128( (__m128i*) pPacketRayIDs );
            __m128i id_hi = _mm_load_si128( (__m128i*) (pPacketRayIDs+4) );
 
            // store hit/miss IDs
            __m128i shuf_lo   = _mm_shuffle_epi8( id_lo, SHUFFLE_TABLE[hit_lo] );
            __m128i shuf_hi   = _mm_shuffle_epi8( id_hi, SHUFFLE_TABLE[hit_hi] );
            _mm_storeu_si128( (__m128i*)&pReorderIDs[nHits], shuf_lo );
            nHits      +=  pop_lo;
            _mm_storeu_si128( (__m128i*)&pReorderIDs[nHits], shuf_hi );
            nHits      +=  pop_hi;
 
            // NOTE: Hits MUST be written before misses, or a full-hit packet can corrupt the miss area
            _mm_storeu_si128( (__m128i*)&pReorderIDs[nFirstMiss-4], shuf_lo );
            nFirstMiss -= 4-pop_lo;
            _mm_storeu_si128( (__m128i*)&pReorderIDs[nFirstMiss-4], shuf_hi );
            nFirstMiss -= 4-pop_hi;
        }
 
        ReadRaysLoopArgs args;
        args.pPackets = pPackets;
        args.pRayIDs = pReorderIDs;
        args.pRays = (const byte*)pRays;
        ReadRaysLoop(args,nGroups);
    }

Here are the new benchmark results. Same parameters as in the previous post (time to scatter 20M photons, averaged over 3 runs). Code is available here.

[Figure: benchmark results (graph2)]

[Figure: normalized results]

I’m now consistently faster than Barringer et al., though not by very much. There is probably still room for improvement.

Some algorithmic changes might help. Our method utilizes the SIMD units better near the top of the tree, but is probably worse down at the bottom, once all of our ray coherence has gone to pot. Better results could probably be achieved by combining both approaches: pick a cutoff depth and switch to single-ray streaming traversal below that threshold. It might also be beneficial to change BVH styles at the lower levels, perhaps making the lowest 5 levels or so “QBVH-style”.

Alternatively, I suspect it might be profitable to design a raytracer that aims for a very high ray load in every traced packet. In my simple test, the number of available rays drops gradually as the rays bounce around, so some tracing runs contain fewer rays than others. If they could all be kept maximally full, we might collect enough rays to avoid losing coherence at the leaves. However, achieving this requires a much more elaborate system at the top for managing the in-flight rays. Such things have been done before, though. See here, here, and here.