
Cycles PMJ adaptive sampling working poorly with many bounces for Cycles-X
ClosedPublic

Authored by William Leeson (leesonw) on Aug 13 2021, 6:48 PM.

Details

Summary

This fixes T79190
Problem:
There is correlation at the edges of the image tiles due to using the same hash twice with an xor. Also, the PMJ sampling pattern, when used with xor scrambling, does not generate a good sample set, which causes the rays to not sample the higher dimensions correctly.
Solution:
Use a different pixel hash to avoid clashes with the hash used during path tracing. Also, use Cranley-Patterson rotation (which matches the Sobol sampling scheme) with the PMJ sampling scheme instead of xor scrambling, which was resulting in a bad sample set. The number of samples drawable from the PMJ sampler is also increased by using all the PMJ sample sets instead of just one.
Issues:
The new samples produce images with more noise than the original PMJ sets and also the Sobol sets. This unfortunately is not avoidable, as the nature of jitter sampling produces more noise when there is little correlation between sample sets. To alleviate this to some extent, the Cranley-Patterson rotation method from the Sobol sample generator is used, as it has some degree of correlation due to the simple hash used.

Diff Detail

Repository
rB Blender

Event Timeline

This revision now requires changes to proceed. Aug 23 2021, 3:30 PM

I reformatted the code with

make format

so it should look correct now. I'll investigate my formatter setup to see what's up for future commits. However, while the increased noise is due to the nature of the sampling as you say, I am a bit worried that the simple random sampler tends to reduce the noise versus a better random sampler, so it may cause incorrect convergence results.

Adds references for the hashing functions.

William Leeson (leesonw) marked 4 inline comments as done. Aug 24 2021, 2:05 PM

Updated the diff to add references and some hashing choices.

intern/cycles/kernel/kernel_jitter.h
43–49

This was no longer needed so I removed it.

William Leeson (leesonw) marked an inline comment as done. Aug 25 2021, 9:40 AM

Update comments.

So the conclusion was that the area light noise difference is due to correlation of sample sets? Or are you still investigating that?

Yes I am. It's not the correlation between sample sets that is causing the issue; it's the correlation between pixels that is reducing the noise, due to cmj_hash_simple. Actually, I am beginning to think there is something else wrong: there are way too many totally black pixels, so I am investigating that. Currently I am looking into using different sample set sizes, i.e. 8x8 with 32 patterns, to see what happens, and it has turned up some interesting issues that I am following up on.
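A sketch of the kind of per-pixel decorrelation under discussion here: derive a scramble seed by hashing the pixel coordinates, so neighbouring pixels get unrelated seeds. The Wang hash below is a stand-in for illustration; the actual hashes in the patch (e.g. cmj_hash_simple) differ:

```c
#include <stdint.h>

/* Wang hash: a cheap integer hash often used for seeding per-pixel RNGs. */
static uint32_t hash_wang(uint32_t seed)
{
  seed = (seed ^ 61u) ^ (seed >> 16);
  seed *= 9u;
  seed ^= seed >> 4;
  seed *= 0x27d4eb2du;
  seed ^= seed >> 15;
  return seed;
}

/* Hypothetical per-pixel seed: mix x and y through the hash so that
 * neighbouring pixels (x, y) and (x + 1, y) land far apart in seed space. */
static uint32_t pixel_seed(uint32_t x, uint32_t y)
{
  return hash_wang(x * 0x9e3779b9u ^ hash_wang(y));
}
```

If the same hash (or a too-simple one) is reused for both the pixel scramble and the path-tracing dimensions, correlations like the ones described above can leak through.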

Uses Owen shuffle on the full sample set instead of the smaller subset to give better sample dimension mixing.

William Leeson (leesonw) updated this revision to Diff 41278. Edited Aug 31 2021, 4:27 PM

I did an analysis of all the different methods and got the following. I think one could get the same results with uniform divisions and CPR with Owen shuffling, which in theory would remove the need to create the PMJ table. Notice that none of the xor techniques seem to converge correctly, probably due to some properties of the CMJ hash (I tried both the simple and the full one).


The results are all in the following spreadsheet

William Leeson (leesonw) updated this revision to Diff 41304. Edited Sep 1 2021, 10:06 AM

This is the best sampling scheme from the graph above; it generally runs faster than Sobol with similar convergence. It uses the PMJ table of size 32 with CPR. I also put the various options behind #ifdefs: the xor shuffle can be chosen with _XOR_SHUFFLE_, and CPR can be removed with _NO_CRANLEY_PATTERSON_ROTATION_, in case one wants to alter it later. The scheme used by default is a PMJ table with Owen shuffling and CPR; this way you get an infinite sample set with convergence properties very similar to Sobol. The light spread is still not rendered as smoothly as with Sobol at low sample counts, but it is similar to the original PMJ and, unlike the original PMJ scheme, it converges correctly. The scheme chosen is pmj_cpr_owen_32 from the graph.

I also added an #ifdef to be able to select the simple hash if desired.
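Not the patch's actual code, but a sketch of how an Owen-style shuffle of a sample index can be built from bit reversal plus a Laine-Karras-style permutation; the names and hash constants below are illustrative assumptions:

```c
#include <stdint.h>

/* Portable bit reversal (no intrinsics), for illustration. */
static uint32_t reverse_u32(uint32_t x)
{
  x = (x >> 16) | (x << 16);
  x = ((x & 0xFF00FF00u) >> 8) | ((x & 0x00FF00FFu) << 8);
  x = ((x & 0xF0F0F0F0u) >> 4) | ((x & 0x0F0F0F0Fu) << 4);
  x = ((x & 0xCCCCCCCCu) >> 2) | ((x & 0x33333333u) << 2);
  x = ((x & 0xAAAAAAAAu) >> 1) | ((x & 0x55555555u) << 1);
  return x;
}

/* Laine-Karras-style permutation: each step is invertible because the
 * multipliers are even, so bit i only depends on bits below it. */
static uint32_t laine_karras_permute(uint32_t x, uint32_t seed)
{
  x += seed;
  x ^= x * 0x6c50b47cu;
  x ^= x * 0xb82f1e52u;
  x ^= x * 0xc7afe638u;
  x ^= x * 0x8d22f6e6u;
  return x;
}

/* Owen-style shuffle of a sample index: reverse, permute, reverse back. */
static uint32_t owen_shuffle(uint32_t i, uint32_t seed)
{
  return reverse_u32(laine_karras_permute(reverse_u32(i), seed));
}
```

Because every step is a bijection, distinct indices always map to distinct shuffled indices for a given seed, which is what lets the full PMJ table act as one long, well-mixed sample stream.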

Defaults to the full hash instead of the simple one. I had accidentally submitted it using the simple hash.

Brecht Van Lommel (brecht) requested changes to this revision. Sep 1 2021, 5:04 PM

In terms of performance and noise this looks good to me now. Only some comments on the implementation details.

intern/cycles/kernel/kernel_jitter.h
18

Remove this.

81

Own -> Owen

112

Cranley Petterson -> Cranley-Patterson

157

Cranley Petterson -> Cranley-Patterson

intern/cycles/kernel/kernel_random.h
144

Leave out hash_tong since the unlicense is a bit unusual and we're not using the code.

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

See my previous comment on this:

I can't find this in GCC, which is what we use for release builds. It seems GCC has __builtin_bswap32 instead.

I don't think __has_builtin exists for visual studio, so that might not compile. Not sure if there is an equivalent builtin function there.

This revision now requires changes to proceed. Sep 1 2021, 5:04 PM
Ray Molenkamp (LazyDodo) added inline comments.
intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

There is no built-in intrinsic for this. Also, MSVC does not support __has_builtin and, to pile on some more bad news, the preprocessor does not do short-circuit evaluation, so something like

#elif !defined(_MSC_VER) && __has_builtin(__builtin_bitreverse32)

won't work either.

the neatest way to go about this would probably be something like sticking

#if defined(__GNUC__)
#  if __has_builtin(__builtin_bitreverse32)
#    define HAS_BITREVERSE32
#  endif
#endif

into compat.h and having a #elif defined(HAS_BITREVERSE32) over here.

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

My comment about __builtin_bswap32 is wrong, it reverses bytes, not bits.

Perhaps the best solution here would be to implement our own bitreverse with intrinsics, if this really matters for performance.

For Arm there is vrbitq_u8.

For SSE this could work:
https://gist.github.com/sneves/8702353

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

We'd have to test; I'm not super sure SSE will actually be faster when you only need/want the result of a single lane.

William Leeson (leesonw) marked an inline comment as done.

Updated code based on comments from @Brecht Van Lommel (brecht)

William Leeson (leesonw) marked 5 inline comments as done. Sep 1 2021, 10:09 PM

Code is updated now with the changes.

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

bswap32 reorders the bytes whereas bitreverse32 reorders the bits so it won't work as a replacement. Not all x86 CPUs have a bit reverse instruction but ARM does so it should help on OS X but not generally on x86 unless you compile for a specific CPU version. I added #if defined(__has_builtin) && __has_builtin(__builtin_bitreverse32) so that should fix the MSVC case.

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

I came up with this

/* Reverses the bits of an integer */
#define REVERSEBITS(V) \
  uint32_t mask = 0b11111111111111110000000000000000; \
  V = (V & mask) >> 16 | (V & ~mask) << 16; \
  mask = 0b11111111000000001111111100000000; \
  V = (V & mask) >>  8 | (V & ~mask) <<  8; \
  mask = 0b11110000111100001111000011110000; \
  V = (V & mask) >>  4 | (V & ~mask) <<  4; \
  mask = 0b11001100110011001100110011001100; \
  V = (V & mask) >>  2 | (V & ~mask) <<  2; \
  mask = 0b10101010101010101010101010101010; \
  V = (V & mask) >>  1 | (V & ~mask) <<  1

/* Reverses the bits of an integer */
ccl_device_inline uint32_t reverse_integer_bits(uint32_t b)
{
#if defined(__KERNEL_CUDA__)
  b = __brev(b);
#elif defined(__has_builtin)
#if __has_builtin(__builtin_bitreverse32)
  b = __builtin_bitreverse32(b);
#else
  REVERSEBITS(b);
#endif
#else
  REVERSEBITS(b);
#endif
  return b;
}

It's not perfect; I could replace the macro with an inline function. As for the performance, I am not sure it is worth it; most of the time gains were from reducing the texture size.
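For what it's worth, the macro above could be made hygienic without converting it to a function: wrapping the body in do { ... } while (0) makes it expand to a single statement and keeps the local mask scoped. A sketch (the _SKETCH name is mine, not the patch's):

```c
#include <stdint.h>

/* Sketch: the same bit-reversal steps as the macro above, wrapped in
 * do { ... } while (0) so it expands to one statement and its local
 * `mask` variable cannot clash with the caller's scope. */
#define REVERSEBITS_SKETCH(V) \
  do { \
    uint32_t mask = 0xFFFF0000u; \
    (V) = ((V) & mask) >> 16 | ((V) & ~mask) << 16; \
    mask = 0xFF00FF00u; \
    (V) = ((V) & mask) >> 8 | ((V) & ~mask) << 8; \
    mask = 0xF0F0F0F0u; \
    (V) = ((V) & mask) >> 4 | ((V) & ~mask) << 4; \
    mask = 0xCCCCCCCCu; \
    (V) = ((V) & mask) >> 2 | ((V) & ~mask) << 2; \
    mask = 0xAAAAAAAAu; \
    (V) = ((V) & mask) >> 1 | ((V) & ~mask) << 1; \
  } while (0)
```

The do/while wrapper is the usual idiom for multi-statement macros; it also lets the macro be used safely inside an unbraced if/else.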

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

I added #if defined(__has_builtin) && __has_builtin(__builtin_bitreverse32) so that should fix the MSVC case.

That doesn't work; like I said, the preprocessor evaluates the full expression regardless of whether the first part is true or not. This is not specific to MSVC; neither Clang nor GCC will have a happy time with:

#include <stdio.h>

int main()
{
    #if defined(test) && test(1)
    printf("ok!");
    #endif
    return 0;
}
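One common way out of this, sketched below under the assumption that the compiler (e.g. MSVC) lacks __has_builtin entirely, is to define it to 0 when absent so later #if __has_builtin(...) checks are safe everywhere; popcount is used here only as a neutral example:

```c
#include <stdint.h>

/* If the compiler does not provide __has_builtin (e.g. MSVC), define it to
 * always answer "no"; the #if below then falls through to the portable path.
 * This avoids the short-circuit problem entirely: the macro always exists. */
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

static int popcount32(uint32_t x)
{
#if __has_builtin(__builtin_popcount)
  return __builtin_popcount(x);
#else
  /* Kernighan's trick: clear the lowest set bit until none remain. */
  int n = 0;
  for (; x; x &= x - 1u)
    n++;
  return n;
#endif
}
```

Either branch gives the same result, so the fallback only costs performance, never correctness.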
intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

For SSE this could work: https://gist.github.com/sneves/8702353

Unsure what is up with their selector; I had to fix the code for it to give identical results to our CPU implementation, but the following gives identical results in about half the time in my synthetic (outside of cycles-x) tests on my Ivy Bridge i7, so that is better than I expected.

static inline uint32_t bitrev_sse3(uint32_t value)
{
	const __m128i sel = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
	const __m128i rev0 = _mm_set_epi8(0x0F, 0x07, 0x0B, 0x03, 0x0D, 0x05, 0x09, 0x01, 0x0E, 0x06, 0x0A, 0x02, 0x0C, 0x04, 0x08, 0x00);
	const __m128i rev1 = _mm_set_epi8(0xF0, 0x70, 0xB0, 0x30, 0xD0, 0x50, 0x90, 0x10, 0xE0, 0x60, 0xA0, 0x20, 0xC0, 0x40, 0x80, 0x00);
	const __m128i c0f = _mm_set1_epi8(0x0F);

	__m128i t1, t2, t3, t4,x;
	x = _mm_set_epi32(value, 0, 0, 0);
	t1 = _mm_shuffle_epi8(x, sel);   /* reverse bytes */
	t2 = _mm_and_si128(t1, c0f);     /* x & 0xf */
	t3 = _mm_srli_epi32(_mm_andnot_si128(c0f, t1), 4); /* x & 0xF0 >> 4 */

	t2 = _mm_shuffle_epi8(rev1, t2); /* rev(x) << 4 */
	t3 = _mm_shuffle_epi8(rev0, t3); /* rev(x)      */

	t4 = _mm_or_si128(t2, t3); /* combine 4-bit halves */
	return _mm_extract_epi32(t4, 0);
}

I came up with the following last night after running a few trials

static inline uint32_t bitReverse1( uint32_t x )
{
#if defined(__arm__) || defined(__aarch64__)
    __asm__( "rbit %0, %1" : "=r" ( x ) : "r" ( x ) );
    return x;
#elif defined(__KERNEL_CUDA__)
    x = __brev(x);
    return x;
#elif defined(__has_builtin)
#if __has_builtin(__builtin_bitreverse32)
    x = __builtin_bitreverse32(x);
    return x;
#endif
#endif
    // Flip pairwise
    x = ( ( x & 0x55555555 ) << 1 ) | ( ( x & 0xAAAAAAAA ) >> 1 );
    // Flip pairs
    x = ( ( x & 0x33333333 ) << 2 ) | ( ( x & 0xCCCCCCCC ) >> 2 );
    // Flip nibbles
    x = ( ( x & 0x0F0F0F0F ) << 4 ) | ( ( x & 0xF0F0F0F0 ) >> 4 );

    // Flip bytes. CPUs have an instruction for that, pretty fast one.
#ifdef _MSC_VER
    return _byteswap_ulong( x );
#elif defined(__INTEL_COMPILER)
    return (uint32_t)_bswap( (int)x );
#else
    // Assuming gcc or clang
    return __builtin_bswap32( x );
#endif
}

It appears to be faster than the SSE version and has fewer requirements. I did some tests and got the following results:
(1) Duration = 0.007383 secs for 10000000 iterations value = -1 (new code above)
(2) Duration = 0.007554 secs for 10000000 iterations value = -1 (original version)
(3) Duration = 0.009827 secs for 10000000 iterations value = -1 Knuth bit reverse
(4) Duration = 0.007732 secs for 10000000 iterations value = -1 Classic binary partitioning
(5) Duration = 0.014615 secs for 10000000 iterations value = -1 (SSE4 code from above)
(6) Duration = 0.037416 secs for 10000000 iterations value = -1 (simple loop)

I fixed the spelling mistakes :-( and spent a bit of time finding a good bit reverse solution. I figured the artists would appreciate the quicker renders. It now uses the built-in functions where they are provided, even for ARM :-) Also, I tried a few different bit reverse algorithms on both gcc and clang (unfortunately I don't have a Windows machine, so I could not check MSVC). I got the following results from those trials:

(1) Duration = 0.007383 secs for 10000000 iterations value = -1 Reworked original algorithm
(2) Duration = 0.007554 secs for 10000000 iterations value = -1 Original version
(3) Duration = 0.009827 secs for 10000000 iterations value = -1 Knuth bit reverse
(4) Duration = 0.007732 secs for 10000000 iterations value = -1 Classic binary partitioning
(5) Duration = 0.014615 secs for 10000000 iterations value = -1 SSE4 provided by @Ray Molenkamp (LazyDodo)
(6) Duration = 0.037416 secs for 10000000 iterations value = -1 Simple loop

The original version proved the best on gcc. However, for clang the built-in bit reverse was best, and the results in general were significantly better, by a factor of ~1.25. Both compilers did a pretty good job of vectorizing the code when given the option (-msse2, -msse3, -msse4.1, -msse4.2, etc.). I also came up with a slightly better solution for the preprocessor gymnastics in the previous version that I hope is easier to read. You may want to move the definition of __has_builtin to the compat.h header, but for now I put it in util_math.h as this is the only place that uses it.

Sergey Sharybin (sergey) requested changes to this revision. Sep 2 2021, 4:43 PM

Nice work!

The file from T79190 renders so much cleaner now. There still seems to be increased noise in light_spread.blend (don't forget to switch it to PMJ when doing tests), but it might be something we have to accept and investigate separately. I did not see any noise increase comparing the old and new PMJ in other benchmark scenes. The new code seems to be about 0.4% slower on an RTX 6000 (averaged over a few runs), which is acceptable.

intern/cycles/kernel/kernel_jitter.h
77
103

Make sure code is formatted with clang-format (either make format or via IDE integration). The #ifndef _SIMPLE_HASH_ is supposed to be indented as # ifndef _SIMPLE_HASH_.

103–104

This seems suspicious to me: the code uses cmj_randfloat_simple in the case where _SIMPLE_HASH_ is NOT defined.
To rephrase: it is the cmj_randfloat_simple codepath which is active in the current state of the patch.

Is this expected, or should it be #ifdef _SIMPLE_HASH_ instead?

intern/cycles/util/util_math.h
796 ↗(On Diff #41365)

Run clang-format.

But what is more important, please add a test to util_math_test.cpp to make sure the implementation matches across different platforms/CPUs.

This revision now requires changes to proceed. Sep 2 2021, 4:43 PM

I have updated the comments and added a test to util_math_test.cpp for the bit reverse. I tweaked the CPR to only jitter within the PMJ cells/intervals, which reduces the noise in the light spread scene. This does a better job than using the simple hash RNG that was used previously to match the Sobol scheme.

William Leeson (leesonw) marked 5 inline comments as done. Sep 3 2021, 12:11 PM
William Leeson (leesonw) added inline comments.
intern/cycles/kernel/kernel_jitter.h
77

This should be fixed now, I went over all the comments and adjusted them based on the guide.

103

I ran make format and checked in the changes so this should be good now.

103–104

This was to match the Sobol sampler, but now I tried just using CPR within the cells and intervals of the PMJ table, which looks much better.

intern/cycles/util/util_math.h
796–797 ↗(On Diff #41320)

This is fixed now; I added a __has_builtin definition for when it does not exist, so it works on MSVC and others.

796–797 ↗(On Diff #41320)

Yeah, sorry, I had pushed my comments before I noticed this :-(

William Leeson (leesonw) marked 4 inline comments as done.

Add back in the shuffle to the PMJ table generator.

William Leeson (leesonw) updated this revision to Diff 41410. Edited Sep 3 2021, 2:14 PM

Small update to fix a typo :-/

Ray Molenkamp (LazyDodo) requested changes to this revision. Sep 3 2021, 4:05 PM
Ray Molenkamp (LazyDodo) added inline comments.
intern/cycles/util/util_math.h
798 ↗(On Diff #41410)

At this point I'm just confused: what is happening here? Depending on the platform/compiler at play, this function can now return 3 possible outputs for a single input. Is this really something we'd want to do?

For x = 0x12345678:

__asm__("rbit %0, %1" : "=r"(x) : "r"(x));  // Can't validate, but i'd expect this to be 1e6a2c48
__brev(x);                                  // Can't validate, but i'd expect this to be 1e6a2c48
__builtin_bitreverse32(x);                  // tested: 1e6a2c48
/* Flip pairwise. */                        // tested: 482c6a1e
_byteswap_ulong(x);                         // tested: 78563412
(uint32_t)_bswap((int)x);                   // tested: 78563412
__builtin_bswap32(x);                       // tested: 78563412

Judging by this function's name, I'd expect the correct answer to be 1e6a2c48.

Also, in its current form it will still not build on MSVC due to the unprotected use of __has_builtin.

We'd probably want to take a step back here and reevaluate what we want to do before adding any more changes.

This revision now requires changes to proceed. Sep 3 2021, 4:05 PM
intern/cycles/util/util_math.h
798 ↗(On Diff #41410)

The first 3 options just call the respective built-in functions for reversing the bits. The remaining part does some bit flipping first and then uses the byte swapping functions to do the rest, as they use fewer instructions. So you can't just use a byte swap on its own: you need the bit flipping code first, and then the bswap or _byteswap_ulong routine completes the operation. That is why the results differ above.

Looks good from my side.

This also seems to fix a PMJ correlation issue in a recent test .blend file, which is nice.

Checked reverse_integer_bits on CUDA, GCC, Clang, and ICC. In all cases reverse_integer_bits(0x12345678) gives 0x1e6a2c48.

Some minor inline comments which can be addressed without an extra review iteration; otherwise I think this is a good improvement!

intern/cycles/kernel/kernel_jitter.h
113

Cranley-Patterson

intern/cycles/util/util_math.h
798 ↗(On Diff #41410)

I forgot to say that I also added the #define __has_builtin fallback just above the function to address the MSVC issue.

Yup, looks like I missed the pre-processing step; I can also validate the outputs now.

This revision is now accepted and ready to land. Sep 3 2021, 6:57 PM

I fixed the warning suggested by Sergey, and while doing that I noticed that the 1D sampler had a small error: it was using 1/NUM_PMJ_DIVISIONS instead of 1/NUM_PMJ_SAMPLES. This would not cause any sampling issues but would just increase the noise slightly, as it was jittering outside the interval. This is because the 1D samples are stratified into intervals of size 1/NUM_PMJ_SAMPLES, not into the 1/NUM_PMJ_DIVISIONS x 1/NUM_PMJ_DIVISIONS boxes of the 2D sampler.
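To illustrate the fix described above: with N = NUM_PMJ_SAMPLES strata in 1D, sample i must be jittered within [i/N, (i+1)/N), i.e. with an interval width of 1/NUM_PMJ_SAMPLES; jittering by the wider 1/NUM_PMJ_DIVISIONS would push samples outside their stratum. A hedged sketch with illustrative names, not the patch's actual code:

```c
/* Stratified 1D jitter sketch: place sample i of num_samples inside its
 * own interval [i/N, (i+1)/N) using a random value rand01 in [0, 1).
 * Jittering with a larger interval width would leak into neighbouring
 * strata, adding noise without changing the expected value. */
static float jitter_1d(int i, int num_samples, float rand01)
{
  float inv_n = 1.0f / (float)num_samples;
  return ((float)i + rand01) * inv_n;
}
```

For example, with 4 samples, sample 3 always lands in [0.75, 1.0) no matter the random value, which is exactly the property the wrong interval width was breaking.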