Cycles: fix Progressive Multi-Jittered sampling.
Closed, Public

Authored by Nathan Vegdahl (cessen) on Aug 21 2022, 7:26 PM.

Details

Summary

This is a follow-up to D15679, completing what was still left over in order to fully replace D15485. Namely, fixing the PMJ sampler.

There were two main things that were broken in the PMJ sampler, resulting in more noise:

  • The sample pattern generating code itself wasn't quite right, and resulted in patterns that weren't fully progressively stratified in the way that PMJ02 is supposed to be. (Edit: turns out PMJ02 doesn't have to be fully progressively stratified. But it is nevertheless an improvement, and one that the PMJ authors recommend.) Specifically, only power-of-two *prefixes* were progressively stratified, not suffixes. This resulted in unnecessarily increased noise when using non-power-of-two sample counts.
  • In order to try to get away with just a single sample pattern, the code used a combination of sample index shuffling and Cranley-Patterson rotation. Index shuffling is normally fine, but because the sample patterns themselves weren't quite right (as described above), it actually resulted in additional noise. Cranley-Patterson rotation, on the other hand, always increases noise with randomized (t,s) nets like PMJ02, and should be avoided with these kinds of sequences (a reference sketch of the rotation follows this list).
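
For reference, here is a minimal sketch of Cranley-Patterson rotation itself (the general technique, not code from this patch): each pixel/dimension adds a random offset to the sample, wrapping modulo 1, which decorrelates pixels but destroys the fine stratification of a (t,s) net.

#include <cmath>

/* Cranley-Patterson rotation: offset a [0, 1) sample by a random
 * per-pixel value, wrapping around at 1.0. Illustration only. */
float cranley_patterson_rotate(float sample, float offset)
{
  float x = sample + offset;
  return x - std::floor(x); /* fract(): keep the result in [0, 1) */
}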

This patch fixes those issues with the following changes:

  • It replaces the sample pattern generation code with a much simpler algorithm recently published in the paper "Stochastic Generation of (t, s) Sample Sequences". This new implementation is easier to verify, produces fully progressively stratified PMJ02, and is *far* faster than the previous code, being O(N) in the number of samples generated.
  • It keeps the sample index shuffling, which now works correctly thanks to the improved sample patterns. But it now uses a newer high-quality hash instead of the original Laine-Karras hash.
  • The scrambling distance feature cannot (to my knowledge) be implemented with any decorrelation strategy other than Cranley-Patterson, so Cranley-Patterson is still used when that feature is enabled. But it is now disabled otherwise, since it increases noise.
  • In place of Cranley-Patterson, multiple independent patterns are generated and randomly chosen for different pixels and dimensions, as described in the original PMJ paper. In this patch, the pattern selection is done via hash-based shuffling to ensure there are no repeats within a single pixel until all patterns have been used (a rough sketch follows this list).
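
One simple way to get that no-repeats property (an illustration, not the exact code in this patch; NUM_PATTERNS and the function name are hypothetical) is to permute the dimension index with a per-pixel value. XOR with a fixed value is a bijection on [0, NUM_PATTERNS) when the pattern count is a power of two, so every pattern is used exactly once before any repeats:

/* Hypothetical sketch: pick a pattern per (pixel, dimension) with no
 * repeats within a pixel until all patterns have been used. Assumes
 * NUM_PATTERNS is a power of two. */
uint pattern_for_dimension(uint pixel_seed, uint dimension)
{
  /* XOR with a fixed value permutes [0, NUM_PATTERNS), so dimensions
   * 0..NUM_PATTERNS-1 each map to a distinct pattern. */
  return (dimension ^ pixel_seed) & (NUM_PATTERNS - 1);
}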

The combination of these fixes brings the quality of Cycles' PMJ sampler in line with the previously submitted Sobol-Burley sampler in D15679. They are essentially indistinguishable in terms of quality/noise, which is expected since they are both randomized (0,2) sequences.

Diff Detail

Repository
rB Blender

Event Timeline

Nathan Vegdahl (cessen) requested review of this revision. Aug 21 2022, 7:26 PM
Nathan Vegdahl (cessen) created this revision.
  • Properly mask off sample index shuffling in PMJ sampler.
Nathan Vegdahl (cessen) edited the summary of this revision. (Show Details) Aug 21 2022, 8:05 PM
Nathan Vegdahl (cessen) edited the summary of this revision. (Show Details) Aug 21 2022, 8:12 PM
  • Whitespace cleanup.

Overall seems fine.
One detail that I noticed: the shuffling logic here completely reorders the samples inside each pattern. I had assumed that this is a problem, since the patterns are designed so that the full pattern and each power-of-two prefix are properly stratified, but not each random subset. So, if we render with e.g. 32 samples, instead of using the first 32 samples of the pattern in a shuffled order (and getting a stratified prefix), we use 32 random samples out of the 1024. Hence the reverse(hash(reverse(i), seed) ^ hash(0, seed)) trick in my patch, which ensures that we shuffle only within each power-of-two increase. Is that not actually needed? Maybe I just misunderstood the PMJ paper.

  • Run clang-format.

I had assumed that this is a problem, since the patterns are designed so that the full pattern and each power-of-two prefix are properly stratified, but not each random subset. So, if we render with e.g. 32 samples, instead of using the first 32 samples of the pattern in a shuffled order (and getting a stratified prefix), we use 32 random samples out of the 1024.

Completely random contiguous subsets aren't fully stratified, correct. For example, if you take 32 samples starting from index 13, those samples aren't stratified. But the first and second halves of a (power-of-two) PMJ02 sequence are fully stratified, so if you have a 1024-long sequence then you can start from 512, for example. But also, the first and second halves of each of those halves are also stratified. And the halves of those halves as well. And so on, all the way down to single samples. So if you want 32 samples, the only requirement for it being stratified is that its starting index is a multiple of 32. And if you want 8 samples, it only needs to be aligned to a multiple of 8. Etc.
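
To make that alignment rule concrete, here is a trivial check (an illustration of the rule stated above, not code from the patch):

#include <cassert>
#include <cstdint>

/* A contiguous window of `count` samples from a fully progressive (0,2)
 * sequence is stratified when `count` is a power of two and the window
 * starts at a multiple of `count`. */
bool window_is_stratified(uint32_t start, uint32_t count)
{
  assert(count != 0 && (count & (count - 1)) == 0); /* power of two */
  return (start % count) == 0; /* 32 samples from index 512: yes;
                                  32 samples from index 13: no */
}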

Doing a nested uniform shuffle on sample indices ensures that even though the starting sample can be any random sample, subsequent samples will always be selected such that the contiguous set stays aligned to the next power of two (e.g. when you're at 5, 6, 7, or 8 samples, it will always be 8-aligned).
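
For reference, a nested uniform shuffle of sample indices can be built from the reverse(hash(reverse(i), seed)) construction mentioned above, using a hash in which each output bit depends only on bits at or below it. Here is a sketch using the Laine-Karras permutation (the constants are the ones reproduced in Burley's hash-based Owen scrambling paper, and should be treated as an assumption; this patch itself uses a newer hash):

#include <cstdint>

/* Reverse the bits of a 32-bit integer. */
uint32_t reverse_bits(uint32_t x)
{
  x = (x << 16) | (x >> 16);
  x = ((x & 0x00ff00ffu) << 8) | ((x & 0xff00ff00u) >> 8);
  x = ((x & 0x0f0f0f0fu) << 4) | ((x & 0xf0f0f0f0u) >> 4);
  x = ((x & 0x33333333u) << 2) | ((x & 0xccccccccu) >> 2);
  x = ((x & 0x55555555u) << 1) | ((x & 0xaaaaaaaau) >> 1);
  return x;
}

/* Each output bit depends only on input bits at or below it, so in
 * reversed-bit space this acts as an Owen (nested uniform) scramble. */
uint32_t laine_karras_permutation(uint32_t x, uint32_t seed)
{
  x += seed;
  x ^= x * 0x6c50b47cu;
  x ^= x * 0xb82f1e52u;
  x ^= x * 0xc7afe638u;
  x ^= x * 0x8d22f6e6u;
  return x;
}

uint32_t nested_uniform_shuffle(uint32_t i, uint32_t seed)
{
  return reverse_bits(laine_karras_permutation(reverse_bits(i), seed));
}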

Having said that, the sample pattern generation code (before this patch) didn't actually follow that property in its generated sequences. So indeed, the kinds of scrambling you could do with that was more limited.

Maybe I just misunderstood the PMJ paper.

I may have also misunderstood. I was fairly sure the original paper arranged the sequence order to have the property I mentioned above, but maybe I was mistaken? That would then explain why the generating code in Cycles didn't have that property. But in general, it's a really important property for progressive sequences like this to have (even sans shuffling), because without it you can have arbitrarily bad noise increases at non-power-of-two sample counts. For example, imagine if the second 512 samples of a 1024-length sequence were organized so that its first 256 samples were entirely in the top half of the unit square and the second 256 in the bottom. So I was fairly confident the original paper ensured that property. But I can double-check.

In any case, the new sample pattern generation method that I've implemented does have that property. So it's at least valid to shuffle samples that way now.

  • Improved code comments.

@Lukas Stockner (lukasstockner97)
Ah, also, to further clarify: nested uniform shuffling doesn't ensure that the selected samples are actually contiguous in the sequence, but they don't need to be. They just need to be filled in the right order within an aligned power-of-two segment. Basically, as long as all power-of-two subdivisions of your sequence are individually stratified (which is true of Sobol and I think of PMJ02 as well), that both implies that the sequence is progressively stratified and that nested uniform shuffling won't harm that progressive stratification.

  • Fix accidental compile fail in previous commit.
  • Fix typo in comment.

@Lukas Stockner (lukasstockner97)

I had assumed that this is a problem, since the patterns are designed so that the full pattern and each power-of-two prefix are properly stratified, but not each random subset.

I went back and re-read the paper, and you were totally right. PMJ02 as specified in the original paper doesn't guarantee that suffixes or power-of-two divisions are fully stratified. They do suggest making it that way in this quote from the paper:

A useful improvement of the pmj02 samples between octaves can be obtained by a simple change: choose sub-quadrants such that the new samples are themselves (0,2) sequences. For example, the samples 32...47 and 48...63 should themselves be (0,2) sequences. This yields sequences that fully match Owen-scrambled Sobol’s error for all sample counts.

But it doesn't appear they actually show how to do that in their example code(?). (They also discuss how not doing so hurts convergence at non-power-of-two sample counts.) So that explains why the implementation in Cycles was the way it was.

In any case, their more recently published paper "Stochastic Generation of (t, s) Sample Sequences" (which is what this patch's implementation is based on) does follow that property. This means both better behavior at non-power-of-two sample counts and that it's fine to directly shuffle the point indices with a nested uniform scramble.

Nathan Vegdahl (cessen) edited the summary of this revision. (Show Details) Aug 22 2022, 3:51 AM
Nathan Vegdahl (cessen) edited the summary of this revision. (Show Details) Aug 22 2022, 3:57 AM
Nathan Vegdahl (cessen) edited the summary of this revision. (Show Details) Aug 22 2022, 4:00 AM

The implementation looks fine, will run some benchmarks here.

Do you plan to do additional changes to sampling patterns soon, since blue noise and 3D/4D sampling were mentioned in previous tasks? If so I might hold off on committing this to avoid updating the reference images multiple times.

Performance seems to be about the same on CPU and GPU; the difference is below timing noise.

Improvements in regression tests match the images from D15195.

This revision is now accepted and ready to land. Aug 24 2022, 3:43 PM
  • Seed PMJ sample pattern generator properly.

The seed input appears to be a simple incrementing integer, so it
needs to be randomized before use.
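
A sketch of what "randomized before use" can look like, using a Hash Prospector-style integer hash (the lowbias32 constants below are an assumption; the actual symbol in Cycles may differ):

#include <cstdint>

/* Scramble a plain incrementing seed into a well-distributed value.
 * Constants from Hash Prospector's lowbias32 (assumed). */
uint32_t randomize_seed(uint32_t i)
{
  i ^= i >> 16;
  i *= 0x7feb352du;
  i ^= i >> 15;
  i *= 0x846ca68bu;
  i ^= i >> 16;
  return i;
}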

Do you plan to do additional changes to sampling patterns soon, since blue noise and 3D/4D sampling were mentioned in previous tasks? If so I might hold off on committing this to avoid updating the reference images multiple times.

The version of screen-space blue-noise sampling I'm planning to implement would be built on top of the Sobol-Burley sampler (it requires a long sequence), so the PMJ sampler shouldn't be affected.

Having said that, I haven't replaced all uses of the CMJ hash yet: it's also used outside of the PMJ sampler, such as in the subsurface scattering code. Doing so would affect at least some of the reference images, so if we want to do that I can roll that into this PR.

Edit: forgot to respond to the 3d/4d sampling bit.

3d/4d sampling would also only apply to the Sobol-Burley sampler, since PMJ02 can only produce 2D sample sequences (at least, without an expensive minutes-long optimization step).

Although we could compute Owen-scrambled Sobol tables at runtime in place of PMJ02, since it performs the same but has natural extensions to 3d/4d (and can also be generated by the same code, just with a different xor table). But I think that starts to get into a larger discussion of what we want the default sampler to be, and I think some more exploration would probably be valuable before that.

So for the time being, I think just fixing PMJ02 is fine...?

If you want to change some CMJ hashes, that would indeed be good to include in this.

I agree that we can do a Sobol-Burley update separately.

  • Remove CMJ hash, and replace its uses with the Hash-Prospector hash.
  • Change the Wang float hash to produce [0.0, 1.0) instead of [0.0, 1.0], to be consistent with the Hash-Prospector hash, since it's intended as a faster drop-in replacement. (A sketch of the [0.0, 1.0) mapping follows.)
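
A minimal sketch of producing [0.0, 1.0) from a 32-bit hash (an illustration, not the actual patch code): keep only as many high bits as fit exactly in a float mantissa, so the scaled result stays strictly below 1.0.

#include <cstdint>

/* Map a 32-bit hash to [0.0, 1.0). The top 24 bits fit exactly in a
 * float mantissa, so (h >> 8) * 2^-24 is always < 1.0. */
float hash_to_float_01(uint32_t h)
{
  return (h >> 8) * (1.0f / 16777216.0f); /* 16777216 = 2^24 */
}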

Great! I've taken care of the CMJ hash now. So as long as everything looks good to you, it's ready to land.

One more question about this in kernel/sample/pattern.h. Is it still valid?

ccl_device_inline bool sample_is_even(int pattern, int sample)
{
  if (pattern == SAMPLING_PATTERN_PMJ) {
    /* See Section 10.2.1, "Progressive Multi-Jittered Sample Sequences", Christensen et al.
     * We can use this to divide the sample sequence into two classes for easier variance
     * estimation. */
    return popcount(uint(sample) & 0xaaaaaaaa) & 1;
  }
  else {
    /* TODO(Stefan): Are there reliable ways of dividing CMJ and Sobol into two classes? */
    return sample & 0x1;
  }
}

And can Sobol-Burley be enabled for adaptive sampling?

return popcount(uint(sample) & 0xaaaaaaaa) & 1;

As long as this does what's described in the paper, it's still valid, yes. It's not immediately obvious to me why it does what the paper describes, but I assume whoever wrote it worked that out. (I did a quick test, and it does appear to yield the pattern from the paper.)

And can Sobol-Burley be enabled for adaptive sampling?

We can do that, but not with the code given. With Sobol (and by extension, Sobol-Burley), my understanding is that you want to use one of the unused dimensions to determine which sample class to put the sample in. For example, if our highest sampling dimension is 2 (as it currently is), then we would use dimension 3 with an n < 0.5 test for the classification. If we move to using 3 or 4 dimensions, then we'd use the 4th or 5th dimension, respectively.

Since it's just an n < 0.5 test, I believe we can optimize it with bit-fiddling similar to the current PMJ sample classification code, so we wouldn't actually have to fully compute the dimension. I'd be happy to take a crack at that in another patch if you'd like me to.

Thanks for checking. If you can get Sobol-Burley to work for adaptive sampling that would be great.

I implemented adaptive sampling for Sobol-Burley (not submitted yet, for reasons I'll explain).

I started off by just doing n < 0.5 on the third dimension, as reference. Then I implemented the bit-fiddling version (and in the process re-derived how the current adaptive PMJ implementation works). To use the third Sobol dimension for classifying the samples, it's just this:

popcount(index & 0xdb6db6db) & 1

(The constant is the most-significant-bit column of the Sobol matrix. And popcount(index & column) & 1 multiplies the index by that most-significant-bit column over GF(2).)
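
In kernel style, the equivalence looks something like this (a sketch of what's described above, not the submitted code):

/* Classify a sample without computing the full third Sobol dimension:
 * popcount(index & column) & 1 is a GF(2) dot product of the index bits
 * with the matrix column that produces the value's most significant
 * bit, i.e. it equals (dimension_3 >= 0.5) for the unscrambled
 * sequence. */
ccl_device_inline bool sobol_dim3_sample_class(uint index)
{
  return popcount(index & 0xdb6db6db) & 1;
}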

The bit-fiddling version matched the reference n < 0.5 version exactly, so it seems I did it correctly.

However, just for fun, I decided to also try the 0xaaaaaaaa constant from the adaptive PMJ implementation. And it actually seemed to work better. So then I also tested it against the 3rd and 4th dimensions of Sobol (since we might want to use those at some point). And it worked really well for those as well.

In retrospect, this isn't super surprising. The first two dimensions of Sobol are an (0, 2) sequence, just like PMJ02, and the order in which it places samples follows a very similar pattern. But the other way to look at it (which also explains the results with dimensions 3 and 4) is that 0xaaaaaaaa is significantly different from any of the Sobol matrices' most-significant-bit columns, so it doesn't correlate with them. Which at least suggests that it potentially can classify them into two groups effectively.

Having said all of that, I haven't investigated this thoroughly yet, so there may yet be issues I haven't run across. It seems to work remarkably well in my testing so far. But I'd like to take some time to dive in and verify that using 0xaaaaaaaa actually is good for Sobol-Burley as well, and that it's not just a coincidence with the scenes I happen to be testing.

I've investigated this more thoroughly, and discovered some interesting things. But the short version is: we should probably use 0xaaaaaaaa for Sobol-Burley as well.

(In all the images below, I'm showing only one class of samples--half of the total.)

On unshuffled Sobol sequences, 0xaaaaaaaa works terribly:

But shuffling the sample indices changes things dramatically:

For comparison, here is the same shuffled sequence, but using the third sobol dimension to classify the samples:

In other words, with index shuffling, both approaches end up with a fair bit of randomness. Importantly, it doesn't make them completely random. Here's a high-sample-count comparison to pure random samples:

[Images: "Classified Sobol" vs. "Random"]

The classified samples are more evenly distributed, with less clumping.

Based on just that, it seems like it doesn't matter whether we use a higher Sobol dimension or 0xaaaaaaaa to do the classification. But at low sample counts 0xaaaaaaaa seems to be better. For example, here are worst-case outcomes at 8 samples (4 samples in each class, only one class shown):

[Images: "Higher Sobol dimension" vs. "0xaaaaaaaa"]

0xaaaaaaaa does a better job of ensuring that each dimension is individually sampled relatively evenly, even if there are holes in the higher-dimensional plot. (Note: the best case has one sample in each quadrant for both approaches.) This seems to be reflected in the render tests I've done, where 0xaaaaaaaa does a better job of identifying areas where the noise has already converged, which in turn results in lower render times.

All the 2d projections of 4-dimensional shuffled Sobol also seem to perform about the same as shown above with each approach. So this should also be good if we move to using more dimensions of Sobol-Burley as well.

I also checked the "Sample Count" render layer results against adaptive pmj02 on a variety of scenes, and Sobol-Burley with 0xaaaaaaaa appears to give more-or-less identical results to pmj02.

So I feel pretty confident at this point that we should just use 0xaaaaaaaa for both pmj02 and Sobol-Burley.

(As an aside: I'm also getting increasingly suspicious that shuffled scrambled Sobol and (fully progressive) pmj02 are just two ways of creating the same randomized sequence.)

As far as moving forward goes, we have two sampling-related patches in flight right now: this patch and D15788. Does it perhaps make sense to just get these two patches merged first, and then I can submit a third small patch enabling adaptive sampling for Sobol-Burley?

intern/cycles/util/hash.h:492

I noticed that result should actually be i; I'll fix that in the commit.

Thanks a lot for figuring out the adaptive sampling stuff. I'll commit the PMJ changes now, and then you can submit another patch.

intern/cycles/util/hash.h:492

Oof! That's quite the goof on my part. Thanks for catching that!