
Cycles X: Shading performance improvements by changing inlining behavior for SVM
ClosedPublic

Authored by Patrick Mours (pmoursnv) on Jul 5 2021, 6:16 PM.

Details

Summary

I ran some profiles to get a better understanding of where the current Cycles X CUDA/OptiX kernels are limited, and the most to gain was in the shading ones.

The setup kernels (reset, init_from_camera, ...) are primarily limited by the memory system (the GPU is throttling them because of the amount of memory requests they execute), which makes sense, since they don't do much work apart from setting up the state in global memory. Structure-of-arrays access helps immensely with caching across threads, but many writes in sequence with no math in between to hide latency still cause throttling: the GPU only has a limited-size queue to handle those requests, and when that is full, it cannot accept more and has to stall execution until there is space again. It's hard to work around that. I tried some ideas, like reducing the number of threads per block for those kernels to 128 to give each more space in L1 cache, but couldn't really get noticeable performance benefits (some metrics like L1 hit rate improved, but it didn't help the overall picture).

The shading kernels (shade_surface, ...) are also heavily limited by memory, but for more diverse reasons. I found several hotspots where execution was stalled waiting for spills to be loaded back into registers (both value and synchronization registers). That's something that is more easily adjustable by changing the inlining logic:
For example, the compiler did not inline "kernel_write_denoising_features" (even though it was marked __inline__), which caused it to force synchronization before the function call. Forcing it inline avoided that and got rid of that hotspot.
Then there was cubic texture filtering and NanoVDB, which introduced huge code chunks into each texture sampling evaluation (increasing register and instruction cache pressure), even though they are rarely actually used. Making them __noinline__ outsources that overhead to only occur when actually used.
Another case is the SVM. The compiler currently converts the node type switch statement into a binary searched branch sequence (there is some hope that in future it will compile that to an indirect jump, like it does for simpler switch statements, but that is not currently the case). This means depending on the SVM node hit, the GPU has to branch over large portions of code, which increases instruction cache pressure immensely (GPU is fetching lots of code even for stuff it immediately jumps away from again, while jumping through the binary searched branches). This can be reduced somewhat by making all the node functions __noinline__, so that the GPU only has to branch over a bunch of call instructions, rather than all the inlined code. As a side effect this also reduced register pressure, making it more localized to each node, which reduced spills, which in turn reduced local memory traffic, which had a positive effect on overall performance.
The SVM "offset" value is now passed by value into the node functions and returned through the function return value, to make the compiler keep it in a register. Otherwise, when passed as a pointer, in OptiX the compiler was forced to move it into local memory (since functions are compiled separately there, the compiler is unaware of how that pointer is used; in CUDA this is less of an issue, since the compiler should be able to figure that out as part of optimizing the whole kernel).
These changes improved overall CUDA/OptiX render times by up to 10% on both an RTX 2080 Ti and an RTX 3090. I ran various different scenes (including one with NanoVDB and one with an AO shader node), and in all cases performance was slightly better. But I'd encourage testing on your systems too. I didn't check the effect on CPU performance yet.

RTX 2080 Ti with OptiX:

                              Before                        After
bmw27                         13.02                         11.44
classroom                     8.15                          7.59
junkshop                      23.8                          22.92
pavillon_barcelona            12.21                         11.07

Diff Detail

Repository
rB Blender

Event Timeline

Patrick Mours (pmoursnv) requested review of this revision.Jul 5 2021, 6:16 PM
Patrick Mours (pmoursnv) created this revision.
Patrick Mours (pmoursnv) edited the summary of this revision. (Show Details)Jul 5 2021, 6:19 PM
Patrick Mours (pmoursnv) edited the summary of this revision. (Show Details)

Nice detailed explanation :)
I think the change does make sense. Would be curious to see numbers from Brecht's system though.

Result on RTX6000

                              new                           cycles-x                      
bmw27.blend                   9.88381                       10.1719                       
classroom.blend               14.5931                       15.0894                       
pabellon.blend                8.6788                        8.9799                        
monster.blend                 10.3733                       10.6686                       
barbershop_interior.blend     10.175                        10.9238                       
junkshop.blend                15.5218                       16.3647                       
pvt_flat.blend                14.9                          14.985

Result on i9-11900k

                              new                           cycles-x                      
bmw27.blend                   133.83                        140.702                       
classroom.blend               219.058                       220.596                       
pabellon.blend                126.818                       126.892                       
monster.blend                 134.699                       137.614                       
barbershop_interior.blend     170.23                        169.374                       
junkshop.blend                191.095                       193.06                        
pvt_flat.blend                130.669                       132.284

Thanks for the investigation. I guess we need to do more to reduce memory traffic. We have some ideas in T87836 to try for that.

With an RTX A6000, I also get some speedups:

                                         new                  old                  
barbershop_interior                      6.8324s              7.0675s              
bmw27                                    7.2180s              7.4321s              
classroom                                9.3709s              9.6966s              
junkshop                                 10.0036s             10.3727s             
monster                                  6.3217s              6.5380s              
pabellon                                 5.3753s              5.6998s

Note these were run using the benchmark tool that is now in the cycles-x branch.
https://wiki.blender.org/wiki/Tools/Tests/Performance

This revision is now accepted and ready to land.Jul 6 2021, 2:00 PM