I ran some profiles to get a better understanding of where the current Cycles X CUDA/OptiX kernels are limited, and most of the potential gains were in the shading kernels.
The setup kernels (reset, init_from_camera, ...) are primarily limited by the memory system: the GPU throttles them because of the sheer number of memory requests they issue. That makes sense, since they don't do much work apart from setting up the state in global memory. Structure-of-arrays access helps immensely with caching across threads, but many writes in sequence with no math in between to hide latency still cause throttling, since the GPU only has a limited-size queue to handle those requests, and when it is full, execution has to stall until there is space again. That is hard to work around. I tried some ideas, like reducing the number of threads per block for those kernels to 128 to give each thread more space in the L1 cache, but couldn't get noticeable performance benefits (some metrics like L1 hit rate improved, but it didn't help the overall picture).
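For reference, a minimal sketch of that block size experiment, using `__launch_bounds__` to cap the block size at 128 threads. The kernel name and body are made up for illustration, not the actual Cycles setup kernel:

```cpp
/* Hypothetical example: limit the block size of a setup-style kernel to 128
 * threads so each thread gets a larger share of L1 cache and registers. */
__global__ void __launch_bounds__(128)
    kernel_init_from_camera_example(float *path_state, int num_paths)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_paths)
    return;

  /* Mostly plain writes to global memory with little math in between,
   * which is what ends up saturating the memory request queues. */
  path_state[i] = 0.0f;
}
```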
The shading kernels (shade_surface, ...) are also heavily limited by memory, but the reasons there are more diverse. I found several hotspots where execution was stalled waiting for spills to be loaded back into registers (both values and synchronization registers). That is something that is more easily adjustable by changing the inlining logic:
For example, the compiler did not inline "kernel_write_denoising_features" (even though it was marked __inline__), which forced synchronization before the function call. Forcing it inline avoided that and got rid of the hotspot.
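Roughly what that looks like. The function name matches the one mentioned above, but the signature and body here are placeholders:

```cpp
/* Hedged sketch: replacing __inline__ (a hint the compiler may ignore) with
 * __forceinline__ guarantees the call is inlined, so no synchronization or
 * spill is needed around a real function call. Signature and body are
 * illustrative only. */
__device__ __forceinline__ void kernel_write_denoising_features(
    float *render_buffer, const float3 albedo, const float3 normal)
{
  render_buffer[0] += albedo.x;
  render_buffer[1] += albedo.y;
  render_buffer[2] += albedo.z;
  render_buffer[3] += normal.x;
  render_buffer[4] += normal.y;
  render_buffer[5] += normal.z;
}
```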
Then there was cubic texture filtering and NanoVDB, which introduced huge code chunks into every texture sampling evaluation (increasing register and instruction cache pressure), even though they are rarely actually used. Making them __noinline__ moves that overhead so it is only paid when those paths are actually taken.
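A simplified illustration of the idea, with made-up function names rather than the actual Cycles texture code:

```cpp
/* Hedged sketch: keep the common bilinear path inline, but push the large,
 * rarely taken cubic path behind a __noinline__ call so its registers and
 * instructions are only paid for when it is actually hit. */
__device__ __noinline__ float4 tex_sample_cubic_example(cudaTextureObject_t tex,
                                                        float x, float y)
{
  /* Placeholder for the large cubic filtering code path. */
  return tex2D<float4>(tex, x, y);
}

__device__ float4 tex_sample_example(cudaTextureObject_t tex,
                                     float x, float y, bool cubic)
{
  if (cubic) {
    return tex_sample_cubic_example(tex, x, y); /* Rare, heavy path. */
  }
  return tex2D<float4>(tex, x, y); /* Common, cheap path stays inline. */
}
```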
Another case is the SVM. The compiler currently converts the node type switch statement into a binary-searched branch sequence (there is some hope that it will eventually compile this to an indirect jump, like it does for simpler switch statements, but that is not currently the case). This means that depending on the SVM node hit, the GPU has to branch over large portions of code, which increases instruction cache pressure immensely (the GPU fetches lots of code even for things it immediately jumps away from again while working through the binary-searched branches). This can be reduced somewhat by making all the node functions __noinline__, so that the GPU only has to branch over a bunch of call instructions rather than all the inlined code. As a side effect, this also reduced register pressure by localizing it to each node, which reduced spills, which in turn reduced local memory traffic, which had a positive effect on overall performance.
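A stripped-down sketch of the dispatch pattern (not the real SVM code, just to illustrate why __noinline__ shrinks what the switch has to branch over):

```cpp
/* With the node evaluation functions marked __noinline__, the switch only
 * branches over short call instructions instead of the fully inlined node
 * bodies, keeping the instruction footprint of the dispatch loop small.
 * Node names and bodies are placeholders. */
__device__ __noinline__ void svm_node_example_a(float *stack) { stack[0] += 1.0f; }
__device__ __noinline__ void svm_node_example_b(float *stack) { stack[0] *= 2.0f; }

__device__ void svm_eval_example(const int *nodes, float *stack, int num_nodes)
{
  for (int offset = 0; offset < num_nodes; offset++) {
    switch (nodes[offset]) {
      case 0:
        svm_node_example_a(stack);
        break;
      case 1:
        svm_node_example_b(stack);
        break;
      default:
        break;
    }
  }
}
```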
The SVM "offset" value is now passed by value into the node functions and returned through the function return value, to make the compiler keep it in a register. When it was passed as a pointer, the OptiX compiler was forced to move it into local memory (functions are compiled separately there, so the compiler cannot see how that pointer is used; in CUDA this is less of an issue, since the compiler should be able to figure that out while optimizing the whole kernel).
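Roughly the before/after shape of that change, with placeholder node functions:

```cpp
/* Hedged before/after sketch: returning the new offset by value lets the
 * compiler keep it in a register, whereas a pointer parameter forced it into
 * local memory under OptiX, where the separately compiled callee's use of
 * the pointer is opaque. */

/* Before: offset passed by pointer, may end up in local memory with OptiX. */
__device__ __noinline__ void svm_node_by_pointer(const int *nodes, float *stack,
                                                 int *offset)
{
  stack[0] = (float)nodes[*offset];
  *offset += 1;
}

/* After: offset passed by value and returned, so it stays in a register. */
__device__ __noinline__ int svm_node_by_value(const int *nodes, float *stack,
                                              int offset)
{
  stack[0] = (float)nodes[offset];
  return offset + 1;
}
```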
These changes improved overall CUDA/OptiX render times by up to 10% on both an RTX 2080 Ti and an RTX 3090. I tested various scenes (including one with NanoVDB and one with an AO shader node), and in all cases performance was slightly better. But I'd encourage you to test on your system first too. I haven't checked the effect on CPU performance yet.
RTX 2080 Ti with OptiX:
| Scene | Before (s) | After (s) |
| --- | --- | --- |
| bmw27 | 13.02 | 11.44 |
| classroom | 8.15 | 7.59 |
| junkshop | 23.80 | 22.92 |
| pavillon_barcelona | 12.21 | 11.07 |