This patch reduces thread divergence in kernel_shaders_sort.
Rays are sorted in packs of 2048 according to sd->shader.
The execution time of kernel_shader_sort is halved, giving a performance boost in Classroom (~30%) and Pabellon Barcelone (~8%).
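As a rough illustration of the idea (not the kernel code from this patch), sorting rays in fixed-size packs by shader can be sketched on the host side in C++. The function name `sort_rays_by_shader`, the constant `kSortBlockSize`, and the plain `shader_ids` array standing in for sd->shader are all assumptions made for this example; the actual patch presumably does the sort inside the split kernel using local memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical constant: pack size taken from the "packs of 2048" above.
constexpr std::size_t kSortBlockSize = 2048;

// Returns a permutation of ray indices. Inside every pack of `block_size`
// consecutive rays, indices are ordered by shader id, so rays that evaluate
// the same shader end up next to each other. Rays never move between packs.
// `shader_ids[i]` is an assumed stand-in for the sd->shader value of ray i.
std::vector<std::uint32_t> sort_rays_by_shader(
    const std::vector<int>& shader_ids,
    std::size_t block_size = kSortBlockSize)
{
    std::vector<std::uint32_t> order(shader_ids.size());
    std::iota(order.begin(), order.end(), 0u);  // identity permutation

    for (std::size_t begin = 0; begin < order.size(); begin += block_size) {
        const std::size_t end = std::min(begin + block_size, order.size());
        // Stable sort keeps the original ray order among equal shader ids.
        std::stable_sort(order.begin() + begin, order.begin() + end,
                         [&](std::uint32_t a, std::uint32_t b) {
                             return shader_ids[a] < shader_ids[b];
                         });
    }
    return order;
}
```

Sorting only within a pack keeps the working set bounded, while neighbouring threads that later evaluate shaders through this permutation are more likely to run the same shader, which is where the reduced divergence would come from.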
Differential D2598
Cycles: Split kernel - sort shaders
Authored by Hristo Gueorguiev (nirved) on Apr 4 2017, 12:30 PM.
Event Timeline

Nice speedups! A few style issues need fixing, and maybe some renaming; see inline comments.

Also, instead of

shader_eval_1
shader_eval_sort
shader_eval_2

can we have this?

shader_setup
shader_sort
shader_eval

I can't seem to get it working with CUDA for some reason: the kernels fail to run. Maybe it has something to do with local memory?

Only minor stuff, I don't want to be too picky. I think it will be fine to commit, but it would be nice if @Sergey Sharybin (sergey) or someone else could test the CUDA split kernel first (I would test it myself, but I've been having issues with CUDA lately).