Instead of calling into CUDA once per sample, do a loop for the number of samples being rendered at once. Adjust the number of samples dynamically to target 1.6s per CUDA kernel run.
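To make the idea concrete, here is a rough sketch of what the patch does, not the actual Cycles code; `kernel_path_trace_batch`, `trace_one_sample` and `adapt_batch_size` are made-up names for illustration:

```cpp
// Hypothetical sketch of the batching idea, not the real Cycles kernel:
// one launch integrates a whole batch of samples instead of a single one.

__device__ float trace_one_sample(int x, int y, int sample)
{
	/* Placeholder for the real per-sample path tracing code. */
	return 0.0f;
}

__global__ void kernel_path_trace_batch(float *buffer, int sample_start,
                                        int num_samples, int width, int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if(x >= width || y >= height)
		return;

	/* The sample loop lives inside the kernel, so one launch covers the
	 * whole batch and the per-launch overhead is paid only once. */
	for(int s = sample_start; s < sample_start + num_samples; s++)
		buffer[y * width + x] += trace_one_sample(x, y, s);
}

/* Host side: scale the batch so each launch takes roughly the 1.6s target. */
int adapt_batch_size(int current_batch, double last_launch_seconds)
{
	const double target = 1.6;
	if(last_launch_seconds < 1e-3)
		last_launch_seconds = 1e-3;
	double scaled = current_batch * (target / last_launch_seconds);
	return scaled < 1.0 ? 1 : (int)scaled;
}
```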
In my informal local benchmarks, it seems to run more than twice as fast as stock Blender. (On a GTX 970.)
I'd be happier with that if I understood why.
Somehow, it's specifically because the loop is inside the shader; if I switch to representing the loop as a z dimension at the CUDA level, the speedup goes away.
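For comparison, the z-dimension variant would look roughly like this (again a hypothetical sketch, reusing the `trace_one_sample` placeholder from above): one thread per (pixel, sample) pair instead of a per-thread sample loop.

```cpp
// Sketch of the z-dimension variant: blockIdx.z selects the sample.
__global__ void kernel_path_trace_z(float *buffer, int sample_start,
                                    int width, int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int s = sample_start + blockIdx.z;
	if(x >= width || y >= height)
		return;

	/* Multiple threads now accumulate into the same pixel, so the write
	 * has to be atomic in this sketch. */
	atomicAdd(&buffer[y * width + x], trace_one_sample(x, y, s));
}

/* Launched with the sample count as the grid's z dimension, e.g.
 * dim3 grid(blocks_x, blocks_y, num_samples); */
```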
With a change like this, two things happen:
- You reduce the overhead of kernel invocation.
- You increase the time between a tile being written back to Blender (the default frequency is once per second).
However, neither of these two things will give a 2x speedup unless you render an empty scene.
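(For a rough sense of scale, with ballpark numbers rather than anything measured here: a CUDA kernel launch costs on the order of 10µs, while a single-sample pass over a 256x256 tile of a non-trivial scene typically runs for milliseconds, so launch overhead on its own should be well under 1% of render time.)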
Here are benchmark results for the patch on a GTX 1080.
| Scene | Master | Patch |
| bmw27 | 02:50.34 | 02:45.18 |
| classroom | 06:20.92 | 06:25.14 |
| fishy_cat | 05:46.16 | 05:35.99 |
| koro | 12:32.39 | 12:45.36 |
| pavillon_barcelone | 07:50.73 | 07:27.84 |
As you can see, the speedup is within 5%. Not sure why you see a much higher speedup. What is your exact setup, and what scene are you rendering?
Though disappointing, it's good to see that the world at large is largely sane and it's only my computer that does this weird thing.
I'm running a GTX 970 on Ubuntu with driver 367.57.
I suspect now that the speedup is due to the fact that I'm running a machine learning task in the background. Depending on how the GPU's scheduler works, having fewer tasks that run for longer might mean they get more compute time?
Hm no, that's not entirely it.
That seems to play a part, but even with no other load on the system I still get
04:36.34 with the patch applied
07:21.41 with stock Blender.
That's with 576 samples per pixel, a 256x256 tile size, and a 1920x1080 image. I suspect at that point we really do run into sync overhead.
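(For scale: at 1920x1080 with 256x256 tiles that's 8x5 = 40 tiles, so the stock one-launch-per-sample path amounts to roughly 40 × 576 ≈ 23,000 kernel launches plus the periodic tile updates, versus only a handful of large launches per tile with the patch.)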
Try a fast scene with a large number of samples.
Ugh, you really shouldn't have anything running in the background when doing a benchmark. I would even suggest closing any CPU-side applications.
If your background task uses more complex kernels, then surely scheduling a bigger kernel from Cycles will improve performance for you.
But the thing is: it'll be fully dependent on the other tasks you're doing. There will never be a single best kernel in Cycles for multi-process GPGPU. It'll all depend on timing, the exact nature of the background task, the driver version, and so on.
@Martijn Berger (juicyfruit) had a different approach implemented here using the async API, which is closer to what we need.
Yeah I'd been looking into async myself but I couldn't figure out how to wait on multiple kernels to complete at once. Could you link it, please? It'd be interesting to see how he does it.
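(For reference, one standard way to launch several kernels asynchronously and then wait on all of them with the CUDA runtime API; a minimal standalone sketch with a dummy kernel, not taken from that patch:)

```cpp
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *data, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i < n)
		data[i] += 1.0f;
}

int main()
{
	const int num_streams = 4, n = 1 << 20;
	float *buffers[num_streams];
	cudaStream_t streams[num_streams];
	cudaEvent_t done[num_streams];

	for(int i = 0; i < num_streams; i++) {
		cudaMalloc(&buffers[i], n * sizeof(float));
		cudaStreamCreate(&streams[i]);
		cudaEventCreate(&done[i]);

		/* Asynchronous launch: returns immediately. */
		dummy_kernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(buffers[i], n);
		cudaEventRecord(done[i], streams[i]);
	}

	/* Wait on each kernel individually, or use cudaDeviceSynchronize()
	 * to wait for everything on the device at once. */
	for(int i = 0; i < num_streams; i++)
		cudaEventSynchronize(done[i]);

	for(int i = 0; i < num_streams; i++) {
		cudaFree(buffers[i]);
		cudaStreamDestroy(streams[i]);
		cudaEventDestroy(done[i]);
	}
	return 0;
}
```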
[edit] Bench for BMW from the benchmark set on my system:
05:25.92 bmw gpu stock
03:57.44 bmw gpu patched
Nevermind, I was comparing with an outdated Blender version.
Sorry for wasting your time.