Cycles X: set integrator state size relative to the number of GPU cores
Closed, Public

Authored by Brecht Van Lommel (brecht) on Sep 8 2021, 6:27 PM.

Details

Summary

More specifically, 16x the maximum number of threads across all
multiprocessors, with a minimum of 1048576.
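
A minimal sketch of how such a heuristic might be computed against the CUDA
runtime API (the function name and surrounding code are illustrative, not the
actual patch code):

    #include <algorithm>
    #include <cuda_runtime.h>

    /* 16x the maximum number of resident threads across all
     * multiprocessors, with a floor of 2^20 = 1048576 states. */
    static int compute_num_concurrent_states(int cuda_device)
    {
      cudaDeviceProp props;
      cudaGetDeviceProperties(&props, cuda_device);

      const int max_threads = props.maxThreadsPerMultiProcessor *
                              props.multiProcessorCount;

      /* On an RTX A6000 (84 SMs x 1536 threads) this gives roughly
       * 2M states, double the 1M floor that smaller GPUs fall back to. */
      return std::max(max_threads * 16, 1 << 20);
    }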

What this effectively does is double the state size on the very high end
GPUs like RTX A6000 and RTX 3080 while leaving the size unchanged for
others. On the RTX A6000 there are 2-10% render time reductions on our
benchmark scenes. The biggest reduction is on the barbershop interior, as
scenes with more objects and shaders are more likely to benefit from
improved coherence.

This also adds an environment variable for developers to test different
sizes, and debug logging about the size and memory usage.

Diff Detail

Repository
rB Blender
Branch
num-states (branched from master)
Build Status
Buildable 16879
Build 16879: arc lint + arc unit

Event Timeline

Brecht Van Lommel (brecht) requested review of this revision. Sep 8 2021, 6:27 PM
Brecht Van Lommel (brecht) created this revision.

That is an interesting heuristic!

Do you expect CYCLES_CONCURRENT_STATES_FACTOR to stick around for a while, or is it something short-lived? If the former, I would think it belongs in the Debug panel (it can be initialized from the environment variable, similar to some other fields in there).

Will run benchmarks later on.

I ran some tests to get a sense of how much increasing the state size helps in practice.

On my GPU increasing it by 2x gave a real speedup, but there are diminishing returns and high memory cost if we go much further than that. It would be good to test this on some other GPUs to get a better idea if the current heuristic works well enough.

A further optimization would be to dynamically increase the integrator state size if more GPU memory is available, but that's for another patch.

This is the benchmark configuration I used for generating the graph (you'd need to change the git hash to use it):

devices = ['OPTIX*']
categories = ['cycles']
revisions = {
    '0.125x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '0.125'}],
    '0.25x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '0.25'}],
    '0.5x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '0.5'}],
    '1x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '1'}],
    '2x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '2'}],
    '3x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '3'}],
    '4x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '4'}],
    '8x': ['fece896', {'CYCLES_CONCURRENT_STATES_FACTOR': '8'}],
}

The environment variable was mainly added for this purpose. I could add it as a debug option instead, though.
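
For reference, a hypothetical sketch of how such an environment override could
be applied (the variable name is from this patch; the surrounding code is
illustrative, not the actual implementation):

    #include <cstdlib>

    /* Scale the computed state count by the
     * CYCLES_CONCURRENT_STATES_FACTOR environment variable, if set. */
    static int apply_states_factor(int num_states)
    {
      if (const char *factor_str = getenv("CYCLES_CONCURRENT_STATES_FACTOR")) {
        const double factor = atof(factor_str);
        if (factor > 0.0) {
          num_states = (int)(num_states * factor);
        }
      }
      return num_states;
    }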

Slightly different picture from RTX 6000:

I'm happy to see this tackled. I used 2x on multi-GPU configurations of 2x RTX 3090 and 2x 3070Ti and it was still the optimum. Testing on older generations is still a todo. A formula based on the number of CUDA cores or something like that would indeed be ideal. In my experience with current Cycles (current master and 2.93), though, the number of CUDA cores doesn't really matter and there seem to be more like absolute optimums.

Is there a doc somewhere about the benchmark process? After applying the patch and entering the new git hash in the benchmark file, how do I start the actual benchmark and select the files (or the folder containing them) it should run with? I could test on 1050Ti, 980Ti, 1080Ti, 2070 Super, 2080Ti, 3070Ti and 3090.

Just curious about a few things I put in the comments.

intern/cycles/device/cuda/queue.cpp
42

Just curious if limiting the max based on the state size would matter, or does that not really play a factor? Also, would it be useful to factor in the scene size so as to limit the total overall memory, so that it fits better on the card or cards available?

Brecht Van Lommel (brecht) marked an inline comment as done. Sep 9 2021, 5:52 PM

Slightly different picture from RTX 6000:

It's confusing, but the 2x in your graph corresponds to the 1x in my graph, since I already applied the 2x change as part of this patch. That makes them a bit more similar.

For these two GPUs, I would say the current estimate is decent. There is still a little bit to be gained by dynamically increasing the size based on available memory, but it's not entirely obvious to me that we should. Results from more GPUs would help, though.

intern/cycles/device/cuda/queue.cpp
42

Yes, this is something worth investigating (it's a bullet point in T87836: Cycles: GPU Performance).

We currently allocate integrator state memory before scene memory, so we know it is on the GPU and not potentially moved to CPU host memory, which would be slow. My idea was to allocate a minimum size first, then allocate the scene, and then, if more space is left over on the GPU, increase the state size further. And then free that extra space quickly again to keep room for other GPU memory (like Blender textures and vertex buffers).
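
A rough, self-contained sketch of the sizing step in that scheme
(cudaMemGetInfo is the real CUDA call; everything around it is illustrative):

    #include <cstddef>
    #include <cuda_runtime.h>

    /* After scene upload: how many extra integrator states still fit,
     * keeping reserve_bytes free for textures, vertex buffers and
     * other applications sharing the GPU. */
    static size_t extra_states_that_fit(size_t state_size_bytes,
                                        size_t reserve_bytes)
    {
      size_t free_bytes = 0, total_bytes = 0;
      cudaMemGetInfo(&free_bytes, &total_bytes);

      if (free_bytes <= reserve_bytes) {
        return 0;
      }
      return (free_bytes - reserve_bytes) / state_size_bytes;
    }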

This is a bit more involved, so I wanted to do it as a separate patch. It's also not clear that it's worth doing if it only makes things a couple percent faster. Taking up nearly all GPU memory can negatively affect other Blender operations or other applications running in parallel with Cycles rendering.

It's confusing, but the 2x in your graph corresponds to the 1x in my graph

Ugh. But then it gets even more confusing: what is the measured speedup of this patch compared to the non-patched code? Or did you want to collect some statistics on the multiplier before getting the final before/after timing?

This revision is now accepted and ready to land. Sep 9 2021, 6:23 PM
Sergey Sharybin (sergey) requested changes to this revision. Sep 9 2021, 6:24 PM

Haiyaa. Didn't mean to accept, but I can't un-accept. Afraid you're doomed to have a red icon for the time being :(

This revision now requires changes to proceed. Sep 9 2021, 6:24 PM

I've compared current cycles-x branch with the patch applied:

Scene                                    cycles-x             D12432
barbershop_interior                      0.1373s              0.1353s
bmw27                                    0.0083s              0.0081s
classroom                                0.0876s              0.0885s
junkshop                                 0.0808s              0.0796s
monster                                  0.0412s              0.0414s
pabellon                                 0.0342s              0.0342s

Guess we do need to apply some multiplier to see a speedup.

Ok, I see. It is just how the math works out for my GPU.
If there is a speedup for the A6000, I think it is a good idea to commit the change. While further improvements are possible, this is already a measurable speedup for certain GPUs.

This revision is now accepted and ready to land. Sep 9 2021, 7:45 PM

The performance tool is described in our Wiki: https://wiki.blender.org/wiki/Tools/Tests/Performance

Thanks, will test ASAP! I couldn't find a parameter to designate a directory containing the blend files, so I guess it expects the .blend files from SVN in a hardcoded location?

Brecht Van Lommel (brecht) marked an inline comment as done. Sep 16 2021, 7:22 PM

I'll commit the patch in the current state, but more benchmarks are always welcome.