
Cycles: reduce CUDA stack memory usage for closures
Abandoned · Public

Authored by Brecht Van Lommel (brecht) on May 22 2016, 6:35 PM.

Details

Reviewers
None
Group Reviewers
Cycles
Summary
  • Perform closure merging earlier, as part of shader evaluation
  • Don't store unnecessary closures for emission/shadow shader evaluation
  • Reduce number of closures for emission/shadow evaluation to 2
  • Use same ShaderData for volumes and surfaces
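The closure-merging idea in the first bullet can be sketched roughly as follows. This is a simplified illustration in Python, not the actual Cycles C++ code; the `Closure` fields and the merge criterion (same type and parameters) are assumptions for the sake of the example:

```python
# Simplified sketch of merging duplicate closures during shader evaluation.
# In Cycles the real code operates on ShaderClosure structs in the kernel;
# here a closure is identified by its type and parameters, and a duplicate
# is merged by summing weights instead of occupying a new closure slot.
from dataclasses import dataclass

@dataclass
class Closure:
    type: str      # e.g. "diffuse_bsdf", "glossy_bsdf" (hypothetical names)
    params: tuple  # roughness, IOR, ... (hypothetical)
    weight: float

def add_closure(closures, new):
    """Append a closure, merging it into an existing compatible one."""
    for c in closures:
        if c.type == new.type and c.params == new.params:
            c.weight += new.weight   # merge: no extra storage used
            return
    closures.append(new)             # only distinct closures take a slot

closures = []
add_closure(closures, Closure("diffuse_bsdf", (0.0,), 0.5))
add_closure(closures, Closure("glossy_bsdf", (0.1,), 0.3))
add_closure(closures, Closure("diffuse_bsdf", (0.0,), 0.2))  # merged into the first
```

Merging earlier, during shader evaluation, means the duplicate never has to be stored at all, which is what allows the closure arrays (and hence the stack frame) to shrink.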

Before:

ptxas info    : Function properties for kernel_cuda_path_trace
    49232 bytes stack frame, 1992 bytes spill stores, 3588 bytes spill loads
ptxas info    : Used 40 registers, 364 bytes cmem[0], 1100 bytes cmem[2]

ptxas info    : Function properties for kernel_cuda_branched_path_trace
    68144 bytes stack frame, 1344 bytes spill stores, 3620 bytes spill loads
ptxas info    : Used 64 registers, 364 bytes cmem[0], 1180 bytes cmem[2]

After:

ptxas info    : Function properties for kernel_cuda_path_trace
    19104 bytes stack frame, 1908 bytes spill stores, 3628 bytes spill loads
ptxas info    : Used 40 registers, 364 bytes cmem[0], 1116 bytes cmem[2]

ptxas info    : Function properties for kernel_cuda_branched_path_trace
    28992 bytes stack frame, 1360 bytes spill stores, 3612 bytes spill loads
ptxas info    : Used 64 registers, 364 bytes cmem[0], 1196 bytes cmem[2]

So that's roughly a 60% reduction in stack memory usage.

To see how much overhead is left from closures: if we set max closures to 1, we get:

ptxas info    : Function properties for kernel_cuda_path_trace
    15008 bytes stack frame, 1920 bytes spill stores, 3624 bytes spill loads
ptxas info    : Used 40 registers, 364 bytes cmem[0], 1116 bytes cmem[2]

ptxas info    : Function properties for kernel_cuda_branched_path_trace
    17232 bytes stack frame, 1368 bytes spill stores, 3548 bytes spill loads
ptxas info    : Used 64 registers, 364 bytes cmem[0], 1196 bytes cmem[2]

Performance and correctness still need more testing, but in principle the reduced storage requirements should not affect any existing scenes, besides perhaps rendering slightly slower. Now that the merging happens earlier, we could consider lowering the max number of closures further, assuming that existing scenes mostly run out of closures due to duplicate BSDFs. But there could also be scenes that actually use 64 different closures.

Diff Detail

Repository
rB Blender
Branch
reduce

Event Timeline

Brecht Van Lommel (brecht) retitled this revision from to Cycles: reduce CUDA stack memory usage for closures.
Brecht Van Lommel (brecht) updated this object.

Tested on a GeForce 730 (sm_52), with the bmw.blend.

Master: 36.6s - 447MB (nvidia-smi)
Patch: 37.5s - 360MB (nvidia-smi)

Looks reasonable to me.

3% slower is not nothing though, I still have some tricks to try optimizing it.

Hi, made some tests too:

BMW
Master: 690 MB 03:12.00
Patch: 380 MB 03:06.00
Koro
Master: 990 MB 04:06.00
Patch: 680 MB 03:52.00
Fishy
Master: 1.6 GB 01:25.00
Patch: 680 MB 01:26.00

Rendered on both cards; got artifacts with the patched Blender and BMW.
EDIT: The BMW glass has artifacts on one GPU too.
nvidia-smi checked only the headless card.

Blender a6b2189
Opensuse Leap 42.1 x86_64
Intel i5 3570K
RAM 16 GB
GTX 760 4 GB /Display card
GTX 670 2 GB
Driver 361.42

Mib

Brecht Van Lommel (brecht) planned changes to this revision. Edited May 23 2016, 10:31 PM

Thanks for the tests, there indeed seems to be something I have to fix here. Testing with a GTX 960 I also saw 1-3% speedups rather than slowdowns, on multiple benchmark scenes.

In rB999d5a67852b: Cycles CUDA: reduce stack memory by reusing ShaderData, I've committed a simpler and safer change that already gets us a 57% reduction for path and 48% reduction for branched path, by explicitly reusing ShaderData memory. I would have expected the compiler to do some liveness analysis and notice it can save this memory, but that doesn't seem to be the case. So now in master I get these stats, and about 1-2% faster renders.

ptxas info    : Function properties for kernel_cuda_path_trace
    20992 bytes stack frame, 1924 bytes spill stores, 3636 bytes spill loads
ptxas info    : Used 40 registers, 364 bytes cmem[0], 1100 bytes cmem[2]

ptxas info    : Function properties for kernel_cuda_branched_path_trace
    35136 bytes stack frame, 1344 bytes spill stores, 3824 bytes spill loads
ptxas info    : Used 64 registers, 364 bytes cmem[0], 1180 bytes cmem[2]

On the BMW scene memory goes from 995 MB to 525 MB. Note that this 470 MB is a fixed number that depends only on the number of CUDA cores and tile size, not scene complexity, so with heavier scenes it's not going to look as impressive.
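The fixed-size nature of that memory follows from CUDA allocating local (stack) memory per resident thread, so a back-of-the-envelope estimate is simply stack bytes per thread times thread count. The thread count below is a hypothetical example value, not the actual launch configuration:

```python
# CUDA local memory is reserved per resident thread, so total usage is
# roughly stack_bytes_per_thread * num_threads, independent of scene
# complexity. The thread count here is a hypothetical illustration;
# in practice it depends on the number of CUDA cores and the tile size.
def local_mem_mb(stack_bytes_per_thread, num_threads):
    return stack_bytes_per_thread * num_threads / (1024 * 1024)

num_threads = 16384  # hypothetical resident-thread count
before = local_mem_mb(49232, num_threads)  # path kernel stack frame before
after = local_mem_mb(20992, num_threads)   # path kernel stack frame in master now
saved = before - after                     # a fixed saving, whatever the scene
```

Because the saving is a constant per thread, heavier scenes (whose memory is dominated by geometry and textures) will see the same absolute reduction but a smaller relative one.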

Hi, checked again with the latest master and got slightly higher memory usage and slowdowns on most test files.
For example:

Koro
Master a6b2189 04:06.00 990MB
Patch 03:56.00 680MB
Master 999d5a6 04:20.00 720 MB

My cards are CC 3.0.
Will check with buildbot tomorrow.

Updated the patch for the latest master, trying to avoid pointer indirection.

OK, the memory usage is normal, but I wouldn't expect that much of a slowdown. So this is a multi-GPU render, with the default koro_gpu.blend?

Ideally tests should be run e.g. 3 times, alternatingly with the new/old builds to remove bias due to system load, heat, etc. Multi GPU can also give quite random results depending on which GPU model happens to render the last tile, so always best to check with a single GPU for reference too.
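The alternating-runs methodology described above can be scripted, for example like this. The binary paths, CLI arguments, and run count are hypothetical, not taken from any actual test harness:

```python
# Sketch of the alternating benchmark methodology: run each build several
# times, interleaving old and new builds, to reduce bias from system load
# and GPU heat. The blender CLI usage below is a hypothetical example.
import subprocess
import time

def bench(binary, blendfile):
    """Time one background render of frame 1 (hypothetical CLI usage)."""
    start = time.perf_counter()
    subprocess.run([binary, "-b", blendfile, "-f", "1"], check=True)
    return time.perf_counter() - start

def compare(bench_fn, builds, runs=3):
    """Alternate builds within each round; report the best time per build."""
    times = {b: [] for b in builds}
    for _ in range(runs):
        for b in builds:              # old, new, old, new, ... interleaved
            times[b].append(bench_fn(b))
    return {b: min(t) for b, t in times.items()}

# e.g.: compare(lambda b: bench(b, "koro_gpu.blend"), ["blender-old", "blender-new"])
```

Taking the best of N interleaved runs per build filters out transient slowdowns from background load, which single back-to-back runs cannot.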

This is what I get with the latest patch, slightly less memory usage still. About the same render times as the original patch on my GTX 960.

ptxas info    : Function properties for kernel_cuda_path_trace
    16816 bytes stack frame, 1944 bytes spill stores, 3660 bytes spill loads
ptxas info    : Used 40 registers, 364 bytes cmem[0], 1112 bytes cmem[2]

ptxas info    : Function properties for kernel_cuda_branched_path_trace
    26752 bytes stack frame, 1340 bytes spill stores, 3912 bytes spill loads
ptxas info    : Used 64 registers, 364 bytes cmem[0], 1192 bytes cmem[2]
Brecht Van Lommel (brecht) planned changes to this revision.May 24 2016, 1:08 AM

Here are some stats to help understand stack memory usage. I didn't compare timings; that's not the purpose, and some of these configurations will obviously not work:

                                                      Path   Branched Path
Master                                               20992           35136
Set SVM_STACK_SIZE to 1                              19968           34112
Set BVH_STACK_SIZE to 1                              17936           32096
Share kernel_path_indirect ShaderData                20992           30144
Uninline kernel_path_indirect                        20992           29904
Uninline kernel_path_indirect + set MAX_CLOSURE to 1 12416           12752
Remove __inline__ from ccl_device                    21088           35584
Set __noinline__ on ccl_device                       27488           35792

For the case where kernel_path_indirect ShaderData is shared, we can break things down roughly like this.

                                           Size   Path   Branched Path
PathState                                   184    1+?             2+?
PathRadiance                                464      1               2
Ray                                         112    1+?             2+?
Intersection                                 24    1+?             2+?
ShaderData - ShaderClosures[MAX_CLOSURE]    432      2               4
ShaderClosure[MAX_CLOSURE]                 5120      2               4
SVM stack                                  1020      1               1
BVH stack                                   768      4               4
SubsurfaceIntersection                      352      1               1
SubsurfaceIndirectRays                     3600      1               1
Estimated Sum                                     19932           31356
Reported by ptxas                                 20992           30144

It's not really possible to sum everything that simply though, since the compiler doesn't have to keep local memory reserved for all this data all the time. For example, for the case where MAX_CLOSURE is set to 1, we might expect memory to go down to 20992 - (5120*63/64)*2 = 10912, but ptxas still gives us 12416. So there might still be some important memory usage that is not accounted for in this table.
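The arithmetic behind the table and the MAX_CLOSURE estimate can be checked directly (counts with "+?" taken at their minimum):

```python
# Estimated per-thread stack for the regular path kernel: sum of
# size * count from the table above, "+?" counts taken at minimum.
sizes_counts = [
    (184, 1),    # PathState
    (464, 1),    # PathRadiance
    (112, 1),    # Ray
    (24, 1),     # Intersection
    (432, 2),    # ShaderData minus the closures array
    (5120, 2),   # ShaderClosure[MAX_CLOSURE]
    (1020, 1),   # SVM stack
    (768, 4),    # BVH stack
    (352, 1),    # SubsurfaceIntersection
    (3600, 1),   # SubsurfaceIndirectRays
]
estimated = sum(size * count for size, count in sizes_counts)

# Expected stack with MAX_CLOSURE = 1: drop 63 of the 64 closures in both
# ShaderClosure arrays (5120 bytes / 64 closures = 80 bytes per closure).
expected_max_closure_1 = 20992 - (5120 * 63 // 64) * 2
```

The estimate lands at 19932 against the 20992 reported by ptxas for the path kernel, close but not exact, which is consistent with the compiler reserving some memory not covered by the table.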

I committed the optimization to share ShaderData for kernel_path_indirect in rBb49185df99d9: Cycles CUDA: reduce branched path stack memory by sharing indirect ShaderData., which will only affect branched path tracing. I found no performance regression with a GTX 960 on Windows.

It seems CUDA 8.0 RC is a bit better at reducing stack usage, but there is a render time regression.

                      Path     Branched Path   BMW
Master CUDA 7.5       20992B   30144B          00:52.19
Master CUDA 8.0 RC    17616B   26560B          00:58.11

Obsolete due to cycles-x.