
Cycles: Change device-only memory to actually only allocate on the device
Closed, Public

Authored by Patrick Mours (pmoursnv) on Feb 25 2021, 5:13 PM.

Details

Summary

Trying to render the Blender 2.92 splash screen (https://cloud.blender.org/p/gallery/60337d495677e942564cce76) with OptiX on a GPU with limited VRAM fails with an "OPTIX_ERROR_INVALID_VALUE in optixAccelBuild ..." error. That's rather cryptic.
What's actually going on is that Cycles is trying to build an OptiX acceleration structure in host memory (allocated with cuMemHostAlloc), which is not allowed (it has to be in device memory, from cuMemAlloc), hence the error. This patch addresses that by amending the MEM_DEVICE_ONLY type to actually only allocate on the device and fail if that is not possible because the device is out of memory. In that case Cycles will now return a "System is out of GPU memory" error message, which is much easier for end users to understand.
This should not be a problem, since MEM_DEVICE_ONLY was seldom used before anyway, and I changed the few instances that do not actually need this restriction to other memory types.
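For illustration, a minimal sketch of the intended allocation behaviour (not the actual patch code; MemoryType, MEM_DEVICE_ONLY and set_error are stand-ins for the Cycles equivalents):

```cpp
#include <cstdio>
#include <cuda.h>

/* Stand-ins for the Cycles types/functions, to keep the sketch self-contained. */
enum MemoryType { MEM_READ_WRITE, MEM_DEVICE_ONLY };
static void set_error(const char *msg) { fprintf(stderr, "%s\n", msg); }

/* Simplified illustration of the intended behaviour: MEM_DEVICE_ONLY must end
 * up in device memory or fail with a readable error, while other types may
 * still fall back to mapped host memory. */
static CUdeviceptr alloc_with_fallback_policy(size_t size, MemoryType type)
{
  CUdeviceptr device_pointer = 0;
  if (cuMemAlloc(&device_pointer, size) == CUDA_SUCCESS) {
    return device_pointer; /* Device allocation succeeded. */
  }

  if (type == MEM_DEVICE_ONLY) {
    /* No host fallback allowed: report this instead of letting a later
     * optixAccelBuild call fail with OPTIX_ERROR_INVALID_VALUE. */
    set_error("System is out of GPU memory");
    return 0;
  }

  /* Fall back to page-locked host memory mapped into the device address space. */
  void *host_pointer = nullptr;
  if (cuMemHostAlloc(&host_pointer, size, CU_MEMHOSTALLOC_DEVICEMAP) != CUDA_SUCCESS) {
    set_error("System is out of host memory");
    return 0;
  }
  CUdeviceptr mapped_pointer = 0;
  cuMemHostGetDevicePointer(&mapped_pointer, host_pointer, 0);
  return mapped_pointer;
}
```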

There is another problem here that I haven't addressed yet though:
The aforementioned scene shouldn't need as much memory as it does. I see a peak requirement of 16 GB, which takes a rather beefy GPU to meet, yet during actual rendering, after everything has been built and compacted, it sits at only around 8 GB.
The problem is that Cycles builds all the bottom-level acceleration structures for OptiX in parallel (geometry.cpp line 1933). Each acceleration structure build has to allocate some temporary memory on the GPU (for vertices, etc.), and since the builds run in parallel, those allocations accumulate to a huge amount of memory, which is where the peak comes from. If instead all bottom-level builds are forced to run serialized, the problem goes away and I see a peak of only 9 GB, which more consumer GPUs can handle.
I'm not sure how to expose this to users though. Ideally Cycles could automatically decide whether it makes more sense to run the builds in parallel or not, but that's probably difficult to predict. So maybe just add an option to choose? Or always run serialized for OptiX? (This problem happens with CUDA too, except that there it is system RAM that gets exhausted during the BVH2 build, rather than VRAM.)

Just noticed that this exact issue has also been reported here: T85985

Diff Detail

Repository
rB Blender

Event Timeline

Patrick Mours (pmoursnv) requested review of this revision. Feb 25 2021, 5:13 PM
Patrick Mours (pmoursnv) created this revision.
Brecht Van Lommel (brecht) requested changes to this revision. Mar 1 2021, 4:27 PM

I'm not sure how to expose this to users though. Ideally Cycles could automatically decide whether it makes more sense to run the builds in parallel or not, but that's probably difficult to predict. So maybe just add an option to choose? Or always run serialized for OptiX? (This problem happens with CUDA too, except that there it is system RAM that gets exhausted during the BVH2 build, rather than VRAM.)

It would be nice to make this more automatic. I imagine serializing this for many small meshes could have a serious performance impact?

Is there a simple heuristic we could use? Like, you need X amount of primitives to keep Y multiprocessors occupied?
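Purely as an illustration of that kind of heuristic (the threshold is made up):

```cpp
#include <cstddef>

/* Hypothetical heuristic, purely illustrative. A build large enough to keep
 * every multiprocessor busy by itself gains little from running concurrently
 * with other builds, while its temporary buffers dominate peak memory, so
 * serialize it; small builds keep running in parallel. */
static bool serialize_bvh_build(size_t num_primitives, int num_multiprocessors)
{
  const size_t prims_per_multiprocessor = 4096; /* made-up occupancy figure */
  return num_primitives >= size_t(num_multiprocessors) * prims_per_multiprocessor;
}
```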

intern/cycles/device/device_memory.h
273–275

I think it would be clearer to add a bool allow_host_memory_fallback = false parameter here, which then uses MEM_DEVICE_ONLY or MEM_READ_WRITE depending on the value.

Otherwise it's not obvious why you'd use one or the other in code.
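Roughly something like this (sketch only; the actual device_only_memory class in device_memory.h has more members and its constructor signature may differ):

```cpp
/* Rough sketch of the suggested interface, not the real header. Callers no
 * longer pick a MemoryType directly; the flag documents the intent at every
 * call site. */
template<typename T> class device_only_memory : public device_memory {
 public:
  device_only_memory(Device *device, const char *name, bool allow_host_memory_fallback = false)
      : device_memory(device,
                      name,
                      allow_host_memory_fallback ? MEM_READ_WRITE : MEM_DEVICE_ONLY)
  {
  }
};
```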

This revision now requires changes to proceed. Mar 1 2021, 4:27 PM

Implemented the allow_host_memory_fallback parameter (I agree, this is nicer) and fixed the high peak memory usage during OptiX acceleration structure building by limiting the actual OptiX acceleration structure build to a single thread at a time (using a mutex lock in build_optix_bvh).
This solved the problem in my tests while still keeping the rest of the bottom-level BVH build running in parallel, which is noticeably faster in some scenes than running everything serialized (presumably because of the curve conversion loops).
With this change, peak memory usage did not exceed the memory usage during rendering for the splash screen scene, so I was still able to render it successfully on a smaller GPU where it failed before. At the same time, loading speed did not perceptibly regress.
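A simplified sketch of the approach (not the actual diff; Cycles' thread_mutex/thread_scoped_lock wrappers from util_thread.h are reduced to std::mutex here):

```cpp
#include <mutex>

/* A single static mutex ensures only one OptiX acceleration structure build,
 * with its large temporary device buffers, is in flight at a time. */
static std::mutex optix_bvh_build_mutex;

static void build_optix_bvh(/* build input and output buffers elided */)
{
  /* Per-geometry preprocessing (e.g. the curve conversion loops) still runs
   * in parallel on the calling threads; only the memory-hungry build step
   * itself is serialized. */
  std::lock_guard<std::mutex> lock(optix_bvh_build_mutex);

  /* ... allocate temporary device memory, call optixAccelBuild, compact ... */
}
```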

Patrick Mours (pmoursnv) marked an inline comment as done.

Fixed incorrect parameter to some device_only_memory instances after previous change.

This revision is now accepted and ready to land. Mar 9 2021, 4:25 PM