
Cycles: Add support for P2P memory distribution (e.g. via NVLink)
ClosedPublic

Authored by Patrick Mours (pmoursnv) on Apr 14 2020, 5:23 PM.

Details

Summary

The current multi-device implementation in Cycles always duplicates memory for every device. This is generally good for performance, since each device has fast access to a local copy of the data. There are however systems that can share fast access to memory between devices. One example is multiple GPUs connected to each other via NVLink bridges: each such GPU has very fast access to the memory on the connected GPU, pretty much as if it were local.
In these cases it is beneficial to distribute memory across devices, since it reduces the overall memory footprint per device and makes it possible to load larger scenes that would otherwise not fit on the GPU. Two RTX 8000s connected via NVLink can store a 96GB scene this way, whereas right now they can only fit 48GB.
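For illustration only, here is a minimal sketch (not the actual Cycles code) of how CUDA exposes this capability: two GPUs are queried for mutual peer access and, if available, access is enabled in both directions so a kernel on one device can read allocations made on the other. The device indices 0 and 1 are placeholder assumptions.

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int can_access_01 = 0, can_access_10 = 0;
  cudaDeviceCanAccessPeer(&can_access_01, 0, 1);
  cudaDeviceCanAccessPeer(&can_access_10, 1, 0);

  if (can_access_01 && can_access_10) {
    /* Enable access in both directions; afterwards a kernel running on
     * device 0 can dereference pointers that were allocated on device 1. */
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); /* second argument is flags, must be 0 */
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    printf("P2P access enabled between devices 0 and 1\n");
  }
  else {
    printf("Devices 0 and 1 cannot access each other's memory\n");
  }
  return 0;
}
```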

This change modifies the multi-device implementation to support memory distribution across devices. To keep this independent of the device backend, the concept of P2P islands is introduced: rather than only keeping a list of all devices, the multi-device now also builds a list of groups of connected devices. In a system with 1 CPU and 4 GPUs, where the GPUs are connected pairwise via NVLink, this would for example create 3 P2P islands (the CPU, the first two GPUs and the last two GPUs). Memory then only has to be allocated once per island, rather than once per device.
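Below is a hedged sketch of how such islands could be built from pairwise peer-access queries. The Device struct and check_peer_access() helper are hypothetical stand-ins and do not reflect the actual Cycles device abstraction; the point is only that each device joins the first island whose members it can all peer with, and otherwise starts a new one.

```
#include <cuda_runtime.h>
#include <vector>

struct Device {
  int cuda_id; /* -1 represents the CPU device in this sketch */
};

static bool check_peer_access(const Device &a, const Device &b)
{
  if (a.cuda_id < 0 || b.cuda_id < 0)
    return false; /* CPU never shares an island with a GPU here */
  int ab = 0, ba = 0;
  cudaDeviceCanAccessPeer(&ab, a.cuda_id, b.cuda_id);
  cudaDeviceCanAccessPeer(&ba, b.cuda_id, a.cuda_id);
  return ab && ba;
}

/* Build P2P islands: memory then only needs to be allocated once per island
 * instead of once per device. */
std::vector<std::vector<Device>> build_islands(const std::vector<Device> &devices)
{
  std::vector<std::vector<Device>> islands;
  for (const Device &dev : devices) {
    bool placed = false;
    for (std::vector<Device> &island : islands) {
      bool all_peers = true;
      for (const Device &other : island) {
        if (!check_peer_access(dev, other)) {
          all_peers = false;
          break;
        }
      }
      if (all_peers) {
        island.push_back(dev);
        placed = true;
        break;
      }
    }
    if (!placed)
      islands.push_back({dev});
  }
  return islands;
}
```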

The "victor" benchmarking scene for example goes from ~8.3GB per device down to ~6GB on two RTX 2080 Ti.

Since this does not only apply to NVLink systems, a new option was added to the settings that toggles whether memory should be distributed (the new, additional behavior) or copied to each device (the old behavior). There are system configurations that support P2P, but at lower performance (e.g. P2P without NVLink), so always enabling it would not have been beneficial for those. To avoid cluttering the UI, this new option only becomes visible if the user selects at least two devices that support P2P access with each other.

Example: On this system the two RTX 2080 Tis are connected with an NVLink bridge and the RTX 8000 stands on its own:


Diff Detail

Repository
rB Blender
Branch
cycles_nvlink (branched from master)
Build Status
Buildable 8404
Build 8404: arc lint + arc unit

Event Timeline

Patrick Mours (pmoursnv) requested review of this revision. Apr 14 2020, 5:23 PM

Fixed rendering and denoising in viewport with P2P memory distribution

Fixed overlapping regions with viewport denoising by allocating viewport tile buffer on all devices again

I don't have the hardware to test this, but also could not find any bugs reading the code. So looks good to me.

This revision is now accepted and ready to land. Jun 2 2020, 8:29 PM

I noticed there may be a problem when the scene tries to allocate more memory than both GPUs can provide together and falls back to CPU memory, so I need to investigate some more before going through with committing.

Found a couple of bugs in the current CUDA memory management in Cycles when it comes to multiple devices.
For example, the code that moves textures from device to host memory failed to actually free up the device memory after the move as soon as multiple GPUs were enabled. In addition, the "texture_info" array was always moved to host memory on the first move (which may be a reason for T75955).

These are now fixed and things work properly with peer devices as well: if one GPU runs out of memory, it will only try to move textures from its own memory to the host and update any connected peer devices accordingly.
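As a rough illustration of that fallback path, here is a hedged sketch (not the actual Cycles implementation): a texture is moved into pinned host memory, the device copy is actually freed so the memory becomes available again, and a hypothetical update_peer_pointers() hook stands in for letting connected peer devices pick up the new location.

```
#include <cuda_runtime.h>

struct TextureAlloc {
  void *device_ptr = nullptr; /* memory on the GPU (possibly peer-visible) */
  void *host_ptr = nullptr;   /* pinned host memory after the move */
  size_t size = 0;
};

/* Hypothetical hook: in the real code, connected peer devices would be
 * updated to reference the new host location; here it is a placeholder. */
static void update_peer_pointers(TextureAlloc &) {}

bool move_texture_to_host(TextureAlloc &tex)
{
  /* Allocate pinned, mappable host memory so the GPU can still read it. */
  void *host_ptr = nullptr;
  if (cudaHostAlloc(&host_ptr, tex.size, cudaHostAllocMapped) != cudaSuccess)
    return false;

  /* Copy the data off the device, then actually free the device copy (the
   * bug described above was that this free was skipped with multiple GPUs
   * enabled). */
  cudaMemcpy(host_ptr, tex.device_ptr, tex.size, cudaMemcpyDeviceToHost);
  cudaFree(tex.device_ptr);

  tex.device_ptr = nullptr;
  tex.host_ptr = host_ptr;

  /* Let connected peer devices know the texture now lives in host memory. */
  update_peer_pointers(tex);
  return true;
}
```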