Now that OptiX denoising can be used on non-RTX GPUs, it will become more common for users to render with CUDA but denoise with OptiX. Previously this was very slow in the viewport, especially with multiple GPUs, because the whole buffer was copied several times before each denoising step: from the CUDA devices to the host, from the host to the OptiX device, from the OptiX device back to the host, and finally back to the CUDA devices.
This patch addresses that by recognizing when a logical OptiX device and a logical CUDA device represent the same physical GPU, and attempting to eliminate those copies when that is the case for all active devices (similar to what already happens when OptiX is used for both rendering and denoising). In addition, denoising is no longer performed only on the first available OptiX device. Instead, it tries to match the CUDA rendering devices and the OptiX denoising devices exactly where possible, to maximize utilization.
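
The device matching described above can be sketched roughly as follows. This is a simplified illustration, not the code in this patch: `LogicalDevice`, `Backend`, `physical_id` and `match_denoise_devices` are hypothetical names, and it assumes each logical device carries some identifier of the physical GPU (for example a PCI bus address) that can be compared across backends.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified device description (not the actual Cycles types).
enum class Backend { CUDA, OPTIX };

struct LogicalDevice {
  Backend backend;
  // Assumption: an identifier of the physical GPU, e.g. a PCI bus address.
  std::string physical_id;
};

// For every render device, find the index of an OptiX device that lives on
// the same physical GPU. Returns std::nullopt if any render device has no
// match, in which case a caller would keep the copy-through-host path.
std::optional<std::vector<std::size_t>> match_denoise_devices(
    const std::vector<LogicalDevice> &render_devices,
    const std::vector<LogicalDevice> &optix_devices)
{
  std::vector<std::size_t> matches;
  for (const LogicalDevice &render : render_devices) {
    bool found = false;
    for (std::size_t i = 0; i < optix_devices.size(); i++) {
      const LogicalDevice &candidate = optix_devices[i];
      if (candidate.backend == Backend::OPTIX &&
          candidate.physical_id == render.physical_id) {
        matches.push_back(i);
        found = true;
        break;
      }
    }
    if (!found) {
      return std::nullopt;
    }
  }
  return matches;
}
```

Requiring a match for every active device mirrors the "all active devices" condition above: if even one render device has no OptiX counterpart on the same GPU, the sketch reports no match at all rather than a partial one.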
This also fixes T75289 and T77593 (with the changes to session.cpp), as well as a race condition when denoising with multiple GPUs (since map_neighbor_tiles is not thread-safe).
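
For context, one common way to guard a callback that is not thread-safe is to serialize calls to it with a mutex. The sketch below only illustrates that general technique. `SerializedCallback` is a hypothetical helper, the `int` argument is a stand-in for whatever the real callback takes, and the patch itself may resolve the race differently.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical wrapper that serializes calls to a callback which is not
// safe to invoke from several device threads at once.
class SerializedCallback {
 public:
  explicit SerializedCallback(std::function<void(int)> callback)
      : callback_(std::move(callback))
  {
  }

  // Only one thread at a time can be inside the wrapped callback.
  void operator()(int device_index)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    callback_(device_index);
  }

 private:
  std::mutex mutex_;
  std::function<void(int)> callback_;
};
```

The trade-off is that the guarded section runs sequentially across devices, so it only makes sense when the callback is short relative to the denoising work itself.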