
Cycles: Adapt shared kernel/device/gpu layer for MSL
Closed, Public

Authored by Michael Jones (michael_jones) on Nov 4 2021, 2:01 PM.

Details

Summary

This patch adapts the shared kernel entrypoints so that they can be compiled as MSL (Metal Shading Language). Where possible, the adaptations avoid changes in common code.

In MSL, kernel function inputs are explicitly bound to resources. In the case of argument buffers, we declare a struct containing the kernel arguments, accessible via device pointer. This differs from CUDA and HIP, where kernel function arguments are declared as traditional C-style function parameters. This patch adapts the entrypoints declared in kernel.h so that they can be translated via a new ccl_gpu_kernel_signature macro into the required parameter struct + kernel entrypoint pairing for MSL.
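
For illustration, here is a minimal sketch of the struct + entrypoint pairing such a macro could emit for a simple one-buffer kernel. The names and binding indices below are hypothetical, not the exact Cycles expansion:

/* Hypothetical argument struct for one entrypoint (sketch only). */
struct kernel_gpu_example_args {
  device int *counter;
};

/* The matching MSL entrypoint, bound via an argument buffer. */
kernel void kernel_gpu_example(
    device const kernel_gpu_example_args &args [[buffer(0)]],
    const uint thread_index [[thread_position_in_grid]])
{
  /* Forward to the shared implementation using args.counter. */
}

On CUDA and HIP, the same ccl_gpu_kernel_signature usage can instead expand to ordinary C-style function parameters, leaving the shared code untouched.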

MSL buffer attribution must be applied to function parameters or non-static class data members. To allow universal access to the integrator state, kernel data, and texture fetch adapters, we wrap all of the shared kernel code in a MetalKernelContext class. This is achieved by bracketing the appropriate kernel headers with "context_begin.h" and "context_end.h" on Metal. When calling deeper into the kernel code, we must reference the context class (e.g. context.integrator_init_from_camera). This extra prefixing is performed by a set of defines in "context_end.h". These will require explicit maintenance if entrypoints change. We invite discussion on more maintainable ways to enforce correctness.
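
Sketched below is the general shape of this bracketing. It is a simplified illustration of the pattern, not the verbatim header contents:

/* context_begin.h (simplified sketch) */
class MetalKernelContext {
 public:
  constant KernelParamsMetal &launch_params_metal;
  MetalKernelContext(constant KernelParamsMetal &params)
      : launch_params_metal(params) {}
  /* The shared kernel headers are #included here, so their functions
     become members with access to the state, data and texture adapters. */

/* context_end.h (simplified sketch) */
};
/* Redirect unqualified calls through the context instance. */
#define integrator_init_from_camera context.integrator_init_from_camera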

Lambda expressions are not supported in MSL, so a new ccl_gpu_kernel_lambda macro generates an inline function object, optionally capturing any required state; this yields the same behaviour. This approach is applied to all parallel_... implementations that are templated by operation. The lambda expressions in the film_convert... kernels don't adapt cleanly to function objects; however, these entrypoints can be macro-generated more concisely to avoid lambda expressions entirely, relying instead on constant folding to handle the pixel/channel conversions.
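
As a minimal sketch of the equivalence (using a hypothetical queued_kernel() accessor), a lambda such as:

auto filter = [kernel_index](const int state) {
  return queued_kernel(state) == kernel_index;
};

can be macro-generated as a function object whose captured state becomes a data member:

struct KernelLambdaFilter {
  int kernel_index; /* captured by value */
  bool operator()(const int state) const {
    return queued_kernel(state) == kernel_index;
  }
};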

A separate implementation of gpu_parallel_active_index_array is provided for Metal to work around some subtle differences in SIMD width, and to encapsulate some required thread parameters which must be declared as explicit entrypoint function parameters.

Ref T92212

Diff Detail

Repository
rB Blender
Branch
master
Build Status
Buildable 18537
Build 18537: arc lint + arc unit

Event Timeline

Michael Jones (michael_jones) requested review of this revision. Nov 4 2021, 2:01 PM
Michael Jones (michael_jones) created this revision.
Brecht Van Lommel (brecht) requested changes to this revision. Nov 5 2021, 8:02 PM

Looks generally fine.

This extra prefixing is performed by a set of defines in "context_end.h". These will require explicit maintenance if entrypoints change. We invite discussion on more maintainable ways to enforce correctness.

An additional macro could be added to wrap any function calls, like:

ccl_gpu_kernel_call(integrator_init_from_camera)(nullptr, state, tile, render_buffer, x, y, sample);

It's not particularly pretty, but this kernel.h file is already full of macros, so might as well have another.
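
One possible shape for such a macro (a sketch, not necessarily the final form):

#ifdef __KERNEL_METAL__
#  define ccl_gpu_kernel_call(x) context.x
#else
#  define ccl_gpu_kernel_call(x) x
#endif

On CUDA and HIP it compiles away entirely, so only Metal pays for the prefixing.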

intern/cycles/kernel/device/metal/context_begin.h
24

I'm curious how well the compiler can optimize access to this.

With CUDA we can have constant memory in program scope, which I think means that when accessing something like kernel_data.cam.aperturesize, that's a lookup at a fixed address without any indirection.

If this code was compiled by a typical CPU compiler, I would expect there to be two indirections, both for this and for launch_params_metal. If it's doing that on the GPU, it would be quite bad for performance.

Is the compiler inlining everything and optimizing this somehow, or are we really paying the price of two indirections?

If it helps, this could also be passed as KernelGlobals, with this kind of typedef and replacing nullptr in a bunch of places in kernel.h.

typedef const KernelParamsMetal& KernelGlobals;
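
For illustration (a sketch only), a call site in kernel.h would then pass the launch params rather than nullptr:

ccl_gpu_kernel_call(integrator_init_from_camera)(
    launch_params_metal, state, tile, render_buffer, x, y, sample);
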
intern/cycles/kernel/device/metal/globals.h
30

What does NOT_PRECOMPILED mean?

intern/cycles/kernel/device/metal/parallel_active_index.h
67 (On Diff #44362)

Could this implementation use matching variables names and ccl_gpu_ abstractions, even if it's metal specific?

From a maintenance point of view I think it's easier if this matches the generic parallel_active_index.h as closely as possible, so that it's easier to make changes in both and see exactly how they differ.

By that I mean simd_lane -> warp, simd_ballot -> ccl_gpu_ballot, threadgroup_barrier -> ccl_gpu_syncthreads, etc.
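
For instance, the Metal side could map the shared abstractions onto its intrinsics roughly like this (a sketch; the exact signatures in Cycles may differ):

/* Metal definitions of the shared GPU abstractions (illustrative). */
#define ccl_gpu_ballot(predicate) ((uint64_t)simd_ballot(predicate))
#define ccl_gpu_syncthreads() threadgroup_barrier(mem_flags::mem_threadgroup)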

This revision now requires changes to proceed. Nov 5 2021, 8:02 PM
  • Add ccl_gpu_kernel_call macro to redirect into Metal context class
  • Remove redundant NOT_PRECOMPILED markup
  • Adapt existing gpu/parallel_active_index.h to work with MSL
Michael Jones (michael_jones) marked an inline comment as done. Nov 8 2021, 2:03 PM

I have adapted the existing "gpu/parallel_active_index.h" to work with MSL. The two main differences:

  1. Metal thread params need to be injected from the entrypoint functions, hence the helper struct for MSL
  2. A new macro ccl_gpu_thread_mask has been added to distinguish between SIMD widths, i.e. 32 on CUDA and HIP, and 64 on Apple GPUs (see the sketch below)
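
A minimal sketch of how such a width-dependent mask could be formed (the in-tree definition may differ; thread_warp is a lane index, so it is always below the SIMD width):

#ifdef __KERNEL_METAL__
/* 64-wide SIMD groups on Apple GPUs. */
#  define ccl_gpu_thread_mask(thread_warp) uint64_t((1ull << thread_warp) - 1)
#else
/* 32-wide warps on CUDA and HIP. */
#  define ccl_gpu_thread_mask(thread_warp) uint32_t((1u << thread_warp) - 1)
#endif
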
intern/cycles/kernel/device/metal/context_begin.h
24

Metal doesn't currently support program-scope bindings; however, there are a couple of possibilities for optimising this that we're actively exploring, and dependency injection using something like KernelGlobals is one of them. Either way, it will require more extensive changes to common code, so we'd like to follow up in a later optimisation pass after more detailed analysis.

intern/cycles/kernel/device/metal/globals.h
30

This is detritus from an older revision - removed!

Looks good, I tested that CUDA and OptiX still pass tests.

This revision is now accepted and ready to land. Nov 9 2021, 8:02 PM