Page MenuHome

Metal: Realtime compositor enablement with addition of GPU Compute.
ClosedPublic

Authored by Jason Fielder (jason_apple) on Thu, Jan 12, 6:16 PM.

Details

Summary

This patch adds support for compilation and execution of GLSL compute shaders. This, along with a few systematic changes and fixes, enable realtime compositor functionality with the Metal backend on macOS. A number of GLSL source modifications have been made to add the required level of type explicitness, allowing all compilations to succeed.

GLSL Compute shader compilation follows a similar path to Vertex/Fragment translation, with added support for shader atomics, shared memory blocks and barriers.

Texture flags have also been updated to ensure correct read/write specification for textures used within the compositor pipeline. GPU command submission changes have also been made in the high level path, when Metal is used, to address command buffer time-outs caused by certain expensive compute shaders.

Authored by Apple: Michael Parkin-White

Ref T96261
Ref T99210

Diff Detail

Repository
rB Blender

Event Timeline

Jason Fielder (jason_apple) requested review of this revision.Thu, Jan 12, 6:16 PM
Jason Fielder (jason_apple) created this revision.

I have included everything needed in this patch for the features to work correctly and for all compilations to succeed, however, if it helps to split out some of the high level shader source modifications to a separate patch, that's no problem!

Just wanted to ensure everything was included here for context.

I'm also happy to discuss better ways of handling a number of the source transformations, particularly passing fixed-sized array references as function parameters.

Does this also help with hair/curves? (since that uses compute shaders as well iirc)

Does this also help with hair/curves? (since that uses compute shaders as well iirc)

Currently, the Metal backend still uses the Transform Feedback vertex-shader path for hair and curve processing. While this patch adds compute support, SSBO (Storage Buffer) support is still needed. So, currently, only compute shaders which output to images can be used.

As hair requires a storage buffer for outputting result, this path will still not yet be taken. Though this is something I'm actively working on and hope to get in soon. However, from a performance standpoint, there should not be a massive difference, beyond opportunities for compute work to be better pipelined.

Might be unrelated, just dropping the note here:

hello. i just checked on my mac pro 2013, mac os 12.6.2, on which the workaround stopped working some months ago and it crashes with blender 3.3.3-candidate and 3.4.1.
but not with 3.5.0-alpha, when the ui is switched to metal. :)

I'm not entirely sure how you determine if you need a compute command encoder or not. Is it just based on the latest draw/dispatch function invoked?

source/blender/draw/engines/compositor/compositor_engine.cc
257

Maybe a silly question but, can't we flush instead to also split the command buffer. Or does the time-out error affects all command buffers in queue?

source/blender/draw/engines/eevee_next/shaders/eevee_film_cryptomatte_post_comp.glsl
5–9

Would using a struct instead fix the issue of the reference syntax ?

struct CryptoMatteSamples { vec2 samples[CRYPTOMATTE_LEVELS_MAX] };
source/blender/gpu/metal/mtl_command_buffer.mm
515–517

Readability.

source/blender/gpu/metal/mtl_shader_generator.mm
409–411

I think it would be better to just add them to the shader interface instead. I can accept this workaround for now but it should be added as a TODO that this is a temporary workaround.

1292–1293

Why do you need to rescan if source was already scanned by GPUSource constructor?

1349–1350

Even if using a constexpr uint3? This feels like a really dirty workaround and it wont work for gl_WorkGroupSize[0].

source/blender/gpu/metal/mtl_shader_interface.mm
65–67

It does raise the question as to why not just add sampler_argument_buffer_bind_index_comp_? I don't think saving space inside MTLShaderInterface is critical.

Clément Foucault (fclem) requested changes to this revision.Mon, Jan 23, 10:46 AM
This revision now requires changes to proceed.Mon, Jan 23, 10:46 AM

Thanks for the review!
Will address the key points and submit a follow-up version of this.

source/blender/draw/engines/compositor/compositor_engine.cc
257

Certainly not a silly question, flushing would be ideal and is what I had originally tried and hoped would resolve the problem. Theoretically, this issue should not exist in this form as compute kernels have a very very long time out.

The caveat however comes from subsequent rendering work, which is dependent on the results of compute. The issue with how the dependency tracking works here is that certain parts of the render workload can begin, as they are not dependent on the compute work, but rendering of the results of the compositor is dependent.

As such, it is the stalled graphics work which is timing out (as graphics has quite a short timeout to promote app responsiveness). The problem with flushing the command buffer is that it still allows subsequent work to become queued up behind, and when the time delta of the CPU submission to GPU execution gets too large, the command buffer can be aborted. Especially if part of its work has entered the GPU job queue.

Stalling the pipe is certainly not ideal, however it may take some re-architecting to workaround this issue, if compositing execution continues to cause these timeouts.

Cycles live viewport rendering avoids this as the Cycles compute work is decoupled into a separate command queue and texture updates occur indirectly as results become available.

A similar approach could work for the compositor in future, though if kernels are optimized and performance is reasonable, then this is also a non issue. I've only been able to reproduce this with the blur kernel, due to the nested loop and significant number of texture samples causing ~0.5 FPS on the M1 Pro GPU.


The specific cause of the stall in this case also relates to specified Metal events which chain command buffers submitted by the Metal backend, to ensure execution of command buffers happens in order.

(See encodeWaitForEvent and encodeSignalEvent in mtl_command_buffer.mm and mtl_context.mm)

Without these, while the GPU will not stall, the viewport will flicker due to possible out-of-order resource updates. I.e. a pass may press on with the GPU work, but use a non-updated version of a texture. Dependency tracking between passes will only automatically synchronise a dependency if the update is already in flight.


Apologies for the long ramble, brain dump from me as well, but just wanting to shed some light on the execution model within the backend.

One other possible solution is to selectively disable the synchronisation between subsequent command buffers, such that the compute work executes in isolation, and the viewport will then display the old results if new ones are not yet ready. Though this will likely introduce flickering, and could also cause problems with the workload submission pipe if compositor work is decoupled from the render.

In this case, it's also worth mentioning that if the GPU_finish call was isolated to known expensive kernels, then stalling the CPU does not necessarily degrade performance all that much, as the system is already heavily GPU bound in these scenarios anyway.

source/blender/draw/engines/eevee_next/shaders/eevee_film_cryptomatte_post_comp.glsl
5–9

yes, using struct syntax would allow the conventional out type name syntax replacement to work. Pass by reference detection and replacement for array types is a little more expensive and increases the cost of GLSL -> MSL translation.

I can potentially add this, though the existing replacement is able to just perform character replacement, which is cheap. Given the syntax requirements for pass by reference of arrays, it may be a little more complex.

replace_outvars in mtl_shader_generator handles this and https://developer.blender.org/D17033 optimizes a number of these functions, so the overhead may not be as significant with the optimizations to the original replacement function.

Could possibly get the replacement code to replace the following:

out vec2 samples[CRYPTOMATTE_LEVELS_MAX]

with:

#define OUT(type, name, array) thread type(&name)[array]
OUT(vec2,samples,CRYPTOMATTE_LEVELS_MAX)

or

THD(vec2,samples, CRYPTOMATTE_LEVELS_MAX)

That should still allow for character replacement (without resizing the string), and should be fairly straightforward to implement. So long as there is no risk of Macro reuse.

source/blender/gpu/metal/mtl_shader_generator.mm
409–411

Sounds good, will add the TODO for now, and look into moving these.

1292–1293

Good point, may have had this in for mistaken legacy reasons during port to GPUShaderCreateInfo. Recall a few cases where a shader would use a flag, but the bits were not set. Though this may have been specific to gl_FragDepth, and previously missing depth_write declarations. Will remove.

1349–1350

Can experiment with constexpr, there were some frustrating limitations and errors from the compiler which made this awkward to work with. I.e. things that should be considered static at compile time not being treated as such.

Though will look for a more suitable implementation.

source/blender/gpu/metal/mtl_shader_interface.mm
65–67

Yeah, likely best to just use an extra parameter. The original thinking was to avoid diverging the functions which fetch the offset, but likely introducing more confusion than solving.

This revision is now accepted and ready to land.Mon, Jan 30, 10:59 AM