RFC: Halide for multi-purpose image processing in Blender
Needs Review · Public

Authored by Omar Emara (OmarSquircleArt) on Oct 28 2021, 1:29 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This is not an actual patch and the code should merely be considered
supplementary to the RFC for demonstrative and investigative purposes.

Motivation

As part of the Realtime Viewport Compositor project (T91293), it is required to
rewrite most of the compositor operations in GLSL (T91296) for utilization by
the realtime compositor engine. Such a large undertaking begs the question of
whether it is necessary and whether an alternative solution exists. Furthermore,
similar requirements appear in various projects that would also require writing
different variants of the same compositor operations. Namely, the compositor
improvement plan (T74491) considers SIMD vectorization of operations and the
possibility of having GPU kernels for all operations. Other modules also share
similar requirements (T65720). Presented solutions (T81650) for consolidating
and unifying all of those requirements usually involve manual techniques such as
the ones utilized by Cycles' kernel implementation.

This RFC proposes and evaluates Halide as a possible alternative.

Halide

Halide (https://halide-lang.org/) is a framework for writing highly efficient
and portable image processing routines. It can target many ISAs and compute APIs
including x86, ARM, OpenCL, OpenGL, and more. Notably, it decouples the image
processing algorithm from its schedule, meaning that the distribution of work
over the domain of the image is independently defined from the actual
computations performed on the pixels. This can be seen in the code below.
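
For illustration, here is a minimal sketch of what such a generator can look
like, with the algorithm and the schedule defined separately. It is written
against the Halide generator API, but it is not the exact Gamma generator from
the supplementary code:

#include "Halide.h"

using namespace Halide;

class GammaGenerator : public Generator<GammaGenerator> {
 public:
  Input<Buffer<float>> input{"input", 3};
  Input<Buffer<float>> gamma{"gamma", 2};
  Output<Buffer<float>> output{"output", 3};

  Var x{"x"}, y{"y"}, c{"c"};

  void generate()
  {
    /* The algorithm: what is computed for each output element. */
    output(x, y, c) = pow(input(x, y, c), gamma(x, y));
  }

  void schedule()
  {
    /* The schedule: how the computation is distributed over the domain,
     * defined independently from the algorithm above. */
    output.parallel(y).vectorize(x, natural_vector_size(output.type()));
  }
};

HALIDE_REGISTER_GENERATOR(GammaGenerator, gamma)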

A Halide generator is an object that defines an image processing operation, its
schedule, its inputs and outputs, and how it can be compiled. The generator can
then be compiled, producing a static library containing a routine that executes
the generator given the inputs and outputs, as well as a header file that
declares that routine. For instance, the Gamma generator produces the following
header:

int gamma(struct halide_buffer_t *input_buffer,
          struct halide_buffer_t *gamma_buffer,
          struct halide_buffer_t *output_buffer);

That's the only entry function that needs to be called from user code; the
user just constructs the right buffers and calls this function.

Each of the following sections considers Halide from the perspective of a
certain application.

Halide For The Compositor

Vectorization

One of the shortcomings of the current compositor implementation is its
underutilization of SIMD vectorization. One could write multiple implementations
and choose the right one at runtime through a CPU dispatcher, but this is too
much work and unnecessary in most cases. Alternatively, one can use SIMD-backed
vector types for doing computations. Such vectorization can happen pixel-wise,
possibly to ease consolidation with GPU kernels. However, this has the
disadvantage of using narrower vector lanes in some cases even when the
hardware supports wider vectors, which is a problem we faced when vectorizing
CPU kernels in Cycles. The other approach is to vectorize along the minor axis
of the image, using vectors that are as wide as possible, which can complicate
the code due to its many edge cases and the interleaved structure of the pixels.
So there is no unified and easy way to implement vectorization without a lot of
work and variation.

Halide on the other hand makes vectorization very easy. When compiling the
generator, one can specify multiple targets as follows:

add_halide_library(gamma
  FROM generatorsExecutable
  TARGETS x86-64-linux-avx2 x86-64-linux-avx x86-64-linux-sse41 x86-64-linux
)

Halide would then compile a variant for each target and link it with a wrapper
that dispatches the most suitable variant at runtime based on the CPU features.
Nothing changes as far as the user is concerned. In terms of defining the
vectorization, the Gamma generator is scheduled as follows:

const int vector_size = natural_vector_size(output.type());
if (vector_size == 4) {
  /* 4-wide vectors map directly onto the four color channels. */
  output.vectorize(c, vector_size);
}
else {
  /* Wider vectors: vectorize along the minor axis and unroll the channels. */
  output.vectorize(x, vector_size);
  output.unroll(c);
}

If the largest hardware vector size is 4, we vectorize pixel-wise along the
color channels to avoid all the issues that come with vectorization along the
minor axis. However, if the hardware has a vector size wider than 4, we exploit
that by vectorizing along the minor axis of the schedule and unrolling the color
channel loop. So, as demonstrated, vectorization in Halide is very easy and
effortlessly specializable.

Parallelization

As you might have noticed, we also utilize threaded parallel execution by
parallelizing the major axis as follows:

output.parallel(y);

So the compositor engine needn't parallelize execution directly. It can simply
delegate parallelism to the generator, which parallelizes in whichever way it
sees fit, thus simplifying the engine and improving modularity.

Calling The Generator

The generator is called by first wrapping the MemoryBuffer as a Halide buffer.
For a full frame buffer, the wrapping happens as follows:

HalideBuffer as_halide_buffer()
{
  switch (datatype_) {
    case DataType::Value:
      /* Single channel buffers map to a plain 2D buffer. */
      return HalideBuffer(buffer_, get_width(), get_height());
    case DataType::Vector:
    case DataType::Color:
      /* Multi-channel buffers have interleaved pixels, so wrap accordingly. */
      return HalideBuffer::make_interleaved(buffer_, get_width(), get_height(), num_channels_);
  }
}

And called as follows:

halide::gamma(input_color->as_halide_buffer(),
              input_gamma->as_halide_buffer(),
              output->as_halide_buffer());

The supplementary code inflates all inputs for simplicity, but an actual
implementation will need to handle the case for a single element, either using
a compile-time specialization, an appropriate boundary condition on a singular
buffer, or some other method.
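
To sketch one of those options, a compile-time specialization could look
something like the following, reusing the structure of the earlier Gamma
sketch; the GeneratorParam and its name are hypothetical, not part of the
supplementary code:

class GammaGenerator : public Generator<GammaGenerator> {
 public:
  /* Hypothetical compile-time switch for a single element gamma input. */
  GeneratorParam<bool> gamma_is_single{"gamma_is_single", false};

  Input<Buffer<float>> input{"input", 3};
  Input<Buffer<float>> gamma{"gamma", 2};
  Output<Buffer<float>> output{"output", 3};

  Var x{"x"}, y{"y"}, c{"c"};

  void generate()
  {
    /* Resolved at generation time, so each compiled variant reads either the
     * single element or the full buffer, without a runtime branch. */
    Expr exponent = gamma_is_single ? gamma(0, 0) : gamma(x, y);
    output(x, y, c) = pow(input(x, y, c), exponent);
  }
};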

GPU Acceleration

Another shortcoming of the current compositor is that it is mostly CPU-bound
aside from some OpenCL-accelerated operations. And even those accelerated
operations are mostly unused because one has to pay for the expensive memory
round trip between host and GPU memory. Consequently, all operations need to be
GPU accelerated for an apparent benefit. As mentioned before, to avoid
rewriting all operations from scratch, an implementation can be consolidated
with the existing implementation, similar to how Cycles' kernels are defined.
Cycles employs a number of preprocessor tricks to ensure compatibility with
different targets. While this approach arguably works for Cycles, it may not
necessarily work for the case in question, perhaps due to GLSL being one of the
desired targets. Moreover, an observer of the current state of Cycles would
identify this approach as something that should be avoided if possible.

Halide can compile generators to a number of compute API targets including
OpenCL, OpenGL, CUDA, Metal, and DirectX. It also provides the necessary
runtime for most of them. So offloading work to the GPU can easily be done by
calling the exact same function mentioned before, albeit with some care for the
state of the input buffers. The next section considers the OpenGL target, but
the same could be said about the rest of the targets thanks to the unified
interface provided by Halide.

Halide For The Viewport Compositor

The Realtime Viewport Compositor project (T91293) requires the reimplementation
of all compositor operations in GLSL to be utilized by the OpenGL-backed
compositor engine, which is yet to be implemented. The plan is to dynamically
construct a number of fragment shaders that call the GLSL implementations of
the compositor operations, similar to how shader node trees are implemented. Of
course, dynamic shaders are only concerned with pixel-wise operations, and all
other operations would have to be implemented and invoked independently,
possibly invoking multiple shaders per operation.

The alternative is to let Halide generate the shaders and run them, which is
described in the following sections.

Halide OpenGL Backend

Halide can compile the generator into a GLSL compute shader and provide the
necessary runtime to invoke it. This is readily done by just targeting the
OpenGLCompute backend:

add_halide_library(gamma
  FROM generatorsExecutable
  TARGETS host-openglcompute
)

GPU targets will almost always need to be scheduled differently though, and
that's why we schedule the generator on the GPU as follows:

/* Vectorize along the color channels to utilize the GPU's SIMD units. */
output.vectorize(c, natural_vector_size(output.type()));
/* Split the domain into 16x16 workgroups of local invocations. */
output.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::RoundUp);

We first vectorize along the color channels to utilize the SIMD units of the
GPU, then we split the work domain into workgroups and local invocations that
map onto the corresponding OpenGL constructs.

As mentioned before, calling the shader is identical to how you would call the
CPU variant. The only difference is that the input buffers wrap an OpenGL
object and are assigned an appropriate device interface, so the buffers would
be constructed as follows:

#include <HalideRuntimeOpenGLCompute.h>

using HalideBuffer = Halide::Runtime::Buffer<float>;

/* Describe only the shape; the null pointer means no host allocation is made. */
HalideBuffer buffer = HalideBuffer::make_interleaved(nullptr, width, height, channelsCount);
/* Wrap an existing OpenGL buffer object as the device allocation. */
buffer.device_wrap_native(halide_openglcompute_device_interface(), name);

Where name is the name of the OpenGL buffer containing the data. Halide
automatically inserts the necessary memory barriers, so invocations can be
chained in order to achieve the function of the compositor.
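
As a hedged sketch of such chaining, assuming a second hypothetical generator
named invert compiled the same way, an intermediate buffer simply feeds one
invocation into the next; all the buffer names here are illustrative:

/* temp_gl_buffer is a hypothetical OpenGL buffer holding the intermediate. */
HalideBuffer temp = HalideBuffer::make_interleaved(nullptr, width, height, channelsCount);
temp.device_wrap_native(halide_openglcompute_device_interface(), temp_gl_buffer);

halide::gamma(input_buffer, gamma_buffer, temp);
/* Halide inserts a memory barrier between the two dispatches. */
halide::invert(temp, output_buffer);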

The OpenGL backend is one of the more limited backends in Halide, but for the
purpose of standard image processing, those limitations don't really matter. For
the last couple of days, I have been submitting patches to Halide in order to
fix, optimize, and relax limitations within the OpenGL backend, so there is
still room for improvements, and most importantly, upstream contributions.

Multi Stage Generators

As mentioned before, some image processing operations require two or more
stages in order to be implemented correctly or efficiently. By a stage, we mean
a shader invocation with coherent memory access to the buffers. And this is
especially hard to get right when implemented manually. Halide, on the other
hand, supports multi-stage generators transparently, so one could write a
generator that compiles to two or more shaders. Moreover, Halide can make smart
decisions about the necessary allocations across invocations, producing more
efficient code than a standard scheduling.
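
To illustrate, here is a minimal sketch of a two-stage generator, assuming a
separable 3-tap box blur and following the conventions of the earlier sketches;
scheduling the first stage at root is what splits the pipeline into two shader
invocations:

class BlurGenerator : public Generator<BlurGenerator> {
 public:
  Input<Buffer<float>> input{"input", 2};
  Output<Buffer<float>> output{"output", 2};

  Var x{"x"}, y{"y"};
  Func blur_x{"blur_x"};

  void generate()
  {
    Func clamped = BoundaryConditions::repeat_edge(input);
    /* First stage: horizontal blur. */
    blur_x(x, y) = (clamped(x - 1, y) + clamped(x, y) + clamped(x + 1, y)) / 3.0f;
    /* Second stage: vertical blur over the first stage's result. */
    output(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3.0f;
  }

  void schedule()
  {
    Var xo{"xo"}, yo{"yo"}, xi{"xi"}, yi{"yi"};
    /* compute_root materializes blur_x into its own buffer, making each
     * stage a separate coherent pass (a separate shader on GPU targets). */
    blur_x.compute_root().gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    output.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
  }
};

HALIDE_REGISTER_GENERATOR(BlurGenerator, blur)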

Halide As A Dependency

Halide only requires LLVM and a C++14-capable compiler, both of which are
already dependencies of Blender, so only libHalide would be gained as a
compile-time dependency. It is important to note that Blender wouldn't have a
run-time dependency on libHalide: the libraries produced by Halide are not
linked against libHalide, and all the necessary runtime is baked into the
produced libraries and the header files included in the user code.

In terms of maintainability, Halide is largely developed by Google and Adobe.
Adobe uses Halide in the development of software like Photoshop, and Google
uses it for image processing tasks in its phone lineup. So it is safe to
assume Halide will be maintained in the future, and it would count as a solid
dependency.

Halide In Development

As demonstrated before, Halide generators have a simple interface and very few
requirements. Consequently, for existing code, operations can be reimplemented
as Halide generators incrementally without affecting other operations. And more
importantly, if one decides that an operation is better implemented manually,
that can be accommodated just by matching the interface. For instance, one might
realize that the Halide FFT implementation on the CPU is less efficient than the
implementation provided in the FFTW3 library, in which case, it would make sense
to use the FFTW3 library instead, and one would be able to do that readily even
if all other operations are implemented as Halide generators.

Halide For Developers

Writing Halide generators can often be unfamiliar to new developers, especially
the schedule part, which is something that should be considered. Realistically,
however, the schedule will almost always be the same for the same class of
operations. For instance, all pixel-wise operations will likely use the same
schedule, so a new developer writing a new pixel-wise operation will not have to
interact with the scheduling part of the generator. So I suspect this will not
be much of a problem with sufficient abstraction.
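
As a sketch of the kind of abstraction meant here, a shared schedule could be
factored into a helper so that a new pixel-wise operation only defines its
algorithm; both the helper and the Invert operation below are hypothetical:

/* Hypothetical shared schedule for all pixel-wise operations. */
template<typename OutputT>
void schedule_pixelwise(OutputT &output, Var x, Var y, Var c, int vector_size)
{
  output.parallel(y).vectorize(x, vector_size).unroll(c);
}

class InvertGenerator : public Generator<InvertGenerator> {
 public:
  Input<Buffer<float>> input{"input", 3};
  Output<Buffer<float>> output{"output", 3};

  Var x{"x"}, y{"y"}, c{"c"};

  void generate()
  {
    /* The only part a new developer has to write. */
    output(x, y, c) = 1.0f - input(x, y, c);
  }

  void schedule()
  {
    schedule_pixelwise(output, x, y, c, natural_vector_size(output.type()));
  }
};

HALIDE_REGISTER_GENERATOR(InvertGenerator, invert)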

Halide For General Processing

Though perhaps not apparent or widely advertised, Halide can also be used in
applications other than image processing. For instance, it can be used in mesh
processing or noise evaluation to produce trivially vectorized and threaded
computations, and perhaps even offload to accelerators if that proves
advantageous. But this is merely a plus and should not be taken as a goal of
the RFC.
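
For instance, a hedged sketch of a non-image use, evaluating a hash-based
stand-in for a noise function over a flat array of points (purely
illustrative, again following the conventions of the earlier sketches):

class NoiseGenerator : public Generator<NoiseGenerator> {
 public:
  Input<Buffer<float>> positions{"positions", 1};
  Output<Buffer<float>> values{"values", 1};

  Var i{"i"};

  void generate()
  {
    /* Stand-in for an actual noise function. */
    Expr n = sin(positions(i) * 12.9898f) * 43758.5453f;
    values(i) = n - floor(n);
  }

  void schedule()
  {
    Var io{"io"}, ii{"ii"};
    /* Threads over chunks, SIMD within each chunk. */
    values.split(i, io, ii, 256).parallel(io).vectorize(ii, natural_vector_size(values.type()));
  }
};

HALIDE_REGISTER_GENERATOR(NoiseGenerator, noise)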

Conclusion

There is a growing demand for a portable, efficient, and multi-purpose image
processing framework in Blender. Halide is an all-in-one solution to those
requirements, and I think it is worth serious consideration. An integration in
Blender would improve existing functionality, enable new functionality, and
improve the general maintainability of the relevant modules in Blender.

Diff Detail

Repository
rB Blender
Branch
halide-rfc (branched from master)
Build Status
Buildable 18288
Build 18288: arc lint + arc unit

Event Timeline

Omar Emara (OmarSquircleArt) requested review of this revision.Oct 28 2021, 1:29 PM
Omar Emara (OmarSquircleArt) created this revision.

I've played with this a while ago; these opinions are based on its state 2-ish years ago. Some things may have improved, then again they also may not have.

pros:

  • It's a mature project with active maintenance
  • Being based on LLVM, Halide generates very good code with excellent runtime performance
  • Can build kernels at compile time or runtime
  • Supports many architectures (CPU/GPU) from a single source and can decide at runtime which kernel to launch (it is like Cycles in this regard: it builds SSE2/4/AVX/etc. kernels and picks the best at runtime)
  • Python bindings, which may allow plugins to add new operations to the compositor with great performance
  • Supports all platforms we target, and even the platforms we don't target but distros downstream from us do (i.e. MIPS/PowerPC/RISC)
  • There are really not many alternatives to Halide; Intel's ISPC comes to mind, but it comes nowhere close to the broad architecture support Halide has

cons:

  • Being based on LLVM, Halide generates very good code, but it "takes a bit" to do so; doing this at runtime to take advantage of its fusion stages is unlikely to have satisfactory performance
  • Essentially binary object code output only; it can output C++, but even that doesn't seem to be meant for "human consumption"
  • Poor debugging story; when it does something you don't expect, it's hard to figure out why
  • Will re-order floating point operations even when you ask it not to (#3813); I haven't checked recently, but I reported it in 2019 and it seems to still be open, so I assume it still has this behavior
  • Scheduling has a fair bit of "black magic" to it; a schedule that works for one architecture may not work quite as well for another, and it takes a fair bit of experience to get the hang of it
  • It's a bit of a monster to build from source, and large too; the shared version was around ~40 MB if I remember correctly

Elephant in the room:

  • What do we do when Halide is unavailable: disable the compositor, or do we have a fallback implementation for these operations?

Halide makes me uneasy; it's great when it works, but when it doesn't, oh boy....

Also, some performance numbers on the operation you ported may be nice; performance is Halide's number one selling point, so leaving it out of this diff seems... strange...

Being based on LLVM, Halide generates very good code, but it "takes a bit" to do so; doing this at runtime to take advantage of its fusion stages is unlikely to have satisfactory performance.

We will be ahead-of-time compiling all generators, so JIT and runtime compilation performance probably shouldn't be an issue.

Essentially binary object code output only; it can output C++, but even that doesn't seem to be meant for "human consumption".

If you are after debugging, the Stmt output seems reasonably readable. But yes, debugging auto-generated code is definitely not the best thing one can do. Hopefully it won't get to that.

Will re-order floating point operations even when you ask it not to (#3813); I haven't checked recently, but I reported it in 2019 and it seems to still be open, so I assume it still has this behavior.

I will look into that.

Scheduling has a fair bit of "black magic" to it; a schedule that works for one architecture may not work quite as well for another, and it takes a fair bit of experience to get the hang of it.

Right. That's what I mentioned in "Halide For Developers". The reason why I think this isn't a big issue is the fact that a schedule will be shared by many operations of the same class. So spending a good amount of time finding the best schedule for all pixel-wise operations won't be unreasonable. Moreover, it is likely that a trivial schedule will be analogous to the existing "loops" we have, and any further optimizations of the schedule will be a plus.

It's a bit of a monster to build from source, and large too; the shared version was around ~40 MB if I remember correctly.

The shared version is around 28 MB, but that is only a compile-time dependency, so an extra 28 MB on a developer's system will not be an issue.

What do we do when Halide is unavailable: disable the compositor, or do we have a fallback implementation for these operations?

That's what I talked about in "Halide In Development": if a manual implementation proves better, we just use the manual implementation. Just like in this supplementary code, where only one operation is implemented as a Halide generator and the rest are manually implemented.

Also, some performance numbers on the operation you ported may be nice; performance is Halide's number one selling point, so leaving it out of this diff seems... strange...

I probably don't agree that performance is Halide's number one selling point; that would be portability and ease of development for me. We can always optimize things to get close to Halide's performance. However, what gets to me is writing the same code many times and investing in optimizations that may otherwise be easy through an alternative method. And that's what I wanted to show in this diff.
I will try to provide some numbers later.

I'm very curious to see what the resulting GLSL code looks like. Debugging a big node tree without any info on which node the code came from sure is going to be fun! Also, if digesting the node tree is slow, then it might be an issue with animated parameters or user interactivity. The current way of dealing with updates is just to rebuild the whole tree.

Ah, and it seems that the GL backend of Halide is compute-based only. That would rule out any system that does not support compute. We might get to the point in the near future (maybe 3.1) where we require compute shader capability (we are waiting for a Metal backend to lift the GL 3.3 limitation we have on macOS). So this is something to keep in mind.

I do agree that scheduling and automatic multi-pass handling are very attractive.

I think the OpenGL support is for compute purposes only. From a compositor node tree, you wouldn't care all that much where the computation happens; you just call a method, and when it returns, the result sits in your buffers. Whether it used CPU/CUDA/DX12 or OpenGL in the backend is mostly abstracted away from you.

I'm very curious to see what the resulting GLSL code looks like.

I will compile and attach an example code in a moment.

Debugging a big node tree without any info on which node the code came from sure is going to be fun! Also, if digesting the node tree is slow, then it might be an issue with animated parameters or user interactivity. The current way of dealing with updates is just to rebuild the whole tree.

To clarify, we will not be compiling the whole node tree as a unit. I envision the system as follows:

  1. Each compositing operation/node is compiled into its own generator. Halide merely gives us a function that takes the input buffers and parameters and writes to the output buffers, performing the function of that operation.
  2. The compositor engine invokes a compiler that compiles the node tree into a simple IR and, most importantly, decides on the necessary intermediate buffer allocations. So it is essentially analogous to a register allocation compiler optimization problem.
  3. The compositor engine invokes an interpreter that actually performs the necessary buffer allocations and consumes the produced IR, invoking the Halide generators in the process. Each IR statement will be either an allocation or an invocation of a Halide generator (see the sketch after this list).
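
A minimal sketch of what such an IR could look like; all the names here are hypothetical:

#include <variant>
#include <vector>

using BufferID = int;

struct Allocate {
  BufferID buffer;
  int width, height, channels;
};

struct Invoke {
  int generator;                /* Index into a table of generator entry points. */
  std::vector<BufferID> inputs; /* Buffers consumed by the generator. */
  BufferID output;              /* Buffer written by the generator. */
};

/* A compiled node tree is a flat list of allocations and invocations. */
using Statement = std::variant<Allocate, Invoke>;
using Program = std::vector<Statement>;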

So in terms of debugging, a failed node tree can either fail due to badly produced IR, which should be easy to debug, or due to a malformed generator, which can easily be narrowed down and debugged independently.

In terms of interactivity, no code generation happens at runtime and no rebuilding of the node tree is necessary. This superior interactivity is one of the advantages of this approach. The possible disadvantage, however, is that if a large chain of single-stage operations exists sequentially in the node tree, they will be scheduled as separate shader invocations, while in a system of dynamic shader generation, they would be consolidated into a single shader and invoked once. I looked into the cost of invoking multiple shaders sequentially as opposed to merging their shaders and invoking the result once, and I didn't find significant slowdowns in sequential invocations, so I suspect this will not be much of a problem. Regardless, if we identify better performance in shader consolidation, we can do this at a later stage as part of the node tree compilation step.

Ah, and it seems that the GL backend of Halide is compute-based only.

Halide actually had a "fragment shader" OpenGL backend, but it was removed earlier this year due to various limitations and problems with it. So yes, we are left with compute shaders only.

That would rule out any system that does not support compute. We might get to the point in the near future (maybe 3.1) where we require compute shader capability (we are waiting for a Metal backend to lift the GL 3.3 limitation we have on macOS).

I saw some compute shaders in the source, and the OpenSubdiv GPU evaluator is compute-based, so I assumed Blender is going in that direction anyway.
I am currently working on reducing the version requirements of the OpenGL backend in Halide and also investigating an Image Load/Store runtime to further widen support.

I looked into the cost of invoking multiple shaders sequentially as opposed to merging their shaders and invoking the result once, and I didn't find significant slowdowns in sequential invocations, so I suspect this will not be much of a problem.

I highly doubt this is true (but I would be happy to be proven wrong). Having separate shader passes means more round trips to GPU memory, which is a major bottleneck in many situations.

So in terms of debugging, a failed node tree can either fail due to badly produced IR, which should be easy to debug, or due to a malformed generator, which can easily be narrowed down and debugged independently.

I was more thinking of debugging specific GL driver/implementation bugs. For these, having access to readable GLSL is key, at least for me. Sometimes I have access to computers where I can just drop Blender and RenderDoc and debug things. Having the separated stages might help, though.

About shader management, is everything abstracted by Halide? Do we just do Halide::compute_this(node, image_buffer_in, image_buffer_out)?

I highly doubt this is true (but I would be happy to be proven wrong).

I will make a test program. But what data are we talking about in "round trips to GPU memory"? The buffers are already in GPU memory, so isn't the only thing going from host to GPU small things like uniforms?

For these, having access to readable GLSL is key.

Yeah, GLSL is somewhat readable. I will attach an example later.

About shader management, is everything abstracted by Halide? Do we just do Halide::compute_this(node, image_buffer_in, image_buffer_out)?

Yes. I will demonstrate that in the test program.

But what data are we talking about in "round trips to GPU memory"? The buffers are already in GPU memory, so isn't the only thing going from host to GPU small things like uniforms?

Doing texture or buffer read/write operations is one of the major bottlenecks on modern GPUs, caused by the latency and throughput of GDDR memory. GPUs have various ways of reducing the cost. Having on-core caches is one of them (just like a CPU does), and hoping for coherent memory accesses by "neighbor" threads so the cache can be put to good use.
Doing multiple passes means having to move the data back and forth between GDDR and VGPR or SGPR registers multiple times to process simple operations. This has roughly the same efficiency as rendering a lot of fullscreen triangles with transparency.
This might feel OK for a handful of them, but you will be able to tell the difference already at 30 passes or so. Also, this will not scale well with resolution.

Also consider that the compositor runs after the whole scene has finished drawing, so any inefficiency will add up on top of the scene complexity.

Doing multiple passes means having to move the data back and forth between GDDR and VGPR or SGPR registers multiple times to process simple operations.

Sounds reasonable. I guess I was naively measuring the cost of program change and invocation.

This is my first evaluation of the RFC, concentrating on the use of Halide for
the viewport compositor.

OpenGLCompute Backend Is Immature

As discussed before, the OpenGLCompute backend is one of the more limited
backends in Halide. One of those limitations is the fact that scheduling
vectorization is currently not possible, so inner loops will likely need to be
unrolled instead. This would underutilize the vector units of GPUs, and thus
handwritten shaders would most likely be better. So I enabled vectorization
through a series of patches to upstream Halide, and while that worked for some
shaders, further testing proved that some parts of Halide operate under the
assumption of scalar operations, so more complex shaders started failing. While
I have already fixed all of those issues, this is a sign that the OpenGLCompute
backend is not mature enough to be used in production yet. The backend will
need to be developed and tested independently upstream, and only then can it be
reevaluated and considered again.

SSBO And Images

Currently, Halide uses SSBOs as the backing storage for its inputs and
outputs. This has some implications:

  • This might mean that some pixel packing and unpacking will need to be done, because images will usually be stored in texture objects and not buffers.
  • SSBOs are less portable than images. For instance, the R600 driver in Mesa currently does not work properly with SSBOs but works fine with images.

There are unlikely to be any performance differences, but I haven't
investigated or tested this yet.