This is not an actual patch and the code should merely be considered
supplementary to the RFC for demonstrative and investigative purposes.
Motivation
As part of the Realtime Viewport Compositor project (T91293), it is required to
rewrite most of the compositor operations in GLSL (T91296) for utilization by
the realtime compositor engine. Such a large undertaking raises the question of
whether it is necessary and whether an alternative solution exists. Furthermore,
similar requirements appear in various projects that would also require writing
different variants of the same compositor operations. Namely, the compositor
improvement plan (T74491) considers SIMD vectorization of operations and the
possibility of having GPU kernels for all operations. Other modules also share
similar requirements (T65720). Presented solutions (T81650) for consolidating
and unifying all of those requirements usually involve manual techniques such as
the ones utilized by Cycles' kernel implementation.
This RFC proposes and evaluates Halide as a possible alternative.
Halide
Halide (https://halide-lang.org/) is a framework for writing highly efficient
and portable image processing routines. It can target many ISAs and compute APIs
including x86, ARM, OpenCL, OpenGL, and more. Notably, it decouples the image
processing algorithm from its schedule, meaning that the distribution of work
over the domain of the image is independently defined from the actual
computations performed on the pixels. This can be seen in the code below.
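The actual Gamma generator lives in the supplementary code; as a rough conceptual stand-in (plain C++ rather than Halide syntax, with hypothetical names), the separation looks like this: the per-pixel algorithm is one function, and the loop nest that schedules it is another, so the traversal strategy can change without touching the math.

```cpp
#include <cmath>
#include <vector>

// Algorithm: defines what each output pixel is, with no notion of traversal
// order, tiling, vectorization, or threading.
static float gamma_at(const std::vector<float> &in, int width, int x, int y, float g)
{
  return std::pow(in[y * width + x], g);
}

// Schedule: defines how the domain is traversed. Halide lets this part be
// swapped out (tiled, vectorized, parallelized) without changing gamma_at.
static void run_gamma(const std::vector<float> &in, std::vector<float> &out,
                      int width, int height, float g)
{
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      out[y * width + x] = gamma_at(in, width, x, y, g);
    }
  }
}
```

In Halide proper, the algorithm is a Func definition and the schedule is a set of directives on that Func, but the division of responsibility is the same.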
A Halide Generator is an object that defines an image processing operation, its
schedule, its inputs and outputs, and how it can be compiled. The generator can
then be compiled, producing a static library containing a routine that executes
the generator given its inputs and outputs, as well as a header file that declares
that routine. For instance, the Gamma generator below produces the following
header:
int gamma(struct halide_buffer_t *input_buffer,
          struct halide_buffer_t *gamma_buffer,
          struct halide_buffer_t *output_buffer);
That's the only entry function that needs to be called from user code; the user
just constructs the right buffers and calls this function.
Each of the following sections considers Halide from the perspective of a certain
application.
Halide For The Compositor
Vectorization
One of the shortcomings of the current compositor implementation is its
underutilization of SIMD vectorization. One could write multiple implementations and
choose the right one at runtime through a CPU dispatcher, but this is too much
work and unnecessary in most cases. Alternatively, one can use SIMD-backed
vector types for doing computations. Such vectorization can happen pixel-wise,
possibly to ease consolidation with GPU kernels. However, this has the
disadvantage of using narrower vector lanes in some cases even when the
hardware supports wider vectors, which is a problem we faced when vectorizing
CPU kernels in Cycles. The other approach is to vectorize along the minor axis
of the image, using vectors that are as wide as possible, which can complicate
the code due to its many edge cases and the interleaved structure of the pixels.
So there is no unified and easy way to implement vectorization without a lot of
work and variation.
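A minimal sketch of the pixel-wise approach (a hypothetical float4 type; a real implementation would back it with SSE or NEON intrinsics or compiler vector extensions): one RGBA pixel occupies one 4-wide vector, which is precisely why lanes sit idle on 8- or 16-wide hardware.

```cpp
#include <cmath>

// Hypothetical SIMD-backed pixel type; real code would implement this with
// intrinsics. The layout choice is one RGBA pixel = one 4-wide vector.
struct float4 {
  float v[4];
};

// Pixel-wise gamma: all channels of one pixel are processed as "one" vector
// operation. On AVX2 (8 lanes) or AVX-512 (16 lanes), half or more of the
// hardware lanes would go unused with this layout.
static float4 gamma_pixel(const float4 &pixel, float g)
{
  float4 r;
  for (int i = 0; i < 4; ++i) {
    r.v[i] = std::pow(pixel.v[i], g);
  }
  return r;
}
```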
Halide on the other hand makes vectorization very easy. When compiling the
generator, one can specify multiple targets as follows:
add_halide_library(gamma FROM generatorsExecutable
                   TARGETS x86-64-linux-avx2 x86-64-linux-avx x86-64-linux-sse41 x86-64-linux)
Halide would then compile a variant for each target and link it with a wrapper
that dispatches the most suitable variant at runtime based on the CPU features.
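The generated wrapper behaves roughly like the hand-written dispatcher below (a sketch: a single scalar body stands in for the per-target variants, and __builtin_cpu_supports is the GCC/Clang way to query CPU features on x86):

```cpp
#include <cmath>
#include <cstddef>

using GammaFn = void (*)(const float *in, float g, float *out, size_t n);

// Baseline variant. A real dispatcher would have one such body per target
// (SSE4.1, AVX, AVX2, ...), each compiled with the matching -m flags.
static void gamma_baseline(const float *in, float g, float *out, size_t n)
{
  for (size_t i = 0; i < n; ++i) {
    out[i] = std::pow(in[i], g);
  }
}

// Pick the best variant the running CPU supports, falling back to baseline.
static GammaFn select_gamma()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2")) {
    return gamma_baseline;  // a real build would return a gamma_avx2 here
  }
#endif
  return gamma_baseline;
}
```

Halide generates this dispatch automatically for the listed targets, so none of this boilerplate has to be written or maintained by hand.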
Nothing changes as far as the user is concerned. In terms of defining the
vectorization, the Gamma generator is scheduled as follows:
const int vector_size = natural_vector_size(output.type());
if (vector_size == 4) {
  output.vectorize(c, vector_size);
}
else {
  output.vectorize(x, vector_size);
  output.unroll(c);
}
If the largest hardware vector size is 4, we vectorize pixel-wise along the
color channels to avoid all the issues that come with vectorization along the
minor axis. However, if the hardware has a vector size wider than 4, we exploit
that by vectorizing along the minor axis of the schedule and unrolling the color
channel loop. So, as demonstrated, vectorization in Halide is very easy and
effortlessly specializable.
Parallelization
As you might have noticed, we also utilize threaded parallel execution by
parallelizing the major axis as follows:
output.parallel(y);
So the compositor engine needn't parallelize execution directly. It can simply
delegate parallelism to the generator to parallelize in whichever way it sees
fit, thus simplifying the engine and improving modularity.
Calling The Generator
The generator is called by first wrapping the MemoryBuffer as a Halide buffer.
For a full frame buffer, the wrapping happens as follows:
HalideBuffer as_halide_buffer()
{
  switch (datatype_) {
    case DataType::Value:
      return HalideBuffer(buffer_, get_width(), get_height());
    case DataType::Vector:
    case DataType::Color:
      return HalideBuffer::make_interleaved(buffer_, get_width(), get_height(), num_channels_);
  }
}
And called as follows:
halide::gamma(input_color->as_halide_buffer(),
              input_gamma->as_halide_buffer(),
              output->as_halide_buffer());
The supplementary code inflates all inputs for simplicity, but an actual
implementation will need to handle the case for a single element, either using
a compile-time specialization, an appropriate boundary condition on a singular
buffer, or some other method.
GPU Acceleration
Another shortcoming of the current compositor is that it is mostly CPU bound
aside from some OpenCL accelerated operations. And even those accelerated
operations are mostly unused because one has to pay for the expensive memory
round trip between the host and GPU memory. Consequently, all operations need
to be GPU accelerated for any apparent benefit. As mentioned before, to avoid
rewriting all operations from scratch, an implementation can be consolidated
with the existing implementation, similar to how Cycles' kernels are defined. Cycles
employs a number of preprocessor tricks to ensure compatibility with different
targets. While this approach arguably works for Cycles, it may not necessarily
work for the case in question, perhaps due to GLSL being one of the desired
targets. Moreover, an observer of the current state of Cycles would identify
this approach as something that should be avoided if possible.
Halide can compile generators to a number of compute API targets including
OpenCL, OpenGL, CUDA, Metal, and DirectX. It also provides the necessary
runtime for most of them. So offloading work to the GPU can easily be done by
calling the exact same function mentioned before, albeit with some care for the
state of the input buffers. The next section considers the OpenGL target as one
such target, but the same could be said about the rest of the targets thanks to
the unified interface provided by Halide.
Halide For The Viewport Compositor
The Realtime Viewport Compositor project (T91293) requires the reimplementation
of all compositor operations in GLSL to be utilized by the OpenGL backed
compositor engine, which is yet to be implemented. The plan is to dynamically
construct a number of fragment shaders that call the GLSL implementations of
the compositor operations, similar to how shader node trees are implemented. Of
course, dynamic shaders are only concerned with pixel-wise operations and all
other operations would have to be implemented and invoked independently,
possibly invoking multiple shaders per operation.
The alternative is to let Halide generate the shaders and run them, which is
described in the following sections.
Halide OpenGL Backend
Halide can compile the generator into a GLSL compute shader and provide the
necessary runtime to invoke it. This is readily done by just targeting the
OpenGLCompute backend:
add_halide_library(gamma FROM generatorsExecutable
                   TARGETS host-openglcompute)
GPU targets will almost always need to be scheduled differently though, and
that's why we schedule the generator on the GPU as follows:
output.vectorize(c, natural_vector_size(output.type()));
output.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::RoundUp);
We first vectorize along the color channels to utilize the SIMD units of the
GPU, then we split the work domain into workgroups and local invocations to map
into the corresponding OpenGL constructs.
As mentioned before, calling the shader is identical to how you would call the
CPU variant. The only difference is that the input buffers will wrap an
OpenGL object and be assigned an appropriate device interface, so the buffers would
be constructed as follows:
#include <HalideRuntimeOpenGLCompute.h>

using HalideBuffer = Halide::Runtime::Buffer<float>;

HalideBuffer buffer = HalideBuffer::make_interleaved(nullptr, width, height, channelsCount);
buffer.device_wrap_native(halide_openglcompute_device_interface(), name);
Where name is the name of the OpenGL buffer containing the data. Halide
automatically inserts the necessary memory barriers, so invocations can be
chained in order to achieve the function of the compositor.
The OpenGL backend is one of the more limited backends in Halide, but for the
purpose of standard image processing, those limitations don't really matter. For
the last couple of days, I have been submitting patches to Halide in order to
fix, optimize, and relax limitations within the OpenGL backend, so there is
still room for improvements, and most importantly, upstream contributions.
Multi-Stage Generators
As mentioned before, some image processing operations require two or more stages
in order to be implemented correctly or efficiently. By stage, we mean a shader
invocation with a coherent memory access to the buffers. And this is especially
hard to get right when implemented manually. Halide, on the other hand, supports
multi-stage generators transparently, so one could write a generator that compiles to
two or more shaders. Moreover, Halide can make smart decisions about the
necessary allocations across invocations producing more efficient code than a
standard scheduling.
Halide As A Dependency
Halide only requires LLVM and a C++14-capable compiler, both of which are
already dependencies of Blender, so only libHalide would be gained as a
compile-time dependency. It is important to note that Blender wouldn't have a
run-time dependency on libHalide: the libraries produced by Halide are not
linked to libHalide, and all the necessary runtime is baked into the produced
libraries and the header files included in the user code.
In terms of maintainability, Halide is largely developed by Google and Adobe.
Adobe uses Halide in the development of software like Photoshop, and Google
uses it for image processing tasks in its phone lineup. So it is safe to
assume Halide will be maintained in the future and it would count as a solid
dependency.
Halide In Development
As demonstrated before, Halide generators have a simple interface and very few
requirements. Consequently, for existing code, operations can be reimplemented
as Halide generators incrementally without affecting other operations. And more
importantly, if one decides that an operation is better implemented manually,
that can be accommodated just by matching the interface. For instance, one might
realize that the Halide FFT implementation on the CPU is less efficient than the
implementation provided in the FFTW3 library, in which case, it would make sense
to use the FFTW3 library instead, and one would be able to do that readily even
if all other operations are implemented as Halide generators.
Halide For Developers
Writing Halide generators can often be unfamiliar to new developers, especially
the schedule part, which is something that should be considered. Realistically,
however, the schedule will almost always be the same for the same class of
operations. For instance, all pixel-wise operations will likely use the same
schedule, so a new developer writing a new pixel-wise operation will not have to
interact with the scheduling part of the generator. So I suspect this will not
be much of a problem with sufficient abstraction.
Halide For General Processing
Though perhaps not apparent or advertised, Halide can also be used in applications
other than image processing. For instance, it can be used in mesh processing or
noise evaluation to produce trivially vectorized and threaded computations,
perhaps even offloaded to accelerators if proven advantageous. But this is merely
a plus and should not be taken as a goal in the RFC.
Conclusion
There is a growing demand for a portable, efficient, and multi-purpose image
processing framework implementation in Blender. Halide is an all-in-one solution
to those requirements and I think it is worth serious consideration. An
integration in Blender would help existing functionalities, allow new
functionalities, and improve the general maintainability of the relevant modules
in Blender.