
Cycles: Add support for using OSL with OptiX
Closed, Public

Authored by Patrick Mours (pmoursnv) on Sep 6 2022, 10:51 PM.

Details

Summary

This idea has been floating around for some time now, since OSL technically supports generation of PTX that can be consumed by OptiX, so this patch realizes that in Cycles. There were a few challenges to the actual implementation, since the OSL integration in Cycles was rather CPU focused with use of CPU pointers and globals everywhere. As such this patch includes deeper changes to the OSL integration, particularly to how closures, textures and attributes are handled:

  • OSL closures are now generated using macros and a template file (see kernel/osl/closures_template.h, similar to how other data structures in Cycles are generated), so that no code duplication is necessary for CPU/GPU. The setup code for all of them was moved to standalone functions (see kernel/osl/closures_setup.h) and the conversion/flattening of the OSL closure tree to Cycles closures is now handled in a new loop using a stack instead of the previous recursive function using virtual function calls (see kernel/osl/osl.h).
  • Texture handles for the GPU are just SVM texture IDs, on the CPU the existing pointers are used.
  • For attributes I replaced the additional OSL attribute map and instead modified the SVM attribute map slightly so that it can be used with OSL too (instead of using a separately generated ID for lookup, the attribute name hash is used). This makes it easy to use on the GPU (since OSL already handles strings as hashes there) and also reduces memory usage.
  • The render services and shader manager were changed to no longer be shared across multiple Cycles instances. Unfortunately this is necessary, since the render services specify whether OSL should generate PTX or CPU code, so they have to be specific to a device (in case one session uses OSL on a CPU device and another session uses it with OptiX).
  • A few OSL data structure declarations are duplicated in Cycles code (see kernel/osl/types.h), so that kernels can be built without needing the OSL headers.
  • Strings are a bit peculiar. As mentioned, in the PTX that OSL generates they are just hashes, but as of v1.11.17.0 those hashes are referenced through global variables that need to be linked in, rather than being passed around as plain hash values. This is improved in v1.12.*-dev, but for now we are still stuck with v1.11.17.0, so I have to magic some PTX together on the fly for this purpose (see OSLShaderManager::load_kernels in scene/osl.cpp).
  • The implementation currently requires OptiX rather than CUDA on the GPU, but shading is normally done in CUDA. To solve this, a new kernel_osl.ptx file was added that contains all shading kernels for OptiX (and everything else needed for OSL). This also has the advantage that when OSL is not used, none of the extra code needed for OSL has to be compiled (so no change to load times), and similarly, when OSL is used, the SVM kernel code does not need to be compiled.
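
The stack-based flattening mentioned in the first bullet can be illustrated with a minimal host-side sketch. The types, the fixed stack depth and the function name below are hypothetical stand-ins, not the declarations from kernel/osl/types.h or kernel/osl/osl.h; the sketch only shows how multiply and add nodes can be resolved with an explicit stack that carries the accumulated weight, without recursion or virtual calls.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for an OSL-style closure tree.
enum class NodeType { Mul, Add, Component };

struct ClosureNode {
  NodeType type;
  float weight;          // Mul: scale factor, Component: its own weight
  int id;                // Component: closure identifier
  const ClosureNode *a;  // Mul: child, Add: first child
  const ClosureNode *b;  // Add: second child
};

struct FlatClosure {
  int id;
  float weight;
};

// Flatten the tree into a list of weighted components using an explicit stack.
inline std::vector<FlatClosure> flatten_closure_tree(const ClosureNode *root)
{
  struct Entry {
    const ClosureNode *node;
    float weight;
  };
  constexpr std::size_t kMaxDepth = 16;  // assumed bound, chosen for the sketch
  std::array<Entry, kMaxDepth> stack;
  std::size_t top = 0;
  std::vector<FlatClosure> out;

  const ClosureNode *node = root;
  float weight = 1.0f;

  while (true) {
    if (node) {
      switch (node->type) {
        case NodeType::Mul:
          // Fold the factor into the running weight and descend.
          weight *= node->weight;
          node = node->a;
          continue;
        case NodeType::Add:
          // Remember the second child with the current weight, descend into
          // the first. If the stack is full, the second child is dropped.
          if (top < kMaxDepth) {
            stack[top++] = Entry{node->b, weight};
          }
          node = node->a;
          continue;
        case NodeType::Component:
          out.push_back(FlatClosure{node->id, weight * node->weight});
          break;
      }
    }
    if (top == 0) {
      break;
    }
    --top;
    node = stack[top].node;
    weight = stack[top].weight;
  }
  return out;
}
```

For a tree of the form Add(Mul(0.5, A), B), this yields A with half its weight followed by B with its full weight.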

From a user perspective, all of this should be fully transparent. So one can just load a scene using OSL shaders or add OSL scripts into a material graph, enable the "Open Shading Language" checkbox that is now visible when "GPU Compute" is used with OptiX too, and things just work.
There are just a few caveats. Not every OSL built-in intrinsic is implemented for the GPU yet (see services.cu; the full list can be found at https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/v1.11.17.0/src/liboslexec/builtindecl.h). Textures are limited: since OIIO cannot be used on the GPU, they are piped through the SVM system instead when on the GPU (no tiles or fancy interpolation/blurring/...). And tracing rays from OSL is not implemented for OptiX.

I've tried to keep the implementation agnostic of OptiX where possible, so that in theory support for other backends can be added with relative ease in the future, when/if OSL adds support for them. OSL has also made several improvements to its OptiX support since the v1.11.17 release that is currently in use in Blender. But unfortunately there has been no release version of that yet, so I stayed with v1.11.17 for now and just grabbed a few selected fixes and added them to osl.diff (namely https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/commit/b8ee545347b357ff5228ffe529996b06831fcc6a, https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/commit/22105d298202b6908445d7d6c5a59e43b9936128 and changes to make it find Python). There is also the problem that building OSL needs the CUDA and OptiX SDKs for this, which I wasn't quite sure how best to integrate in the dependency build environment, so for now that requires the CUDA_TOOLKIT_ROOT_DIR and OPTIX_ROOT_DIR environment variables to be set. It might make sense to make this optional somehow and only enable OSL OptiX support in Cycles when an OSL build with OptiX support is detected.

Screenshots of OSL and OptiX in action (first showing a simple OSL script node and second that textures are working):

Diff Detail

Repository
rB Blender

Event Timeline


OSL closures are now generated using macros and a template file (see kernel/osl/closures_template.h, similar to how other data structures in Cycles are generated), so that no code duplication is necessary for CPU/GPU. The setup code for all of them was moved to standalone functions (see kernel/osl/closures_setup.h) and the conversion/flattening of the OSL closure tree to Cycles closures is now handled in a new loop using a stack instead of the previous recursive function using virtual function calls (see kernel/osl/osl.h).

Can you submit this as a separate patch? Would be good to land this in master first and make this patch easier to review.

For attributes I replaced the additional OSL attribute map and instead modified the SVM attribute map slightly so that it can be used with OSL too (instead of using a separately generated ID for lookup, the attribute name hash is used). This makes it easy to use on the GPU (since OSL already handles strings as hashes there) and also reduces memory usage.

This looks like a nice simplification. Can you submit this as a separate patch as well?

The render services and shader manager were changed to no longer be shared across multiple Cycles instances. Unfortunately this is necessary, since the render services specify whether OSL should generate PTX or CPU code, so they have to be specific to a device (in case one session uses OSL on a CPU device and another session uses it with OptiX).

What are the implications of this? I guess the texture system is still shared, but I don't remember if the services and shader manager cache something important that we want to reuse between sessions, or if they are lightweight enough that this does not matter.

A few OSL data structure declarations are duplicated in Cycles code (see kernel/osl/types.h), so that kernels can be built without needing the OSL headers.

The static asserts make this not as bad, though Linux distributions compiling with different OSL versions could be a problem. There is a plan for OSL not to assume the memory layout of ShaderGlobals and then we may be able to make it access ShaderData directly instead, which would be more efficient.

For closures it's similarly inefficient, copying first to the closure pool and then converting that to our list of closures. I don't think that has an easy solution and would require changes in OSL to be as efficient as possible, but is interesting to think about how that conversion could be avoided.

But unfortunately there has been no release version of that yet, so stayed with v1.11.17 for now and just grabbed a few selective fixes and added them to osl.diff

A simpler way to do this would be to set the OSL version to a commit hash instead of a released version, like we do for sse2neon for example.

There is also the problem that building OSL needs the CUDA and OptiX SDKs for this

I wonder if OSL itself actually needs these, or if it's just testrender? I see some includes in liboslexec but it isn't clear to me which functions or data structures this is actually using, or if that's a remnant from OptiX 6 support or something.

Can you submit this as a separate patch? Would be good to land this in master first and make this patch easier to review.

https://developer.blender.org/D15917
Tried to isolate this to just the closures, so this does not include some of the other cleanup from this patch (e.g. there are still separate flatten_*_closure_tree recursive functions). Tested this with a few scenes and they render the same as before.

This looks like a nice simplification. Can you submit this as a separate patch as well?

https://developer.blender.org/D15918
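
The attribute change discussed here (looking attributes up by the hash of their name instead of by a separately generated ID) can be sketched roughly as follows. Everything in this block is an assumption made for illustration: the FNV-1a hash is only a stand-in for the string hash OSL uses, and AttributeEntry and find_attribute are hypothetical names, not the SVM attribute map layout in Cycles.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// FNV-1a, used purely as a stand-in; OSL has its own string hash.
inline std::uint64_t hash_name(std::string_view name)
{
  std::uint64_t h = 14695981039346656037ull;
  for (char c : name) {
    h ^= static_cast<unsigned char>(c);
    h *= 1099511628211ull;
  }
  return h;
}

// Hypothetical attribute map entry keyed by the name hash.
struct AttributeEntry {
  std::uint64_t name_hash;
  int offset;  // where the attribute data lives (illustrative only)
};

// Linear search by hash. Returns -1 if the attribute is not present.
inline int find_attribute(const std::vector<AttributeEntry> &map, std::uint64_t name_hash)
{
  for (const AttributeEntry &entry : map) {
    if (entry.name_hash == name_hash) {
      return entry.offset;
    }
  }
  return -1;
}
```

Because only the hash is compared, the same lookup works whether the caller holds a full string (CPU) or just the hash value (GPU).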

What are the implications of this? I guess the texture system is still shared, but I don't remember if the services and shader manager cache something important that we want to reuse between sessions, or if they are lightweight enough that this does not matter.

The shading system could reuse the generated code for the same shaders, but other than that I don't think this has a huge impact on memory usage (based on looking at OSL::pvt::ShadingSystemImpl).

The static asserts make this not as bad, though Linux distributions compiling with different OSL versions could be a problem. There is a plan for OSL not to assume the memory layout of ShaderGlobals and then we may be able to make it access ShaderData directly instead, which would be more efficient.

For closures it's similarly inefficient, copying first to the closure pool and then converting that to our list of closures. I don't think that has an easy solution and would require changes in OSL to be as efficient as possible, but is interesting to think about how that conversion could be avoided.

Yeah. I experimented a bit with creating OSL closures directly in the ShaderData::closure array and changing the ShaderClosure signature to match that of OSL, but OSL::ClosureColor::MUL and OSL::ClosureColor::ADD somewhat complicate things. It was simpler for now to continue with a separate closure pool and do the conversion, since the performance seems to be alright (rendering the bmw scene with OSL vs. SVM with the current implementation is very close in render time).

A simpler way to do this would be to set the OSL version to a commit hash instead of a released version, like we do for sse2neon for example.

I just wasn't sure what the stance would be on using an unreleased OSL version, but taking a recent commit instead would certainly simplify a few things further (e.g. get rid of the awkward PTX string table generation).

I wonder if OSL itself actually needs these, or if it's just testrender? I see some includes in liboslexec but it isn't clear to me which functions or data structures this is actually using, or if that's a remnant from OptiX 6 support or something.

That's a good point. There are a few places in OSL source code where optix.h is still included ... but it's not actually used and builds just fine without those include statements too. So can just patch those out and drop the CUDA/OptiX SDK requirement.

Merged with master branch.

Could you separately submit the change that removes osl/shader.cpp and moves its code into various new and existing files? You could also include #ifdef __SVM__ related changes in that. That would make this patch quite a bit smaller still.

The shading system could reuse the generated code for the same shaders, but other than that I don't think this has a huge impact on memory usage (based on looking at OSL::pvt::ShadingSystemImpl).

Perhaps we could have one shared shading system per device type?

I also wonder if this works with CPU + GPU render now, or if we would need multiple shading systems for that?

I just wasn't sure what the stance would be on using an unreleased OSL version, but taking a recent commit instead would certainly simplify a few things further (e.g. get rid of the awkward PTX string table generation).

Given the release schedule I imagine we can't get this ready for 3.4, so if it's for 3.5 there's probably some release in time, and even if not, an unreleased version is probably better than this string table stuff.

That's a good point. There are a few places in OSL source code where optix.h is still included ... but it's not actually used and builds just fine without those include statements too. So can just patch those out and drop the CUDA/OptiX SDK requirement.

Actually looking a bit closer, not sure we can get rid of the CUDA dependency for compiling bitcode? I'm wondering if for OSL it would make sense to be a bit more GPU compiler agnostic and let the host application build the bitcode, though that might be a major change. It's not impossible for us to compile OSL with CUDA, but for our build pipeline and Linux distributions in the future it's rather complicated.

Removing the OptiX dependency perhaps is easy and could be submitted upstream.

In the next OSL TSC meeting I'll bring up this project, and maybe some of the open questions can be discussed.

Perhaps we could have one shared shading system per device type?

I also wonder if this works with CPU + GPU render now, or if we would need multiple shading systems for that?

Right, CPU + GPU didn't work. I've changed it to use one global shading system per device type now, which works with CPU + GPU. Was hoping to be able to create/free them in ccl::ShaderManager::device_update to avoid having to keep track of the device in the shader manager, but unfortunately that doesn't work with custom OSL scripts that are loaded before that (and need to be loaded into the shading system for every device type). So ccl::ShaderManager now saves which device it was created with.
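
One possible way to picture "one global shading system per device type" is a small registry that hands out a shared instance per type and frees it once the last user is gone. This is a sketch under assumptions, not the code in the patch: DeviceType, ShadingSystem and ShadingSystemRegistry are hypothetical names, and the reference-counted lifetime is only one of several ways to handle create/free.

```cpp
#include <map>
#include <memory>
#include <mutex>

// Hypothetical device types for the sketch.
enum class DeviceType { CPU, OptiX };

// Stand-in for the real OSL shading system object.
struct ShadingSystem {
  DeviceType type;
};

class ShadingSystemRegistry {
 public:
  // Return the shading system for this device type, creating it on first use.
  std::shared_ptr<ShadingSystem> acquire(DeviceType type)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    std::weak_ptr<ShadingSystem> &slot = systems_[type];
    std::shared_ptr<ShadingSystem> system = slot.lock();
    if (!system) {
      system = std::make_shared<ShadingSystem>(ShadingSystem{type});
      slot = system;
    }
    return system;
  }

 private:
  std::mutex mutex_;
  // Weak references, so a system is freed when its last session releases it.
  std::map<DeviceType, std::weak_ptr<ShadingSystem>> systems_;
};
```

Two sessions rendering on the CPU would receive the same instance, while a CPU + OptiX render would hold one instance of each type.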

Actually looking a bit closer, not sure we can get rid of the CUDA dependency for compiling bitcode? I'm wondering if for OSL it would make sense to be a bit more GPU compiler agnostic and let the host application build the bitcode, though that might be a major change. It's not impossible for us to compile OSL with CUDA, but for our build pipeline and Linux distributions in the future it's rather complicated.

There is -nocudainc and -nocudalib for Clang to generate bitcode without the CUDA toolkit. But yeah, as of right now this doesn't build because some of the dependencies OSL pulls in are using CUDA headers (e.g. Imath including cuda_fp16.h). This might be unnecessary for the parts that are compiled to bitcode however, so core OSL could potentially be made independent from a CUDA toolkit too with some header shuffling. Still need to investigate further.


  • Fixed OptiX PTX input error for materials with special characters in the name (they are invalid in the direct callable function names OSL generates, so need to filter them out)
  • Updated to newer OSL version which has improved GPU string support (couldn't update to top of tree though, since that introduced severe rendering artifacts, see comment in versions.cmake)
  • Improved OSL kernel loading times when using OptiX 7.5 by using multi-threading to compile them
  • Added support for simultaneous CPU and OptiX with OSL
  • Fixed crash due to missing shader eval kernels for OptiX with OSL
  • Fixed funky behavior in some scenes due to the OSL group data array being too small, causing data to be written into the adjacent closure pool (and added an error message for cases where it is still too small, so that it's obvious; but this does require an OSL modification to be able to retrieve the necessary size of that array, see groupdata_size patch in osl.diff)
Patrick Mours (pmoursnv) updated this revision to Diff 56295. Edited Sep 29 2022, 7:19 PM
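
The first fix in the list above filters characters out of material names before they end up in generated function names. A minimal sketch of such a filter is shown below; the function name and the exact set of characters kept (ASCII letters, digits and underscore) are assumptions for illustration, not what the patch necessarily does.

```cpp
#include <string>

// Keep only characters assumed to be safe in a generated function name.
// The allowed set here (ASCII letters, digits, '_') is an assumption.
inline std::string filter_identifier(const std::string &name)
{
  std::string out;
  out.reserve(name.size());
  for (char c : name) {
    const bool is_letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    const bool is_digit = (c >= '0' && c <= '9');
    if (is_letter || is_digit || c == '_') {
      out.push_back(c);
    }
  }
  return out;
}
```

Note that dropping characters can map two different material names to the same result, so a real implementation would also need the generated names to stay unique by some other means.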

Building CUDA LLVM bitcode without the CUDA toolkit is not going to work. I managed to get it to compile, with the use of internal Clang headers to replace the toolkit ones and some forward declarations and hackery, but the output is not usable without linking it against the device bitcode library of the CUDA toolkit (libdevice.X.bc). And that can't be bypassed since OSL makes heavy use of the standard library (which is kinda the point of the OSL bitcode module, as it encapsulates that).

But then I realized, there is really no need to actually build that module ... All it does is provide some common OSL built-ins, that can just as well be moved into Cycles (since many had to be provided already anyway). And in Cycles we have access to the CUDA toolkit to build, so all is good. Even better, in the latest OSL source drop used, the OptiX specific code parts are included always, they are not technically hidden behind a build option anymore (even though the CMake still makes one think that). There is just one caveat: some parts of the code made assumptions about an existing OSL bitcode module and didn't work without, but that can be fixed with a simple patch of 4 lines.
As a result, the necessary changes to the OSL dependency build CMake are reduced to zero (can leave the OptiX build option off, so it's not looking for a CUDA toolkit or OptiX SDK), just the patch file got a few lines extra. And things should work when somebody uses their own OSL build that has an OSL bitcode module as well, since OSL will simply use the built-ins from that (they are declared with private linkage), instead of the ones provided by Cycles.
Drawback is that services.cu got more messy, with it now containing everything from standard library built-ins, over noise functions, to the stuff from before like closure allocation. And there are some behavioral differences, since e.g. the noise implementation in Cycles doesn't exactly match that of OSL.

Fixed weight calculation in closure stack flattening and cleaned up code

Brecht Van Lommel (brecht) requested changes to this revision. Sep 30 2022, 7:52 PM

Thanks for figuring out a solution to this. Perhaps that additional code in services.cu could eventually be provided in a header by OSL, but not really sure about all the implications of that. Maybe moving all the things like noise into headers might not be wanted, will need to check.

I made the noise implementation in SVM match OSL at some point, so I was expecting them to match. But maybe there were changes so that this is no longer the case.

build_files/build_environment/patches/osl.diff
78–105 ↗(On Diff #56337)

Are these new changes something we could contribute upstream?

I should get around to contributing some of the other changes we have.

intern/cycles/device/multi/device.cpp
216–221

I'd rather not make assumptions about the order of the devices, better to loop over the devices here to find the CPU device.

intern/cycles/device/optix/queue.cpp
31

We should add utility functions like device_kernel_has_shading and device_kernel_has_intersection in device/kernel.h to deduplicate these kinds of checks.

Ideally things like the construction of group_descs, PG_RGEN indexes, could be done without every device backend hardcoding all the kernel names, but that's probably beyond the scope of this patch.
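
The suggested helpers could look roughly like the sketch below. The function names come from the comment above, but the enum is a hypothetical, reduced stand-in for the kernel list in device/kernel.h, and which kernels count as shading or intersection here is an assumption.

```cpp
// Hypothetical subset of kernels, for illustration only.
enum DeviceKernel {
  KERNEL_INTERSECT_CLOSEST,
  KERNEL_INTERSECT_SHADOW,
  KERNEL_SHADE_BACKGROUND,
  KERNEL_SHADE_SURFACE,
  KERNEL_SHADE_SURFACE_RAYTRACE,
  KERNEL_SHADE_VOLUME,
};

// True for kernels that evaluate shaders.
inline bool device_kernel_has_shading(DeviceKernel kernel)
{
  switch (kernel) {
    case KERNEL_SHADE_BACKGROUND:
    case KERNEL_SHADE_SURFACE:
    case KERNEL_SHADE_SURFACE_RAYTRACE:
    case KERNEL_SHADE_VOLUME:
      return true;
    default:
      return false;
  }
}

// True for kernels that trace rays. The raytrace shading kernel is assumed
// to belong to both groups.
inline bool device_kernel_has_intersection(DeviceKernel kernel)
{
  switch (kernel) {
    case KERNEL_INTERSECT_CLOSEST:
    case KERNEL_INTERSECT_SHADOW:
    case KERNEL_SHADE_SURFACE_RAYTRACE:
      return true;
    default:
      return false;
  }
}
```

Device backends can then branch on these two predicates instead of repeating the list of kernel names at every call site.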

intern/cycles/kernel/osl/services.cu
1 ↗(On Diff #56337)

It's not immediately going to be needed, but it would be good to make the code in this file not CUDA specific.

This file could be named services_gpu.h, and extern "C" __device__ could be replaced with something like ccl_device_osl.

intern/cycles/scene/osl.cpp
202

The OptiX specific part of this should move into the device, though not exactly sure what the device API for that should look like.

This revision now requires changes to proceed. Sep 30 2022, 7:52 PM

Regarding noise implementation, I spoke too soon, scalar noise does match that of OSL perfectly. Just vector noise does not, since OSL splits up the hash used in the noise into 3 components (https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/include/OSL/oslnoise.h#L1101), but AFAICT Cycles doesn't currently have an implementation of that (only has scalar noise).

build_files/build_environment/patches/osl.diff
78–105 ↗(On Diff #56337)

They could be useful for other implementations, so would be useful to contribute upstream I think, yeah.

intern/cycles/device/multi/device.cpp
216–221

The order is technically guaranteed by the implementation, which always adds CPU devices at the back: https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/multi/device.cpp$41
But can still replace this with a reverse loop lookup.
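
The reverse loop lookup mentioned here might look like the following sketch. SubDevice, DeviceType and find_cpu_device are hypothetical names for illustration; the real multi-device code stores more per-device state.

```cpp
#include <vector>

// Hypothetical stand-ins for the multi-device bookkeeping.
enum class DeviceType { CPU, OptiX };

struct SubDevice {
  DeviceType type;
  int id;
};

// Search from the back, since CPU devices are added last, but do not rely on
// that order for correctness. Returns nullptr if there is no CPU device.
inline const SubDevice *find_cpu_device(const std::vector<SubDevice> &devices)
{
  for (auto it = devices.rbegin(); it != devices.rend(); ++it) {
    if (it->type == DeviceType::CPU) {
      return &*it;
    }
  }
  return nullptr;
}
```

Starting at the back keeps the common case cheap while still finding the CPU device if the ordering ever changes.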

  • Added helper functions to determine if kernel does shading or intersection
  • Moved OptiX specific OSL kernel loading code into the OptiX device implementation
  • Added services_gpu.h header and ccl_device_extern to generalize OSL services for GPU (+ ccl_private everywhere)
Patrick Mours (pmoursnv) marked 3 inline comments as done. Oct 4 2022, 4:18 PM

Thanks for the updates.

Regarding landing this in master, it will be easiest if the code can still build with the old OSL libraries. That way we can add the precompiled libs for each platform afterwards, and, for example, Linux distributions can potentially build against an older OSL.

I think this would mainly involve making the oslnoise lib optional, and adding some #ifdefs. Or maybe OptiX support can even be queried at runtime from OSL, not sure.

This could land in master for Blender 3.5 after October 26, since there's probably too many risky changes for 3.4 even if the feature is disabled.

Brecht Van Lommel (brecht) requested changes to this revision. Oct 4 2022, 7:49 PM
This revision now requires changes to proceed. Oct 4 2022, 7:49 PM

Changed build to support older OSL versions still and hide checkbox for those when OptiX is active

Some build issues to be resolved:
https://builder.blender.org/admin/#/builders/136/builds/192

But may be good to rebase on master first since there may be additional errors from that.

Rebased on master and fixed build issues without OSL or older OptiX SDKs

Fixed missing "services_gpu.h" in CMake

Fixed some compiler warnings

Added support for standard object attributes and fixed various issues with GPU OSL services implementation (classroom scene now renders correctly too, apart from texture filtering differences).

This looks ready for master.

For committing this, we don't want to change build_files/build_environment immediately, but rather do it as part of a bigger library update we are planning (see T99618). That part of the patch should be left out of the commit to master, and instead committed to the tmp-vfx-platform-2023 branch.

I can commit the changes to tmp-vfx-platform-2023 if you want, up to you if you want to do it yourself.

The commit message should clearly describe that while the implementation is being committed to continue development in master, it's still incomplete and waiting for a new OSL library to enable it.

This revision is now accepted and ready to land. Nov 9 2022, 1:28 PM

@Brecht Van Lommel (brecht) If you don't mind, I'd prefer if you could quickly commit the library changes to tmp-vfx-platform-2023 =)

Kicked off a build for the latest state of this patch:
https://builder.blender.org/admin/#/builders/136/builds/208

I tried to download this build off the page but the checkbox for "Open Shading Language" is not visible.

I also downloaded the latest master build I could find, same issue.

I made sure I was using optix. Nothing was showing up.

For OptiX support, the OSL library has to be updated, which has not been done yet. The builds were mostly to make sure compilation is not broken with older OSL versions.

This I understand, but is there a way for me to test this on my end? I'm not worried about instability or anything, I'm more just curious about how fast basic loops / logic / math are (which are notoriously slow as hell using node groups, as we don't have proper logic nodes).

For OptiX support, the OSL library has to be updated, which has not been done yet. The builds were mostly to make sure compilation is not broken with older OSL versions.

Can you show me how I can test this on my end? I don't know if you missed my previous comment. It's been over a month.

I wanted to try this out on my machine but the build doesn't have any of these features available.

I mean, the person who initially made this patch obviously had it running, so is there a way for me to do so as well?

It's been updated and enabled in the daily 3.5.0 alpha builds on builder.blender.org.